I would like to know whether it is possible to detect outliers in categorical data. For example, suppose I have the list:

l = ['fresh','good','nice','great','amazing','bad','tasty','funny','dirty']

The outliers in this list would be the strings 'bad' and 'dirty'.

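For example, if I have 10,000 rows of data and each row is a sentence of 2 to 10 words, can the outliers be detected with the Isolation Forest algorithm? Which encoder should I use? Also, when I try to run my code I get an error on this line: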

    input_par = encoder.transform([val])

Here is the sample code I wrote:

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler

l = ['fresh','good','nice','great','amazing','bad','tasty','funny','dirty',
'I love you', 'you are great', 'this is wrong', 'terrible']

df = pd.DataFrame({'my text': l})
x = df

# Transform the features
encoder = ColumnTransformer(transformers=[('onehot', OneHotEncoder(), ['my text'])], remainder='passthrough')
x = encoder.fit_transform(x)

isolation_forest = IsolationForest(contamination = 'auto', behaviour = 'new')
model = isolation_forest.fit(x)

list_of_val = ['i love this', 'catastrophic', 'no']

for val in list_of_val:

    input_par = encoder.transform([val])

    outlier = model.predict(input_par)
    #print(outlier)

    if outlier[0] == -1:
        print('Values', val, 'are outliers')

    else:
        print('Values', val, 'are not outliers')
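
For reference, here is a minimal sketch of how the failing `encoder.transform([val])` call might be made to run. It rests on two assumptions that I have not confirmed against the actual traceback: that the `ColumnTransformer` needs a DataFrame with a `'my text'` column rather than a bare list, and that `OneHotEncoder` rejects strings it did not see during fitting unless `handle_unknown='ignore'` is set. The `behaviour` argument is left out because recent scikit-learn releases no longer accept it. The second half swaps in `TfidfVectorizer` with character n-grams as one possible encoder for short sentences; the `char_wb` analyzer and the `(2, 3)` n-gram range are illustrative choices, not tuned values.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

train_texts = ['fresh', 'good', 'nice', 'great', 'amazing', 'bad', 'tasty',
               'funny', 'dirty', 'I love you', 'you are great',
               'this is wrong', 'terrible']
new_texts = ['i love this', 'catastrophic', 'no']

# --- Variant 1: keep the one-hot encoder, fix the transform call ---
df = pd.DataFrame({'my text': train_texts})
encoder = ColumnTransformer(
    # handle_unknown='ignore' encodes unseen strings as an all-zero row
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), ['my text'])],
    remainder='passthrough',
)
x = encoder.fit_transform(df)
onehot_model = IsolationForest(contamination='auto', random_state=0).fit(x)

onehot_results = {}
for val in new_texts:
    # The transformer selects columns by name, so wrap the value in a DataFrame
    row = pd.DataFrame({'my text': [val]})
    onehot_results[val] = int(onehot_model.predict(encoder.transform(row))[0])

# --- Variant 2: encode the text itself with TF-IDF over character n-grams ---
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3))
x_tfidf = vectorizer.fit_transform(train_texts)
tfidf_model = IsolationForest(contamination='auto', random_state=0).fit(x_tfidf)

tfidf_results = {
    val: int(tfidf_model.predict(vectorizer.transform([val]))[0])
    for val in new_texts
}

for name, results in [('one-hot', onehot_results), ('tf-idf', tfidf_results)]:
    for val, label in results.items():
        status = 'an outlier' if label == -1 else 'not an outlier'
        print(f"[{name}] '{val}' is {status}")
```

With one-hot encoding, each full string is treated as an opaque category, so every unseen input maps to the same all-zero row and gets the same prediction. That is why a text vectorizer may be a better fit for sentences, although neither variant knows anything about sentiment, so it would not by itself single out 'bad' or 'dirty'.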