python - Anomaly detection on categorical data using Isolation Forest

Question

I would like to know whether it is possible to detect outliers in categorical data. For example, given the list:

l = ['fresh','good','nice','great','amazing','bad','tasty','funny','dirty']

the outliers in this list would be the strings 'bad' and 'dirty'.

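For example, if I have 10,000 rows of data and each row is a sentence of 2 to 10 words, can the Isolation Forest algorithm be used to detect outliers? Which encoder should I use? Also, when I try to run my code I get an error on this line: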

    input_par = encoder.transform([val])

Here is the sample code I wrote:

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler

l = ['fresh','good','nice','great','amazing','bad','tasty','funny','dirty',
'I love you', 'you are great', 'this is wrong', 'terrible']

df = pd.DataFrame({'my text': l})
x = df

# Transform the features
encoder = ColumnTransformer(transformers=[('onehot', OneHotEncoder(), ['my text'])], remainder='passthrough')
x = encoder.fit_transform(x)

isolation_forest = IsolationForest(contamination = 'auto', behaviour = 'new')
model = isolation_forest.fit(x)

list_of_val = ['i love this', 'catastrophic', 'no']

for val in list_of_val:

    input_par = encoder.transform([val])

    outlier = model.predict(input_par)
    #print(outlier)

    if outlier[0] == -1:
        print('Values', val, 'are outliers')

    else:
        print('Values', val, 'are not outliers')
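A likely cause of the error, as far as can be told from the code alone: the ColumnTransformer selects the column by name ('my text'), so transform() expects a DataFrame with that column, not a plain list such as [val]. In addition, OneHotEncoder rejects categories it did not see during fit unless handle_unknown='ignore' is set, and recent scikit-learn releases no longer accept the behaviour argument of IsolationForest. The sketch below is one possible repair, not a definitive answer. It first fixes the one-hot version, then shows a TF-IDF pipeline as an alternative encoder for short sentences. The character n-gram settings, random_state=0, and the variable names are assumptions chosen for illustration. Note that either encoding only measures how unusual the text looks, so it will not necessarily flag words like 'bad' for their meaning.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

train_texts = ['fresh', 'good', 'nice', 'great', 'amazing', 'bad', 'tasty',
               'funny', 'dirty', 'I love you', 'you are great',
               'this is wrong', 'terrible']
new_texts = ['i love this', 'catastrophic', 'no']

# Variant 1: the original one-hot approach, repaired.
# handle_unknown='ignore' encodes unseen strings as an all-zero row
# instead of raising an error.
df = pd.DataFrame({'my text': train_texts})
encoder = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), ['my text'])]
)
onehot_model = IsolationForest(contamination='auto', random_state=0)
onehot_model.fit(encoder.fit_transform(df))

# transform() needs a DataFrame with the same column name used in fit.
new_df = pd.DataFrame({'my text': new_texts})
onehot_pred = onehot_model.predict(encoder.transform(new_df))

# Variant 2 (assumption: character n-gram TF-IDF as the encoder).
# Unlike one-hot on whole strings, this gives unseen sentences
# different feature vectors, so they can receive different scores.
tfidf_model = make_pipeline(
    TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4)),
    IsolationForest(contamination='auto', random_state=0),
)
tfidf_model.fit(train_texts)
tfidf_pred = tfidf_model.predict(new_texts)

# predict() returns -1 for outliers and 1 for inliers.
for text, label in zip(new_texts, tfidf_pred):
    status = 'an outlier' if label == -1 else 'not an outlier'
    print(repr(text), 'is', status)
```

With one-hot encoding of whole strings, every unseen sentence maps to the same all-zero vector, so all new inputs receive the same prediction. That makes it a poor fit for free-text rows, which is why a text vectorizer is sketched as the second variant.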
