I would like to know whether it is possible to detect outliers in categorical data. For example, suppose I have the list:
l = ['fresh','good','nice','great','amazing','bad','tasty','funny','dirty']
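The outliers in this list would be the strings 'bad' and 'dirty'.
More generally, if I have 10,000 rows of data and each row is a sentence of 2 to 10 words, can I use the Isolation Forest algorithm to detect outliers? Which encoder should I use? Also, when I try to run my code I get an error on this line: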
input_par = encoder.transform([val])
Here is the sample code I wrote:
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
l = ['fresh','good','nice','great','amazing','bad','tasty','funny','dirty',
     'I love you', 'you are great', 'this is wrong', 'terrible']
df = pd.DataFrame({'my text': l})
x = df
# Transform the features
encoder = ColumnTransformer(transformers=[('onehot', OneHotEncoder(), ['my text'])], remainder='passthrough')
x = encoder.fit_transform(x)
isolation_forest = IsolationForest(contamination = 'auto', behaviour = 'new')
model = isolation_forest.fit(x)
list_of_val = ['i love this', 'catastrophic', 'no']
for val in list_of_val:
    input_par = encoder.transform([val])
    outlier = model.predict(input_par)
    #print(outlier)
    if outlier[0] == -1:
        print('Values', val, 'are outliers')
    else:
        print('Values', val, 'are not outliers')
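
For comparison, here is a minimal sketch of one encoding that might fit short sentences better than one-hot encoding. It is only an assumption on my part, not something I have confirmed is the right approach: it swaps the ColumnTransformer/OneHotEncoder step for a TfidfVectorizer over character n-grams and wraps it with the IsolationForest in a pipeline, so new strings are passed in as a plain list rather than through encoder.transform. The `analyzer`, `ngram_range`, and `random_state` values are illustrative choices. As far as I understand, this would flag strings with unusual character patterns, not necessarily strings with a negative meaning such as 'bad' or 'dirty'.

```python
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

train_texts = ['fresh', 'good', 'nice', 'great', 'amazing', 'bad', 'tasty',
               'funny', 'dirty', 'I love you', 'you are great',
               'this is wrong', 'terrible']
new_texts = ['i love this', 'catastrophic', 'no']

# Assumption: character n-grams are used so that unseen words can still
# share features with the training data (one-hot encoding cannot do this).
pipeline = make_pipeline(
    TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3)),
    IsolationForest(contamination='auto', random_state=0),
)

# The pipeline takes raw strings; vectorizing happens inside fit/predict.
pipeline.fit(train_texts)
predictions = pipeline.predict(new_texts)  # -1 = outlier, 1 = inlier

for text, label in zip(new_texts, predictions):
    status = 'an outlier' if label == -1 else 'not an outlier'
    print(f'{text!r} is {status}')
```

With only 13 training strings the labels are not meaningful; the point is just that fitting and predicting run end to end on raw text.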