python - Any preprocessing reduces accuracy

Question

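I am running a grid-search cross-validation with a logistic regression model. I first have my default model, and then a model that is supposed to preprocess the data. The data are random text documents, each belonging to one of 4 categories. My preprocessor seems to lower my accuracy and F1 score even when I just have it return the data unchanged, as shown below. Grid search selects the regularization parameter C after the data has passed through this preprocessor, which should not be doing anything.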

import re

from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Baseline: grid search over C on the features from the default vectorizer.
Cs = {'C': [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}
gs_clf_LR = GridSearchCV(LogisticRegression(penalty='l2'), Cs, refit=True)
gs_clf_LR.fit(transformed_train_data, train_labels)
preds = gs_clf_LR.predict(transformed_dev_data)
#print(gs_clf_LR.score(transformed_dev_data, dev_labels))
print(gs_clf_LR.best_params_)
# Note: best_score_ is the mean cross-validated score on the training data,
# not the accuracy on the dev set.
print('With optimal C, accuracy score is: ', gs_clf_LR.best_score_)
print('f1 score: ', metrics.f1_score(dev_labels, preds, average='weighted'))
print(metrics.classification_report(dev_labels, preds))
print()

def better_preprocessor(string):
    #return re.sub(r'^[A-Z]', '^[a-z]', string)
    #return re.sub(r'(ing)$', '', string)
    #return re.sub(r'(es)$', '', string)
    #return re.sub(r's$', '', string)
    #return re.sub(r'(ed)$', '', string)
    return string


# Same pipeline, but with the custom preprocessor plugged into the vectorizer.
vec = CountVectorizer(preprocessor=better_preprocessor)
transformed_preprocessed_train_data = vec.fit_transform(train_data)
transformed_preprocessed_dev_data = vec.transform(dev_data)

gs_clf_LR.fit(transformed_preprocessed_train_data, train_labels)
preds_pp = gs_clf_LR.predict(transformed_preprocessed_dev_data)
#print(gs_clf_LR.score(transformed_preprocessed_dev_data, dev_labels))
print(gs_clf_LR.best_params_)
print('With optimal C, accuracy score is: ', gs_clf_LR.best_score_)
print('f1 score: ', metrics.f1_score(dev_labels, preds_pp, average='weighted'))
print(metrics.classification_report(dev_labels, preds_pp))

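With some real preprocessing, such as the regex lines I have commented out, I also see my accuracy and F1 score drop (which seems plausible, but I am stripping plural endings and was told that this should improve my scores).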

score 2 · Accepted Answer

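Have you set aside a randomly drawn test set from your data, kept outside the cross-validation, to test both models on? The drop in accuracy may be the result of better generalization, achieved by overfitting the data less.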
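To make the suggestion concrete, here is a minimal sketch of carving off such a held-out set using only the standard library. The helper name holdout_split and its parameters are hypothetical, chosen for illustration. scikit-learn's train_test_split serves the same purpose and is likely the more convenient choice in practice.

```python
import random


def holdout_split(docs, labels, test_fraction=0.2, seed=0):
    """Shuffle once and set aside a test set that cross-validation never sees.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    indices = list(range(len(docs)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_docs = [docs[i] for i in train_idx]
    train_labels = [labels[i] for i in train_idx]
    test_docs = [docs[i] for i in test_idx]
    test_labels = [labels[i] for i in test_idx]
    return train_docs, train_labels, test_docs, test_labels
```

Both pipelines would then be fitted (grid search included) on the training part only and compared once on the test part.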

score 0 · Accepted Answer

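The problem is that your preprocessing does essentially nothing, because in CountVectorizer the preprocessor runs before tokenization. Your function therefore receives the whole text of each document, and a regex anchored with $ can only match at the very end of the document, not at the end of each word.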
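A small sketch of the difference, using only the re module. The sample sentence is made up for illustration, and the token pattern is the CountVectorizer default quoted later in this answer.

```python
import re

text = "cats chasing dogs"

# Applied to the whole document, "s$" can only match at the very end.
whole = re.sub(r"s$", "", text)

# Applied per token, the same pattern strips the trailing "s" of every word.
tokens = re.findall(r"(?u)\b\w\w+\b", text)
per_token = [re.sub(r"s$", "", tok) for tok in tokens]

print(whole)      # cats chasing dog
print(per_token)  # ['cat', 'chasing', 'dog']
```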

Here is the result of fitting the vectorizer with your better_preprocessor:

In [16]: data = ['How are you guys doing? Fine! We are very satisfied']

In [17]: vec.fit(data)
Out[17]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1),
        preprocessor=<function better_preprocessor at 0x000002DB839FF048>,
        stop_words=None, strip_accents=None,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)

In [18]: vec.get_feature_names()
Out[18]: ['Fine', 'We', 'are', 'doing', 'guys', 'ow', 'satisfied', 'very', 'you']
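Note the capitalized 'Fine' and 'We' in the output above. A likely contributor to the score drop, even with a preprocessor that returns its input unchanged, is that supplying a custom preprocessor replaces the default preprocessing stage, which includes lowercasing. The sketch below compares the learned vocabularies; the toy document and the identity function are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Fine fine FINE"]  # toy document, for illustration only


def identity(text):
    # Stand-in for a preprocessor that "does nothing".
    return text


# Default vectorizer: text is lowercased before tokenization.
default_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)

# Custom preprocessor: the built-in lowercasing step is no longer applied.
identity_vocab = sorted(CountVectorizer(preprocessor=identity).fit(docs).vocabulary_)

print(default_vocab)   # ['fine']
print(identity_vocab)  # ['FINE', 'Fine', 'fine']
```

If that is what happens on your data, calling .lower() inside the preprocessor should bring the two runs back in line.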

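Because the preprocessor only ever sees whole documents, you have to override the analyzer step with your function rather than the preprocessor. Compare:

analyzer : string, {'word', 'char', 'char_wb'} or callable. Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries. If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.

preprocessor : callable or None (default). Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.

This does mean you have to handle tokenization in your function as well, but you can reuse the default token pattern '(?u)\\b\\w\\w+\\b', so that is not hard. In any case, I don't think your approach is robust, and I would suggest using something like NLTK's SnowballStemmer instead of these regexes.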
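A minimal sketch of such an analyzer, using only the re module. The function name stripping_analyzer and the merged suffix pattern are illustrative assumptions, not the asker's exact code; a stemmer such as NLTK's SnowballStemmer could take the place of the suffix regex.

```python
import re

# CountVectorizer's default token pattern: words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")
# The suffixes from the question, merged into one pattern (an assumption).
SUFFIX_PATTERN = re.compile(r"(ing|es|ed|s)$")


def stripping_analyzer(doc):
    """Lowercase, tokenize, then strip at most one suffix from each token."""
    tokens = TOKEN_PATTERN.findall(doc.lower())
    return [SUFFIX_PATTERN.sub("", tok) for tok in tokens]


print(stripping_analyzer("How are you guys doing? Fine! We are very satisfied"))
# ['how', 'are', 'you', 'guy', 'do', 'fine', 'we', 'are', 'very', 'satisfi']
```

It would be passed as CountVectorizer(analyzer=stripping_analyzer). Results such as 'satisfi' show how crude suffix stripping is, which is one reason a real stemmer tends to be the safer choice.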
