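I am running a grid search with cross-validation on a logistic regression model. I first have my default model, and then a model that is supposed to preprocess the data. The data are random text documents, each belonging to one of 4 categories. My preprocessor seems to lower my accuracy and F1 score even when I just have it return the data unchanged, as shown below. The grid search also selects its regularization parameter C after the data has passed through this preprocessor, which should not do anything.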
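
    from sklearn import metrics
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # transformed_train_data, transformed_dev_data and the label arrays are defined earlier (not shown)
    # Candidate values for the inverse regularization strength C
    Cs = {'C': [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}
    gs_clf_LR = GridSearchCV(LogisticRegression(penalty='l2'), Cs, refit=True)
    gs_clf_LR.fit(transformed_train_data, train_labels)
    preds = gs_clf_LR.predict(transformed_dev_data)
    #print(gs_clf_LR.score(transformed_dev_data, dev_labels))
    print(gs_clf_LR.best_params_)
    # best_score_ is the mean cross-validated score on the training data
    print('With optimal C, accuracy score is: ', gs_clf_LR.best_score_)
    print('f1 score: ', metrics.f1_score(dev_labels, preds, average='weighted'))
    print(metrics.classification_report(dev_labels, preds))
    print()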

    import re
    from sklearn.feature_extraction.text import CountVectorizer

    def better_preprocessor(string):
        # Candidate regex transformations, currently commented out
        #return re.sub(r'^[A-Z]', '^[a-z]', string)
        #return re.sub(r'(ing)$', '', string)
        #return re.sub(r'(es)$', '', string)
        #return re.sub(r's$', '', string)
        #return re.sub(r'(ed)$', '', string)
        return string

    # Re-vectorize with the custom preprocessor, then rerun the same grid search
    vec = CountVectorizer(preprocessor=better_preprocessor)
    transformed_preprocessed_train_data = vec.fit_transform(train_data)
    transformed_preprocessed_dev_data = vec.transform(dev_data)
    gs_clf_LR.fit(transformed_preprocessed_train_data, train_labels)
    preds_pp = gs_clf_LR.predict(transformed_preprocessed_dev_data)
    #print(gs_clf_LR.score(transformed_preprocessed_dev_data, dev_labels))
    print(gs_clf_LR.best_params_)
    print('With optimal C, accuracy score is: ', gs_clf_LR.best_score_)
    print('f1 score: ', metrics.f1_score(dev_labels, preds_pp, average='weighted'))
    print(metrics.classification_report(dev_labels, preds_pp))

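
One thing that may be worth checking for the no-op case: according to the scikit-learn documentation, a custom `preprocessor` overrides the default preprocessing stage (`strip_accents` and `lowercase`), so a function that returns its input unchanged is probably not equivalent to the default vectorizer. A minimal sketch on toy documents, where the `docs` list and the `identity_preprocessor` name are made up for illustration and are not part of the code above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (hypothetical, not the corpus from the question)
docs = ["Cats chase Dogs", "dogs chase cats"]

def identity_preprocessor(text):
    # Returns the document untouched, like better_preprocessor above
    return text

# Default vectorizer: lowercases before tokenizing
default_vocab = set(CountVectorizer().fit(docs).vocabulary_)

# Custom preprocessor: the built-in lowercasing step is bypassed
custom_vocab = set(
    CountVectorizer(preprocessor=identity_preprocessor).fit(docs).vocabulary_
)

print(sorted(default_vocab))  # ['cats', 'chase', 'dogs']
print(sorted(custom_vocab))   # ['Cats', 'Dogs', 'cats', 'chase', 'dogs']
```

If this is what is happening here, the two runs are fitted on different vocabularies, which could explain a change in both the scores and the selected C. Returning `string.lower()` from the preprocessor would be one way to test that.
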
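With some real preprocessing, such as the regex lines I have commented out, I also see my accuracy and F1 score go down (which seems plausible, but I am removing plurals and was told that this should improve my score).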
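
To see what the commented-out patterns actually do, here is a minimal sketch on a toy string. It relies on the fact that `CountVectorizer` passes each whole document to the preprocessor as a single string, so `$` matches only at the end of the document, not at the end of each word. The sample text and the `strip_plural_naive` helper are assumptions for illustration only:

```python
import re

# Toy document (hypothetical, not from the question's corpus)
doc = "Dogs chased the cats. The cats escaped"

# '$' anchors at the end of the whole string; this document ends in 'd',
# so no plural 's' is removed anywhere
end_anchored = re.sub(r's$', '', doc)
print(end_anchored)  # Dogs chased the cats. The cats escaped

# The second argument of re.sub is literal replacement text, not a pattern,
# so this inserts the characters '^[a-z]' instead of lowercasing
literal_replacement = re.sub(r'^[A-Z]', '^[a-z]', doc)
print(literal_replacement)  # ^[a-z]ogs chased the cats. The cats escaped

def strip_plural_naive(text):
    # Naive per-word variant: '\b' matches at every word end.
    # It would also mangle words such as 'is' or 'this'.
    return re.sub(r's\b', '', text)

per_word = strip_plural_naive(doc)
print(per_word)  # Dog chased the cat. The cat escaped
```

The per-word version is only a rough stand-in for real stemming; a proper stemmer would likely handle plurals more reliably than a single regex.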