machine-learning - Pipeline that uses CountVectorizer (max_df) before tfidf

Question

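I'm not sure whether this question belongs on Stack Overflow or on a more theoretical statistics Q&A site, but I'm confused about the following.

I'm working on a binary text classification task. For this task I use a pipeline; one of the example setups is shown below: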

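from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('vect', CountVectorizer()),      # raw term counts
    ('tfidf', TfidfTransformer()),    # reweight the counts with tf-idf
    ('clf', LogisticRegression())
])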

from nltk.corpus import stopwords

# stopwordList is my own custom list of stop words
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'vect__stop_words': [None, stopwords.words('dutch'), stopwordList],
    'clf__C': [0.1, 1, 10, 100, 1000]
}
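For context, a grid like this would typically be handed to a cross-validated search. The sketch below assumes `GridSearchCV` with `scoring='f1'`, which is not stated above. It also uses a tiny made-up corpus (`texts`, `labels`) and a reduced grid so that it runs on its own. It only illustrates the mechanics and is not the actual setup.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data: 4 positive and 4 negative documents.
texts = [
    "good great film", "great acting good plot",
    "good fun movie", "great fun",
    "bad boring film", "boring plot bad acting",
    "bad dull movie", "dull boring",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression()),
])

# Reduced grid; keys follow the "<step name>__<parameter>" convention.
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1, 10],
}

# Assumption: F1 scoring with 2-fold cross-validation.
search = GridSearchCV(pipeline, parameters, scoring='f1', cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Because the vectorizer sits inside the pipeline, its vocabulary (and any `max_df` or `max_features` filtering) is refit on the training part of every fold.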

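Nothing unusual so far, but then I started experimenting with the parameter options and noticed that the code below (that is, these steps and parameters) gives the highest score (F1 score):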

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression())
])

parameters = {
    'vect__ngram_range': [(1,1)],  
    'vect__stop_words': [None],
    'vect__max_df': [0.2], 
    'vect__max_features': [10000],
    'clf__C': [100]
}

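I'm glad to have found the parameter settings and approach that give me the highest score, but I can't work out what they actually mean. In the "vectorizer" step, it seems strange (or somehow redundant) to apply max_df, which ignores terms that appear in more than 20% of the documents, before tfidf.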

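In addition, it uses a max_features of 10,000. Which is applied first, max_df or max_features? And how should I interpret setting max_features and then running tfidf afterwards? Is tfidf then computed on only those 10,000 features?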
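One way to probe this is a toy experiment. It is a minimal sketch on a made-up five-document corpus (`docs` is hypothetical) and relies on the documented behaviour that max_features keeps the top terms ordered by term frequency across the corpus. If max_features were applied before max_df, "the" (the most frequent term) would take one of the two slots. It does not appear in the vocabulary, which suggests the max_df filter runs first and max_features then picks from what is left.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus: "the" occurs in every document.
docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the fish swam",
    "the cat ran",
]

# max_df=0.5 drops terms found in more than half of the documents ("the").
# max_features=2 keeps the 2 most frequent terms among those that remain.
vect = CountVectorizer(max_df=0.5, max_features=2)
counts = vect.fit_transform(docs)

vocab = sorted(vect.vocabulary_)
print(vocab)        # ['cat', 'sat']

# TfidfTransformer only sees the columns that survived the vectorizer.
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)  # (5, 2)
```

So in this sketch tfidf is computed on the reduced count matrix only: the idf weights are estimated from the surviving columns, and documents that contain none of the kept terms become all-zero rows.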

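To me it seems odd to run tfidf after applying parameters such as max_df and max_features. Am I right? If so, why? Or should I simply go with whatever gives the highest score?

I hope someone can point me in the right direction. Many thanks.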
