scikit-learn - Providing stop_words that include bigrams in sklearn CountVectorizer()

Translated from: https://stackoverflow.com/questions/47338182 2017-11-16T20:04:03.050


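Is there a cheap and easy way to keep sklearn's CountVectorizer from applying the stop_words parameter only to unigrams, so that it filters out bigrams as well? The following snippet illustrates what I mean: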

from sklearn.feature_extraction.text import CountVectorizer

texts = ['hello this is text number one yes yes',
        'hello this is text number two stackflow']

stop_words = {'hello this'}

model = CountVectorizer(analyzer='word', 
                        ngram_range=(1,2), 
                        max_features=3,
                        stop_words=stop_words)

doc_vectors = model.fit_transform(texts).toarray()
print(doc_vectors)
print(model.get_feature_names())

This code outputs the following:

>>> [[1 1 1]
>>>  [1 1 1]]
>>> ['hello', 'hello this', 'is']

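As you can see, I would like the bigram 'hello this' to be filtered out, since it was passed in stop_words, yet it is still counted. I have seen some posts that use pipelines or custom analyzers, and I have browsed the documentation, but isn't there a simpler way to solve this?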

Thanks!
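
For reference, here is a minimal sketch of the custom-analyzer route mentioned in the question. It is one possible workaround, not a built-in option. It relies on CountVectorizer's word analyzer removing stop words from the token list before it builds n-grams, which is why a two-word entry such as 'hello this' never matches anything. The names stop_ngrams and analyzer_without_stop_ngrams are hypothetical and chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['hello this is text number one yes yes',
         'hello this is text number two stackflow']

# Hypothetical name: n-grams (unigrams or bigrams) to drop after n-gram building.
stop_ngrams = {'hello this'}

# Reuse the default word analyzer (lowercasing, tokenizing, building 1- and 2-grams).
base_analyzer = CountVectorizer(analyzer='word',
                                ngram_range=(1, 2)).build_analyzer()

def analyzer_without_stop_ngrams(doc):
    """Return the default 1- and 2-grams of doc, minus any entry in stop_ngrams."""
    return [gram for gram in base_analyzer(doc) if gram not in stop_ngrams]

# With a callable analyzer, ngram_range and stop_words on this model are ignored,
# so all filtering happens inside analyzer_without_stop_ngrams.
model = CountVectorizer(analyzer=analyzer_without_stop_ngrams, max_features=3)

doc_vectors = model.fit_transform(texts).toarray()
# vocabulary_ maps each kept n-gram to its column index.
feature_names = sorted(model.vocabulary_, key=model.vocabulary_.get)
print(doc_vectors)
print(feature_names)
```

Filtering after the n-grams are built keeps the unigrams 'hello' and 'this' in the vocabulary; to drop those too, they would have to be added to stop_ngrams as separate entries.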
