python - Using more than just count features for Naive Bayes learning with sklearn

Question

First of all, I am new to Python and to NLP / machine learning. Right now I have the following code:

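from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# data and myTokenizer are defined elsewhere in my code
vectorizer = CountVectorizer(
    input="content",
    decode_error="ignore",
    strip_accents=None,
    stop_words=stopwords.words('english'),
    tokenizer=myTokenizer,
)
counts = vectorizer.fit_transform(data['message'].values)
classifier = MultinomialNB()
targets = data['sentiment'].values
classifier.fit(counts, targets)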

This actually works fine. CountVectorizer gives me a sparse matrix, and the classifier is fit on that matrix and the targets (0, 2, 4).

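However, what should I do if I want to use more features in the vector than just the word counts? I can't seem to find that anywhere. Thanks in advance.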

score 1 · Accepted Answer

In your case counts is a sparse matrix; you can add columns to it that hold the extra features:

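import numpy as np
from scipy import sparse as sp

counts = vectorizer.fit_transform(data['message'].values)
# placeholder extra feature: a single column of ones
ones = np.ones(shape=(len(data), 1))
# append the new column to the right of the count matrix
X = sp.hstack([counts, ones])

classifier.fit(X, targets)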
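
To make the column-stacking idea more concrete, here is a minimal sketch with a real (if simplistic) extra feature instead of a column of ones. The toy messages, the labels, and the choice of message length as the extra feature are assumptions made for illustration; they are not from the question.

```python
import numpy as np
from scipy import sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for data['message'] and data['sentiment'].
messages = ["good movie", "bad movie", "really really good", "so bad it hurts"]
targets = np.array([4, 0, 4, 0])

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(messages)

# Extra feature (an assumption for illustration): message length in characters.
# MultinomialNB needs non-negative feature values, which a length satisfies.
lengths = np.array([[len(m)] for m in messages])

# Stack column-wise and convert to CSR, which supports row slicing.
X = sp.hstack([counts, sp.csr_matrix(lengths)]).tocsr()

classifier = MultinomialNB()
classifier.fit(X, targets)

# New messages must go through the same steps, in the same column order.
new_messages = ["good"]
new_counts = vectorizer.transform(new_messages)
new_lengths = np.array([[len(m)] for m in new_messages])
X_new = sp.hstack([new_counts, sp.csr_matrix(new_lengths)]).tocsr()
prediction = classifier.predict(X_new)
```

One thing to watch: MultinomialNB treats every column like a count, so a raw value such as a character length can easily outweigh the word counts. Binning or rescaling the extra feature may be worth trying.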

scikit-learn also provides a built-in helper for this; it is called FeatureUnion. The scikit-learn docs have an example of combining features from two transformers:

from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion

estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
combined = FeatureUnion(estimators)

# then you can do this:
X = combined.fit_transform(my_data)

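FeatureUnion does almost the same thing: it takes a list of named transformers (vectorizers, for example), calls each of them on the same input data, and then concatenates the results column-wise.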

It is usually better to use FeatureUnion, because it makes it easier to use scikit-learn cross-validation, to pickle the final pipeline, and so on.
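
For the text case in the question, a FeatureUnion-based setup might look like the sketch below. The `message_lengths` helper, the toy data, and the step names are hypothetical; wrapping a plain function with FunctionTransformer is just one way to turn a custom feature into a transformer.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def message_lengths(messages):
    """Hypothetical extra feature: one column with each message's length."""
    return np.array([[len(m)] for m in messages])


# Hypothetical toy data standing in for data['message'] and data['sentiment'].
messages = [
    "good movie", "bad movie", "really really good", "so bad it hurts",
    "good fun", "bad plot", "good cast", "bad ending",
]
targets = np.array([4, 0, 4, 0, 4, 0, 4, 0])

# Both transformers see the same raw messages; outputs are joined column-wise.
features = FeatureUnion([
    ("counts", CountVectorizer()),
    ("length", FunctionTransformer(message_lengths, validate=False)),
])
model = Pipeline([("features", features), ("nb", MultinomialNB())])

# The whole pipeline is refit on each training fold, so the vocabulary
# is never learned from the held-out messages.
scores = cross_val_score(model, messages, targets, cv=2)

model.fit(messages, targets)
prediction = model.predict(["good ending"])
```

Because feature extraction and the classifier live in one object, `model` can be passed to cross-validation helpers or pickled as a single unit.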

See also these tutorials:

score 0 · Accepted Answer

It depends on your data and on what you are trying to do. Besides word counts, there are other transformation methods you can use: Bag of Words, TF-IDF, Word Vectors, ...
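
As one example of such an alternative, TF-IDF weighting can be swapped in for raw counts with very little code. This is a minimal sketch on hypothetical toy data; TF-IDF values are fractional but non-negative, so MultinomialNB accepts them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for data['message'] and data['sentiment'].
messages = ["good movie", "bad movie", "really really good", "so bad it hurts"]
targets = [4, 0, 4, 0]

# TfidfVectorizer tokenizes like CountVectorizer, then reweights each count
# by inverse document frequency and normalizes every row.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(messages)

classifier = MultinomialNB()
classifier.fit(X, targets)

prediction = classifier.predict(tfidf.transform(["really good"]))
```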

You can learn more from these documents:
- http://billchambers.me/tutorials/2015/01/14/python-nlp-cheatsheet-nltk-scikit-learn.html
- http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
