python-3.x - How to use scikit-learn CountVectorizer?

Question

I have a set of words and I need to check whether they are present in documents.

WordList = [w1, w2, ..., wn]

A second set holds the list of documents in which I have to check whether these words are present.

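How can I use scikit-learn's CountVectorizer so that the features of the term-document matrix are only the words from WordList, each row represents one particular document, and each column holds the number of times the corresponding word from the list appears in that document?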

score 2 · Accepted Answer

For custom documents you can use CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()  # create a CountVectorizer object
corpus = [
    'This is a cat.',
    'It likes to roam in the garden',
    'It is black in color',
    'The cat does not like the dog.',
]
X = vectorizer.fit_transform(corpus)
# print(X) shows the count assigned to each word

# With the default settings the features are lowercased and sorted
# alphabetically, and single-character tokens such as 'a' are dropped:
# black, cat, color, does, dog, garden, in, is, it, like, likes, not,
# roam, the, this, to
# get_feature_names_out() replaces the older get_feature_names().
print(vectorizer.get_feature_names_out())

X.toarray()
# converts the sparse matrix X into a NumPy array

vectorizer.transform(['A new cat.']).toarray()
# counts the learned words in a new document

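You can also use other vectorizers, such as TfidfVectorizer. The TF-IDF vectorizer is often the better choice, because it reflects not only how many times a word occurs in a particular document but also how important that word is.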

It is computed from TF (term frequency) and IDF (inverse document frequency).

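Term frequency (TF) is the number of times a word occurs in a particular document, while IDF is computed across the whole collection of documents. For example, if the documents are about football, the word "the" gives no insight, but the word "messi" tells you about the context of the document. IDF is calculated by taking the log of the total number of documents divided by the number of documents that contain the word. For example, assuming 10 documents in which "the" occurs in all 10 and "messi" in 3 (these document counts are an assumption chosen to match the results shown, using a base-10 log):

tf("the") = 10
tf("messi") = 5

idf("the") = log(10/10) = 0
idf("messi") = log(10/3) = 0.52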

tfidf("the") = tf("the") * idf("the") = 10 * 0 = 0
tfidf("messi") = 5 * 0.52 = 2.6
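
The results 0 and 0.52 are consistent with the common definition idf = log10(N / df), where N is the number of documents and df is the number of documents containing the word. The sketch below checks the arithmetic under that reading. The corpus size of 10 and the document frequencies (10 for "the", 3 for "messi") are assumptions chosen to reproduce the numbers in the answer, not values the answer states.

```python
import math

n_documents = 10                      # assumed corpus size
doc_freq = {"the": 10, "messi": 3}    # assumed number of documents containing each word
term_freq = {"the": 10, "messi": 5}   # term counts taken from the example


def idf(word):
    """Inverse document frequency with a base-10 logarithm."""
    return math.log10(n_documents / doc_freq[word])


def tfidf(word):
    """Term frequency multiplied by inverse document frequency."""
    return term_freq[word] * idf(word)


for word in term_freq:
    print(word, round(idf(word), 2), round(tfidf(word), 2))
```

The unrounded product for "messi" is slightly above 2.6, because the hand calculation multiplies by the already rounded 0.52.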

These weights help an algorithm identify the important words in the documents, which in turn helps derive semantics from them.

score 2 · Accepted Answer

OK, I got it. The code is as follows:

from sklearn.feature_extraction.text import CountVectorizer

# Count the number of times each word (unigram) appears in a document.
vectorizer = CountVectorizer(input='content', binary=False, ngram_range=(1, 1))
# First set the vocabulary by fitting on the word list from the question.
vectorizer = vectorizer.fit(WordList)
# Now transform the text of each document; Document_List is a list of strings.
tfMatrix = vectorizer.transform(Document_List).toarray()

This outputs a term-document matrix whose features are only the words from WordList.
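
An alternative worth considering is to pass the word list through CountVectorizer's `vocabulary` parameter, which fixes the columns without fitting on the words first. The sketch below is a minimal example, not the answer's own method. `word_list` and `document_list` are hypothetical stand-ins for the question's WordList and list of documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the question's WordList and document list.
word_list = ['cat', 'dog', 'garden']
document_list = [
    'The cat does not like the dog, but the dog likes the cat.',
    'It likes to roam in the garden',
    'It is black in color',
]

# A fixed vocabulary keeps the columns in the order of word_list.
vectorizer = CountVectorizer(vocabulary=word_list)
counts = vectorizer.fit_transform(document_list).toarray()
print(counts)
# [[2 2 0]
#  [0 0 1]
#  [0 0 0]]

# binary=True reports presence (1) or absence (0) instead of counts.
presence = CountVectorizer(vocabulary=word_list, binary=True) \
    .fit_transform(document_list).toarray()
print(presence)
```

Because documents are lowercased by default, the words in the list should be lowercase too, otherwise they will never match.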
