python-2.7 - sklearn CountVectorizer

Question

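I have a question about using vocabulary_.get; the code is below. I used CountVectorizer in one of my machine learning exercises to look up the feature number of specific words.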

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
s1 = 'KJ YOU WILL BE FINE'
s2 = 'ABHI IS MY BESTIE'
s3 = 'sam is my bestie'
frnd_list = [s1,s2,s3]
bag_of_words = vectorizer.fit(frnd_list)
bag_of_words = vectorizer.transform(frnd_list)
print(bag_of_words)
# Look up the feature number (column index) for a given word, e.g.:
print(vectorizer.vocabulary_.get('bestie'))
print(vectorizer.vocabulary_.get('BESTIE'))

Output:

Bag_of_words is :
(0, 1)  1
(0, 3)  1
(0, 5)  1
(0, 8)  1
(0, 9)  1
(1, 0)  1
(1, 2)  1
(1, 4)  1
(1, 6)  1
(2, 2)  1
(2, 4)  1
(2, 6)  1
(2, 7)  1

'bestie' has  feature number:
 2
'BESTIE' has feature number:
 None

So my question is: why does 'bestie' return the correct feature number, 2, while 'BESTIE' returns None? Does vocabulary_.get not work well with uppercase words?

score 2 · Accepted Answer

CountVectorizer takes a parameter lowercase, which defaults to True, as stated in the documentation:

lowercase : boolean, True by default
    Convert all characters to lowercase before tokenizing.

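Change it to False if you want lowercase and uppercase words to be treated differently. With the default, every token is lowercased before the vocabulary is built, so 'BESTIE' is never stored as a key and vocabulary_.get('BESTIE') returns None.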
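To make the difference concrete, here is a minimal sketch that fits the same three sentences from the question twice, once with the default settings and once with lowercase=False. The variable names docs, default_vec and cased_vec are mine, not from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same three sentences as in the question.
docs = ['KJ YOU WILL BE FINE', 'ABHI IS MY BESTIE', 'sam is my bestie']

# Default behaviour: lowercase=True, so every token is lowercased before
# the vocabulary is built and only 'bestie' exists as a key.
default_vec = CountVectorizer()
default_vec.fit(docs)
print(default_vec.vocabulary_.get('bestie'))  # an integer feature index
print(default_vec.vocabulary_.get('BESTIE'))  # None

# Case-sensitive behaviour: 'BESTIE' and 'bestie' are kept as two
# separate features, each with its own index.
cased_vec = CountVectorizer(lowercase=False)
cased_vec.fit(docs)
print(cased_vec.vocabulary_.get('bestie'))
print(cased_vec.vocabulary_.get('BESTIE'))
```

Note that with lowercase=False the vocabulary grows, because 'IS'/'is' and 'MY'/'my' are also counted as distinct words.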

score 0 · Accepted Answer

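CountVectorizer takes a parameter lowercase, which is True by default.

If you want to distinguish between uppercase and lowercase letters, set lowercase=False.

For more information, see http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html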
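If you would rather keep the default lowercasing, another option is to normalize the query word the same way the vectorizer normalizes its input before looking it up. The sketch below uses build_analyzer() for that; feature_index is a hypothetical helper name of my own, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['KJ YOU WILL BE FINE', 'ABHI IS MY BESTIE', 'sam is my bestie']

vectorizer = CountVectorizer()  # default: lowercase=True
vectorizer.fit(docs)


def feature_index(vec, word):
    """Hypothetical helper: look up a single word after applying the
    vectorizer's own preprocessing and tokenization."""
    tokens = vec.build_analyzer()(word)
    # Expect exactly one token; anything else is not a single-word query.
    if len(tokens) != 1:
        return None
    return vec.vocabulary_.get(tokens[0])


# Both spellings now resolve to the same feature.
print(feature_index(vectorizer, 'BESTIE'))
print(feature_index(vectorizer, 'bestie'))
```

This keeps the lookup consistent with whatever settings the vectorizer was created with, rather than calling .lower() by hand.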
