I am trying to use sklearn's CountVectorizer with a given vocabulary. My vocabulary is:

['humanitarian crisis', 'vacations for the anti-cruise crowd', 'school textbook', "b'cruise vacations for the anti-cruise", 'budget deal', "b'public school", 'u.n. announces', 'wrong petrol', 'vacations for the anti-cruise', "b'cruise vacations for the anti-cruise crowd"]

The input to vectorize comes from a pandas DataFrame, which I read from a CSV using pd.read_csv with encoding='utf8':

29371            b'9 quirky and brilliant paris boutiques'
20525    b'public school textbook filled with muslim bi...
2871     b'congress focuses on averting shutdown, but t...
29902    b'yarmouk siege: u.n. announces trip to syria ...
45596    b'fracking protesters arrested for gluing them...
6266         b'cruise vacations for the anti-cruise crowd'

After calling CountVectorizer(vocabulary=vocabulary).fit_transform(), I get a matrix that is all zeros:

(<6x10 sparse matrix of type '<type 'numpy.int64'>'
    with 0 stored elements in Compressed Sparse Row format>, <class 'scipy.sparse.csr.csr_matrix'>)

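Is this a problem with the string type, or with how I am calling CountVectorizer? I am not sure how to convert the string type; I have tried many different encode and decode calls in Python 2.7 and pandas. Any suggestions would be appreciated.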

1 Answer

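Use "ngram_range = (min_word_count, max_word_count)" when calling CountVectorizer. With the default ngram_range of (1, 1) only single words are generated, so multi-word vocabulary entries can never match.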
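
A minimal sketch of the effect of ngram_range with a fixed vocabulary. The two documents and the shortened vocabulary below are made up for illustration, not taken from the asker's data, and the default tokenizer settings are assumed:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative vocabulary: two plain phrases plus one entry that keeps the b' prefix.
vocabulary = ["school textbook", "budget deal", "b'public school"]

# Made-up documents in the same b'...' style as the question's data.
docs = [
    "b'public school textbook filled with errors'",
    "b'congress reaches budget deal'",
]

# Default ngram_range=(1, 1): only single words are produced, so no phrase matches.
unigram_counts = CountVectorizer(vocabulary=vocabulary).fit_transform(docs)

# ngram_range=(1, 2): two-word n-grams are produced as well.
bigram_counts = CountVectorizer(
    vocabulary=vocabulary, ngram_range=(1, 2)
).fit_transform(docs)

print(unigram_counts.nnz)       # number of non-zero entries: 0
print(bigram_counts.toarray())  # [[1 0 0], [0 1 0]]
```

Note that the third column stays at zero even with bigrams. The default token_pattern drops punctuation and single-character tokens, so a vocabulary entry that starts with b' or contains "anti-cruise" or "u.n." will probably still not match unless the vocabulary is cleaned up or a custom token_pattern or tokenizer is supplied.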

Answered 2018-12-15T03:30:58.353