I have this function, which is given a text document containing many words. It chokes when it hits words like the ones below. How do I specify the correct encoding? I've tried encoding='string', encoding='unicode', and so on, as well as decode_error='ignore', but none of it works.
co¤a co¤azo
from sklearn.feature_extraction.text import CountVectorizer

def tokenize(text):
    sentence = [text]
    # Word-level pass; this is where it chokes on the problem words.
    ngv2 = CountVectorizer(encoding='utf-8', analyzer='word', min_df=1,
                           stop_words='english')
    try:
        ngv2.fit_transform(sentence)
    except Exception:
        print sentence
    S = ngv2.get_feature_names()
    # Character n-gram pass over the extracted words.
    ngw = CountVectorizer(analyzer='char_wb', ngram_range=(3, 7), min_df=1)
    ngw.fit_transform(S)
    return ngw.get_feature_names()
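For context, here is a minimal standalone sketch of the decode step as I understand it (the raw byte string is my guess at what the document contains; '¤' is byte 0xA4 in latin-1 but is not a valid byte in UTF-8):

from sklearn.feature_extraction.text import CountVectorizer

# Assumed raw bytes: 0xA4 is '¤' in latin-1 but invalid in UTF-8, so
# encoding='utf-8' with the default decode_error='strict' raises
# UnicodeDecodeError on this input; naming the real codec avoids that.
raw = 'co\xa4a co\xa4azo'
cv = CountVectorizer(encoding='latin-1', analyzer='word', min_df=1)
cv.fit_transform([raw])
# Note that '¤' is not a word character, so the default token_pattern
# splits each word around it.
print cv.get_feature_names()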
Edit: I changed the code so that it skips past the exception, and boiled it down to the simplest failing case (the snippet that raises, followed by the error output):
ngv2 = CountVectorizer(decode_error='replace', analyzer='word', min_df=1,
                       stop_words='english')
try:
    ngv2.fit_transform(sentence)
except Exception as e:
    print sentence, e.message
For the input '[]p[]p[', this prints:
empty vocabulary; perhaps the documents only contain stop words
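Here is a minimal standalone sketch (my own repro, not the original pipeline) that triggers the same error with pure-ASCII input, which makes me suspect the tokenizer rather than the encoding:

from sklearn.feature_extraction.text import CountVectorizer

# The default token_pattern, r'(?u)\b\w\w+\b', only matches runs of two or
# more word characters, so '[]p[]p[' yields no tokens at all and the
# vocabulary ends up empty.
cv = CountVectorizer(decode_error='replace', analyzer='word', min_df=1,
                     stop_words='english')
try:
    cv.fit_transform(['[]p[]p['])
except ValueError as e:
    print e.message  # empty vocabulary; perhaps the documents only contain stop words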