I have this function, which is given a text document containing many words. It chokes when it hits words like the ones below. How do I specify the correct encoding? I've tried encoding='string', encoding='unicode', and so on, as well as decode_error='ignore', but none of it works.
co¤a co¤azo
from sklearn.feature_extraction.text import CountVectorizer

def tokenize(text):
    sentence = [text]
    # Word-level pass; this is where it chokes on the problem words.
    ngv2 = CountVectorizer(encoding='utf-8', analyzer='word', min_df=1,
                           stop_words='english')
    try:
        ngv2.fit_transform(sentence)
    except Exception:
        print sentence
    S = ngv2.get_feature_names()
    # Character n-gram pass over the extracted words.
    ngw = CountVectorizer(analyzer='char_wb', ngram_range=(3, 7), min_df=1)
    ngw.fit_transform(S)
    return ngw.get_feature_names()
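For context, here is a minimal standalone sketch of the decode step as I understand it (the raw byte string is my guess at what the document contains; '¤' is byte 0xA4 in latin-1 but is not a valid byte in UTF-8):

from sklearn.feature_extraction.text import CountVectorizer

# Assumed raw bytes: 0xA4 is '¤' in latin-1 but invalid in UTF-8, so
# encoding='utf-8' with the default decode_error='strict' raises
# UnicodeDecodeError on this input; naming the real codec avoids that.
raw = 'co\xa4a co\xa4azo'
cv = CountVectorizer(encoding='latin-1', analyzer='word', min_df=1)
cv.fit_transform([raw])
# Note that '¤' is not a word character, so the default token_pattern
# splits each word around it.
print cv.get_feature_names()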
Edit: I changed the code so that it skips past the exception, and boiled it down to the simplest failing case (the snippet that raises, followed by the error output):
ngv2 = CountVectorizer(decode_error='replace', analyzer='word', min_df=1,
                       stop_words='english')
try:
    ngv2.fit_transform(sentence)
except Exception as e:
    print sentence, e.message
For the input '[]p[]p[', this prints:
empty vocabulary; perhaps the documents only contain stop words
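Here is a minimal standalone sketch (my own repro, not the original pipeline) that triggers the same error with pure-ASCII input, which makes me suspect the tokenizer rather than the encoding:

from sklearn.feature_extraction.text import CountVectorizer

# The default token_pattern, r'(?u)\b\w\w+\b', only matches runs of two or
# more word characters, so '[]p[]p[' yields no tokens at all and the
# vocabulary ends up empty.
cv = CountVectorizer(decode_error='replace', analyzer='word', min_df=1,
                     stop_words='english')
try:
    cv.fit_transform(['[]p[]p['])
except ValueError as e:
    print e.message  # empty vocabulary; perhaps the documents only contain stop words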