python - UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data

Question

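I am writing code to stem tweets, but I am running into encoding problems. When I try to apply the Porter stemmer, it raises an error. Perhaps I am not tokenizing the text properly.

My code is as follows: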

import sys
import pandas as pd
import nltk
import scipy as sp
from nltk.classify import NaiveBayesClassifier
from nltk.stem import PorterStemmer
reload(sys)  
sys.setdefaultencoding('utf8')


stemmer=nltk.stem.PorterStemmer()

p_test = pd.read_csv('TestSA.csv')
train = pd.read_csv('TrainSA.csv')

def word_feats(words):
    return dict([(word, True) for word in words])

for i in range(len(train)-1):
    t = []
    #train.SentimentText[i] = " ".join(t)
    for word in nltk.word_tokenize(train.SentimentText[i]):
        t.append(stemmer.stem(word))
    train.SentimentText[i] = ' '.join(t)

When I run it, I get this error:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-10-5aa856d0307f> in <module>()
     23     #train.SentimentText[i] = " ".join(t)
     24     for word in nltk.word_tokenize(train.SentimentText[i]):
---> 25         t.append(stemmer.stem(word))
     26     train.SentimentText[i] = ' '.join(t)
     27 

/usr/lib/python2.7/site-packages/nltk/stem/porter.pyc in stem(self, word)
    631     def stem(self, word):
    632         stem = self.stem_word(word.lower(), 0, len(word) - 1)
--> 633         return self._adjust_case(word, stem)
    634 
    635     ## --NLTK--

/usr/lib/python2.7/site-packages/nltk/stem/porter.pyc in _adjust_case(self, word, stem)
    602         for x in range(len(stem)):
    603             if lower[x] == stem[x]:
--> 604                 ret += word[x]
    605             else:
    606                 ret += stem[x]

/usr/lib64/python2.7/encodings/utf_8.pyc in decode(input, errors)
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data

Does anyone have a clue what is wrong with my code? I am stuck on this error. Any suggestions?

score 3 · Accepted Answer

I think the key line is 604, one frame above where the error is raised:

--> 604                 ret += word[x]

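Probably ret is a Unicode string and word is a byte string. In Python 2, adding a byte string to a Unicode string implicitly decodes the bytes with the default encoding, and you can't decode UTF-8 one byte at a time, which is what that loop ends up doing.

The problem is that read_csv is returning bytes, and you are trying to do text processing on those bytes. That simply won't work; the bytes must be decoded to Unicode first. I think you can use: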
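To see why decoding a single byte fails, here is a minimal sketch, written for Python 3. The sample word "café" is only an illustration and is not taken from the question's data. In UTF-8, "é" is encoded as the two bytes 0xc3 0xa9, so decoding the first of them on its own gives the same "unexpected end of data" message as the traceback:

```python
# 'é' takes two bytes in UTF-8: 0xc3 0xa9
data = "café".encode("utf-8")      # b'caf\xc3\xa9'

# Decoding the complete byte string works.
whole = data.decode("utf-8")

# Decoding only the first byte of the two-byte sequence fails.
first_byte_of_e = data[3:4]        # b'\xc3'
try:
    first_byte_of_e.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(whole)
print(error_message)
```

In the question's Python 2 code the decode is implicit: `ret += word[x]` hands one byte of `word` to the default codec, which `sys.setdefaultencoding('utf8')` had set to UTF-8.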

pandas.read_csv(filename, encoding='utf-8')
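As a rough sketch of what that looks like, assuming Python 3 and a UTF-8 encoded file: the in-memory CSV below is a hypothetical stand-in for TrainSA.csv, and only the column name SentimentText comes from the question.

```python
import io

import pandas as pd

# Hypothetical stand-in for TrainSA.csv: UTF-8 bytes with one non-ASCII character.
raw = "SentimentText\nI love this café\n".encode("utf-8")

# encoding='utf-8' tells pandas how to decode the file's bytes into text.
train = pd.read_csv(io.BytesIO(raw), encoding="utf-8")

text = train["SentimentText"][0]

# Each element is now a whole character, so a per-character loop
# (like the one in the stemmer's _adjust_case) never sees half of 'é'.
chars = [ch for ch in text]
print(type(text).__name__, chars[-1])
```

If the real file is not UTF-8, the encoding argument would need to name whatever encoding it actually uses.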

If possible, use Python 3. There, trying to concatenate bytes and unicode always raises an error, which makes these problems much easier to spot.
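A small illustration of that Python 3 behavior, with made-up sample values: mixing str and bytes fails immediately with a TypeError, whatever the data contains, rather than failing only when a non-ASCII byte happens to show up.

```python
text = "caf"                   # str (Unicode text)
raw = "é".encode("utf-8")      # bytes: b'\xc3\xa9'

try:
    combined = text + raw      # mixing str and bytes is rejected outright
except TypeError as exc:
    combined = None
    print("TypeError:", exc)

# Decoding explicitly before combining works.
fixed = text + raw.decode("utf-8")
print(fixed)
```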
