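I have been trying to scrape data from a website and write what I find to a file. More than 90% of the time I don't hit any Unicode errors, but when the data contains characters like those in "Burger King®, Hans Café", the write to the file fails, so my error handling prints the line to the screen as-is, without any further errors.
I have tried the encode and decode functions with various encodings, but to no avail.
Below is an excerpt of the code I currently have: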
import urllib2,sys
import re
import os
import urllib
import string
import time
from BeautifulSoup import BeautifulSoup,NavigableString, SoupStrainer
from string import maketrans
import codecs
f=codecs.open('alldetails7.txt', mode='w', encoding='utf-8', errors='replace')
...
soup5 = BeautifulSoup(html5)
enc_s5 = soup5.originalEncoding
for company in iter(soup5.findAll(height="20px")):
    stream = ""
    count_detail = 1
    for tag in iter(company.findAll('td')):
        if count_detail > 1:
            stream = stream + tag.text.replace(u',',u';')
            if count_detail < 4 :
                stream=stream+","
        count_detail = count_detail + 1
    stream.strip()
    try:
        f.write(str(stnum)+","+br_name_addr+","+stream.decode(enc_s5)+os.linesep)
    except:
        print "Unicode error ->"+str(storenum)+","+branch_name_address+","+stream
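
For comparison, here is a minimal sketch of the write step with no decode call at all. It rests on an assumption, not a confirmed diagnosis: that tag.text already returns unicode text, in which case Python 2's stream.decode(...) would first try an implicit ASCII encode and fail on characters such as ® or é. The helper names build_row and write_rows, the sample values, and the demo file name are hypothetical and not part of my real script.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def build_row(store_number, name_and_address, cells):
    """Join already-decoded text cells into one comma-separated line."""
    # Commas inside a cell become semicolons, as in the excerpt above.
    cleaned = [cell.replace(u",", u";").strip() for cell in cells]
    return u",".join([u"%d" % store_number, name_and_address] + cleaned)


def write_rows(path, rows):
    """Write text rows as UTF-8; the file object does the encoding."""
    with io.open(path, mode="w", encoding="utf-8") as out:
        for row in rows:
            out.write(row + u"\n")


# Hypothetical sample data standing in for the scraped table cells.
row = build_row(7, u"Branch 7", [u"Burger King\u00ae, Hans Caf\u00e9", u"Food"])
path = os.path.join(tempfile.mkdtemp(), "alldetails_demo.txt")
write_rows(path, [row])

# Read the file back to confirm the characters survived the round trip.
with io.open(path, encoding="utf-8") as check:
    written = check.read()
```

The idea is to keep everything as text until the very end and let the file object handle the single encode to UTF-8, rather than calling encode or decode by hand on strings that are already unicode.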