Python - BeautifulSoup HTML parsing handles GBK encoding poorly - Chinese web scraping problem

Question

I have been tinkering with the following script:

#    -*- coding: utf8 -*-
import codecs
from BeautifulSoup import BeautifulSoup, NavigableString, UnicodeDammit
import urllib2,sys
import time
try:
    import timeoutsocket # http://www.timo-tasi.org/python/timeoutsocket.py
    timeoutsocket.setDefaultSocketTimeout(10)
except ImportError:
    pass

h=u'\u3000\u3000\u4fe1\u606f\u901a\u4fe1\u6280\u672f'

address=urllib2.urlopen('http://stock.eastmoney.com/news/1408,20101022101395594.html').read()
soup=BeautifulSoup(address)

p=soup.findAll('p')
t=p[2].string[:10]

with the following output:

>>> print t

¡¡¡¡ÐÅÏ¢Í¨

>>> print h

　　信息通

>>> t

u'\xa1\xa1\xa1\xa1\xd0\xc5\xcf\xa2\xcd\xa8'

>>> h

u'\u3000\u3000\u4fe1\u606f\u901a'

>>> h.encode('gbk')

'\xa1\xa1\xa1\xa1\xd0\xc5\xcf\xa2\xcd\xa8'

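Simply put: when I pass this HTML through BeautifulSoup, it takes the GBK-encoded text and treats it as if it were already Unicode, not recognizing that it needs to be decoded first. Yet "h" and "t" should be the same, since h is just the text from the HTML file that I converted by hand.

How do I solve this problem?

Best,

Wheaton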

score 5 · Accepted Answer

The file's meta tag claims that the character set is GB2312, but the data contains a character from the newer GBK/GB18030, and that is what trips BeautifulSoup up:

simon@lucifer:~$ python
Python 2.7 (r27:82508, Jul  3 2010, 21:12:11)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> data = urllib2.urlopen('http://stock.eastmoney.com/news/1408,20101022101395594.html').read()
>>> data.decode("gb2312")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 20148-20149: illegal multibyte sequence
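
The same gap between the two character sets can be checked offline. The sketch below is written for Python 3 (unlike the Python 2 session above) and assumes the offending character is 噁 (U+5641), as in 二噁英; the helper name `can_encode` is made up for illustration.

```python
# Python 3 sketch: which GB-family codecs can represent a given character?
# Assumption: the character in question is U+5641, as in the word 二噁英.

def can_encode(text, codec):
    """Return True if `text` is representable in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

ch = "\u5641"
support = {codec: can_encode(ch, codec) for codec in ("gb2312", "gbk", "gb18030")}
print(support)
```

A page that declares GB2312 but contains such a character cannot be decoded strictly as GB2312, which is consistent with the traceback above.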

At this point UnicodeDammit bails out and tries chardet, then UTF-8, and finally Windows-1252, which always succeeds. By the looks of it, that is what you got.
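
The effect of that last fallback can be reproduced without BeautifulSoup. This Python 3 sketch assumes the fallback behaves like a plain `cp1252` decode for these particular bytes; the variable names are illustrative.

```python
# Python 3 sketch of the mojibake described above.
h = "\u3000\u3000\u4fe1\u606f\u901a"   # two ideographic spaces + 信息通

raw = h.encode("gbk")                  # the bytes the server actually sent
t = raw.decode("cp1252")               # wrong guess: one character per byte

print(repr(raw))
print(ascii(t))                        # escaped form, safe on any terminal

# As long as no byte was dropped, the wrong decode can be undone.
repaired = t.encode("cp1252").decode("gbk")
print(repaired == h)
```

Each GBK character occupies two bytes here, so the mis-decoded string has ten characters where the real text has five.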

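If we tell the decoder to replace unrecognised characters with a replacement marker (U+FFFD when decoding), we can see which character is missing from GB2312: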

>>> print data[20140:20160].decode("gb2312", "replace")
毒尾气二�英的难排放

Using the correct encoding:

>>> print data[20140:20160].decode("gb18030", "replace")
毒尾气二噁英的难排放
>>> from BeautifulSoup import BeautifulSoup
>>> s = BeautifulSoup(data, fromEncoding="gb18030")
>>> print s.findAll("p")[2].string[:10]
　　信息通信技术是&

Also:

>>> print s.findAll("p")[2].string
　　通信“十二五”重点发展方向，行业内有信息规划的技术潜点
力，拓远高于GDP。软件智能家居、智能家居、智能家居、智能家居、智能家居、智能家居、智能家居、智能家居、智能家居
移动、移动游戏、网络视频等均存在移动网络办公的潜在需求，使信息更新更进一步
高增长。
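
One way to guard against pages that declare GB2312 but use GBK or GB18030 characters is to decode with the superset codec before parsing. A minimal Python 3 sketch, using only the standard library; `decode_declared` and `SUPERSETS` are hypothetical names, not part of BeautifulSoup:

```python
import codecs

# Map a declared charset to a superset codec that decodes the same bytes
# plus the characters added later.
SUPERSETS = {"gb2312": "gb18030", "gbk": "gb18030"}

def decode_declared(raw, declared):
    """Decode `raw` using the declared charset, widened to its superset."""
    name = codecs.lookup(declared).name      # normalise aliases and case
    return raw.decode(SUPERSETS.get(name, name))

page = "二\u5641英".encode("gbk")             # bytes a "GB2312" page might contain
text = decode_declared(page, "GB2312")
print(ascii(text))                           # escaped form, safe on any terminal
```

The resulting string can then be handed to the parser as already-decoded text, so no encoding guess is needed.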
