python - BeautifulSoup fails to parse HTML with `html5lib`

Question

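BeautifulSoup fails to parse an HTML page with the `html5lib` option, but works fine with the `html.parser` option. According to the docs, `html5lib` should be more lenient than `html.parser`, so why do I get garbled text when I use it to parse an HTML page?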

Below is a small runnable example. (After changing `html5lib` to `html.parser`, the Chinese output is correct.)

#_*_coding:utf-8_*_
import requests
from bs4 import BeautifulSoup

ss = requests.Session()
res = ss.get("http://tech.qq.com/a/20151225/050487.htm")
html = res.content.decode("GBK").encode("utf-8")
soup = BeautifulSoup(html, 'html5lib')
print str(soup)[0:800]  # where you can see if the html is parsed normally or not

score 1 · Accepted Answer

Don't re-encode your content. Leave the decoding to BeautifulSoup:

soup = BeautifulSoup(res.content, 'html5lib')

If you do want to re-encode, you need to replace the `meta` header present in the source:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
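A minimal sketch of that replacement, written for Python 3 with only the standard library. The sample page and the helper name `reencode_as_utf8` are assumptions made for illustration; they stand in for `res.content` and are not part of the original answer.

```python
import re

# Hypothetical stand-in for res.content: GBK bytes of a page that declares gb2312.
raw = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=gb2312"></head>'
    '<body><p>中文</p></body></html>'
).encode("gbk")


def reencode_as_utf8(raw_bytes, source_encoding="gbk"):
    """Decode the page, rewrite its first charset declaration, re-encode as UTF-8."""
    text = raw_bytes.decode(source_encoding)
    # Swap only the charset value so the declaration matches the new bytes.
    text = re.sub(
        r'(charset\s*=\s*["\']?)[-\w]+',
        r'\g<1>utf-8',
        text,
        count=1,
        flags=re.IGNORECASE,
    )
    return text.encode("utf-8")


utf8_html = reencode_as_utf8(raw)
print(utf8_html.decode("utf-8"))
```

The resulting bytes and their `meta` declaration now agree, so they can be passed to `BeautifulSoup(utf8_html, 'html5lib')` in place of the re-encoded content from the question.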

Or decode manually and pass in Unicode:

soup = BeautifulSoup(res.content.decode('gbk'), 'html5lib')
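The mismatch behind the garbled output can be reproduced without a network request or a parser. The sketch below (Python 3, standard library only) uses a made-up one-line page in place of `res.content`, and assumes the consumer trusts the declared charset when it is handed bytes.

```python
# Hypothetical stand-in for res.content: GBK bytes of a page that declares gb2312.
page = '<meta charset="gb2312"><p>中文</p>'
gbk_bytes = page.encode("gbk")

# What the question does: re-encode to UTF-8 while the markup still says gb2312.
utf8_bytes = gbk_bytes.decode("gbk").encode("utf-8")

# A consumer that believes the declaration reads those UTF-8 bytes as GBK.
garbled = utf8_bytes.decode("gbk", errors="replace")

# Decoding once, up front, yields text; no byte-level guessing remains.
decoded = gbk_bytes.decode("gbk")

print(garbled == page)  # the Chinese characters no longer match
print(decoded == page)  # the text survives intact
```

Passing already-decoded text avoids the problem because the parser receives characters rather than bytes, so the stale `meta` declaration has nothing left to misinterpret.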
