python - Beautiful Soup: text extraction from Common Crawl data takes too long

Question

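I have to parse HTML content from the Common Crawl dataset (warc.gz files). I decided to use the bs4 (BeautifulSoup) module because most people recommend it. Here is the code snippet that gets the text: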

from bs4 import BeautifulSoup

soup = BeautifulSoup(src, "lxml")
[x.extract() for x in soup.findAll(['script', 'style'])]
txt = soup.get_text().encode('utf8')

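Without bs4, one file is processed in 9 minutes (test case), but when I use bs4 to parse the text, the job takes about 4 hours. What is going on here? Is there a better solution than bs4? Note: bs4 is the package that provides the BeautifulSoup class, among other modules.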

score 1 · Accepted Answer

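The main time-consuming part here is the extraction of the tags in the list comprehension. With Python regular expressions or with lxml, you can do the following instead.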

import re

# DOTALL lets "." span newlines; IGNORECASE also matches <SCRIPT>
script_pat = re.compile(r'<script.*?</script>', re.DOTALL | re.IGNORECASE)

# to find all script tags
script_pat.findall(src)

# do your stuff
print(script_pat.sub('', src))
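Since the question asks for the text rather than the cleaned markup, the regex idea can be pushed one step further. The following is only a rough sketch using the standard library: the names `BLOCK_PAT`, `TAG_PAT`, and `rough_text` are hypothetical, and it assumes the script/style blocks are reasonably well-formed. Regexes are not a real HTML parser, so malformed pages may slip through.

```python
import re
from html import unescape

# Matches a whole <script>...</script> or <style>...</style> block.
# DOTALL lets "." span newlines; \1 forces the closing tag to match the opening one.
BLOCK_PAT = re.compile(r'<(script|style)\b.*?</\1\s*>', re.DOTALL | re.IGNORECASE)
# Matches any remaining tag such as <p> or </div>.
TAG_PAT = re.compile(r'<[^>]+>')

def rough_text(src):
    """Crude text extraction: drop script/style blocks, then all other tags."""
    no_blocks = BLOCK_PAT.sub(' ', src)
    no_tags = TAG_PAT.sub(' ', no_blocks)
    # Decode entities like &amp; and collapse runs of whitespace.
    return ' '.join(unescape(no_tags).split())

sample = ("<p>Hello &amp; welcome</p>\n"
          "<script>\nvar x = 1;\n</script><style>p{}</style><p>world</p>")
print(rough_text(sample))
```

This avoids building a parse tree at all, which is where the speed comes from, at the cost of robustness.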

With lxml you can do it like this:

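from lxml import html
et = html.fromstring(src)

# remove each <script> element together with its contents
# (drop_tag() would keep the script body as text; drop_tree() removes it)
for x in et.xpath('//script'):
    x.drop_tree()

# do your stuff
print(html.tostring(et, encoding='unicode'))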
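To get plain text out of lxml, as the question's `get_text()` call does, something along these lines may work. It is a minimal sketch, not a benchmarked replacement: the function name `extract_text` and the sample string are made up for illustration, and it assumes `src` is an HTML string that lxml can parse.

```python
from lxml import html

def extract_text(src):
    """Return the text of an HTML string, skipping script and style content."""
    tree = html.fromstring(src)
    # drop_tree() removes the element and everything inside it,
    # while keeping any text that follows the element.
    for node in tree.xpath('//script | //style'):
        node.drop_tree()
    # text_content() concatenates the text of the element and its descendants.
    return tree.text_content()

sample = ("<html><body><p>Hello</p><script>var x = 1;</script>"
          "<style>p { color: red; }</style><p>world</p></body></html>")
text = extract_text(sample)
print(text)
```

Calling `text.encode('utf8')` on the result gives bytes, matching the last line of the question's snippet.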
