python - Getting clean text from a text/html document with BeautifulSoup

Question

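I have a document that contains two content types: text/xml and text/html. I want to use BeautifulSoup to parse the document and end up with a clean text version. The document starts out as a tuple, so I have been using repr to turn it into something BeautifulSoup recognizes, and then using find_all to locate the text/html part of the document by searching for divs, like so: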

soup = BeautifulSoup(repr(msg_data))
text = soup.html.find_all("div")

Then I turn the result back into a string, save it to a variable, turn that back into a soup object, and call get_text on it, like so:

str_text = str(text)
soup_text = BeautifulSoup(str_text)
soup_text.get_text()

However, that changes the encoding to unicode, like so:

u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17     
PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do, 
9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while 
browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic, 
\xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives 
them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]'

When I try to re-encode it as UTF-8, like so:

soup.encode('utf-8')

I get the unparsed type back.

I want to save the clean text as a string so that I can then find specific things in it (for example, "puppies" in the text above).

Basically, I am running around in circles here. Can anyone help? As always, thanks so much for any help you can give.

score 3 · Accepted Answer

The encoding isn't broken; it's exactly what it should be. '\xa0' is the escape for U+00A0, the Unicode non-breaking space.
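
One way to check what '\xa0' is for yourself is the standard library's unicodedata module. This is only a small sketch; the names sample and nbsp_name are made up for illustration, and sample is a shortened piece of the text from the question.

```python
import unicodedata

# A short sample containing the same escape seen in the question's output
# (hypothetical, shortened from the original text).
sample = u'9:16 PM\xa0Erica'

# Look up the official Unicode name of the character.
nbsp_name = unicodedata.name(u'\xa0')
print(nbsp_name)

# The character at index 7 is the non-breaking space; Python counts it
# as whitespace even though it is not an ASCII space.
print(sample[7].isspace())
```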

If you want to encode this (Unicode) string as ASCII, you can tell the codec to ignore any characters it doesn't understand:

>>> x = u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17 PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do,  9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while  browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic,  \xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives  them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]'
>>> x.encode('ascii', 'ignore')
'[9:16 PMErica: with images, and that seemed long to me anyway, 9:17 PMme: yeah, Erica: so feel free to make it shorter, or rather, please do,  9:18 PMnobody wants to read about that shit for 2 pages, me: :), Erica: while  browsing their site, me: srsly, Erica: unless of course your writing is magic,  me: My writing saves drowning puppies, Just plucks him right out and gives  them a scratch behind the ears and some kibble, Erica: Maine is weird, me: haha]'
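
Note that 'ignore' drops the non-breaking spaces altogether, which is why "PM" and "Erica" run together in the output above. If you would rather keep the words separated, one alternative (not part of the original answer, just a sketch) is to replace each '\xa0' with an ordinary space before searching the text. The names text and clean are hypothetical, and text is a shortened sample of the string from the question.

```python
# Hypothetical, shortened sample of the text returned by get_text().
text = u'9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway'

# Swap each non-breaking space for a plain space, then collapse any
# runs of whitespace into a single space.
clean = u' '.join(text.replace(u'\xa0', u' ').split())

print(clean)

# The cleaned string can now be searched like any other string.
print(u'images' in clean)
```

The result is still a Unicode string; call clean.encode('utf-8') only at the point where you actually need bytes, for example when writing to a file.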

If you have time, you should watch Ned Batchelder's recent video Pragmatic Unicode. It'll make everything clear and simple!
