python - Why does the HTML source change when extracting text from a soup object?

Translated from: https://stackoverflow.com/questions/58416197 2019-10-16T14:52:33.687

Viewed 99 times

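I am trying to use Selenium and BeautifulSoup in Python to scrape news articles from the results of a search term. I have reached the final page, which contains the text, using the following: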

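import requests
from bs4 import BeautifulSoup

# articles.link_of_article is defined elsewhere (not shown in the question)
article_page = requests.get(articles.link_of_article[0])
article_soup = BeautifulSoup(article_page.text, "html.parser")
for content in article_soup.find_all("div", {"class": "name_of_class_with_contained_text"}):
    print(content.get_text())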

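I noticed that the class "name_of_class_with_contained_text" is present when I visually inspect the source in the browser, but it does not exist in the soup object. In addition, all the "p" tags are replaced with the following code: "\\u003c/p\\u003e\\u003cp\\u003e \\u003c/p\\u003e\\u003cp\\u003e".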

I cannot find a class name or tag that gives me the contained text. Any help or reasoning about why this happens would be appreciated.
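
A possible explanation, offered as an assumption rather than something confirmed for this page: requests returns only the raw HTML and does not run JavaScript, so markup that the browser builds from a script block never appears as real tags in the soup. In that case the article body may sit inside a JavaScript or JSON string, where < and > are written as \u003c and \u003e. The sketch below uses a made-up sample string (raw_fragment and both helper names are hypothetical) and only the standard library to show how such escapes could be decoded back into HTML.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample, NOT taken from the real page: a fragment as it might
# appear inside a <script> block. The raw string keeps the backslashes literal.
raw_fragment = (
    r"\u003cp\u003eFirst paragraph.\u003c/p\u003e"
    r"\u003cp\u003e \u003c/p\u003e"
    r"\u003cp\u003eSecond paragraph.\u003c/p\u003e"
)


def decode_js_escapes(fragment: str) -> str:
    """Decode \\uXXXX escapes by parsing the fragment as a JSON string literal.

    Assumes the fragment has no unescaped double quotes or raw newlines.
    """
    return json.loads('"' + fragment + '"')


class TextCollector(HTMLParser):
    """Collect the non-blank text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html: str) -> list:
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.chunks


decoded_html = decode_js_escapes(raw_fragment)
paragraphs = extract_text(decoded_html)
print(decoded_html)
print(paragraphs)
```

If this assumption holds, the decoded string could be passed to BeautifulSoup(decoded_html, "html.parser") in place of article_page.text, after locating the script block that contains it.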

PS: I am relatively new to scraping and HTML.

Update: adding the link to the final page here.

https://www.fundfire.com/c/2258443/277443?referrer_module=searchSubFromFF&highlight=eileen%20neill%20verus
