javascript - Using Beautiful Soup to extract data/links from a Google search

Question

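Evening, folks,

I'm trying to ask Google a question and pull all the relevant links from its respective search results (i.e. I search "site:wikipedia.com Thomas Jefferson" and it gives me wiki.com/jeff, wiki.com/tom, etc.).

Here's my code: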

from bs4 import BeautifulSoup
from urllib2 import urlopen

query = 'Thomas Jefferson'

query.replace (" ", "+")
#replaces whitespace with a plus sign for Google compatibility purposes

soup = BeautifulSoup(urlopen("https://www.google.com/?gws_rd=ssl#q=site:wikipedia.com+" + query), "html.parser")
#creates soup and opens URL for Google. Begins search with site:wikipedia.com so only wikipedia
#links show up. Uses html parser.

for item in soup.find_all('h3', attrs={'class' : 'r'}):
    print item.string
#Guides BS to h3 class "r" where green Wikipedia URLs are located, then prints URLs
#Limiter code to only pull top 5 results

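The goal here is for me to set the query variable, have Python query Google, and have Beautiful Soup pull all the "green" links, if you will.

Here is a picture of a Google results page

I only want to pull the green links, in full. What's weird is that Google's source code is "hidden" (a symptom of their search architecture), so Beautiful Soup can't just go and pull an href from an h3 tag. I can see the h3 hrefs when I use Inspect Element, but not when I view the page source.

Here is a picture of Inspect Element

My question is: how do I pull the top 5 most relevant green links from Google via BeautifulSoup if I can't access their source code, only Inspect Element?

PS: To give an idea of what I'm trying to accomplish, I've found two Stack Overflow questions that are relatively close to mine:

beautiful soup extract a href from google search

How to collect data of Google Search with beautiful soup using python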

score 5 · Accepted Answer

When I tried searching with JavaScript disabled, I got a different URL than Rob M. did:

https://www.google.com/search?q=site:wikipedia.com+Thomas+Jefferson&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw

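To make this work with any query, you should first make sure that your query has no spaces in it (that's why you'll get a 400: Bad Request). You can do this with urllib.quote_plus() (in Python 3 the same function is urllib.parse.quote_plus()):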

query = "Thomas Jefferson"
query = urllib.quote_plus(query)

This URL-encodes all of the spaces as plus signs, creating a valid URL.

However, this does not work with urllib: you get a 403: Forbidden. I got it to work by using the python-requests module like this:
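
As a minimal Python 3 sketch of the encoding step (the code in this answer targets Python 2, so the import path below differs from it), this also shows why the `query.replace(" ", "+")` line in the question has no effect: `str.replace` returns a new string and the result is never assigned. The variable names `unchanged`, `encoded` and `search_url` are only illustrative.

```python
from urllib.parse import quote_plus

query = "Thomas Jefferson"

# str.replace returns a new string; without assigning the result,
# the original query is left untouched.
query.replace(" ", "+")
unchanged = query

# quote_plus turns spaces into "+" and percent-encodes other reserved characters.
encoded = quote_plus(query)

search_url = "https://www.google.com/search?q=site:wikipedia.com+" + encoded
print(unchanged)
print(encoded)
print(search_url)
```

Nothing is requested here; the snippet only builds the URL string.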

import requests
import urllib
from bs4 import BeautifulSoup

query = 'Thomas Jefferson'
query = urllib.quote_plus(query)

r = requests.get('https://www.google.com/search?q=site:wikipedia.com+{}&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw'.format(query))
soup = BeautifulSoup(r.text, "html.parser")
#Fetches the Google results page and builds the soup. The search begins with site:wikipedia.com
#so only wikipedia links show up. Uses html parser.

links = []
for item in soup.find_all('h3', attrs={'class' : 'r'}):
    links.append(item.a['href'][7:]) # [7:] strips the /url?q= prefix
#Guides BS to h3 class "r" where the result links are located, then collects each href.
#No limiter is applied here; slice with links[:5] to keep only the top 5 results.

Printing links gives:

print links
#  [u'http://en.wikipedia.com/wiki/Thomas_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggUMAA&usg=AFQjCNG6INz_xj_-p7mpoirb4UqyfGxdWA',
#   u'http://www.wikipedia.com/wiki/Jefferson%25E2%2580%2593Hemings_controversy&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggeMAE&usg=AFQjCNEjCPY-HCdfHoIa60s2DwBU1ffSPg',
#   u'http://en.wikipedia.com/wiki/Sally_Hemings&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggjMAI&usg=AFQjCNGxy4i7AFsup0yPzw9xQq-wD9mtCw',
#   u'http://en.wikipedia.com/wiki/Monticello&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggoMAM&usg=AFQjCNE4YlDpcIUqJRGghuSC43TkG-917g',
#   u'http://en.wikipedia.com/wiki/Thomas_Jefferson_University&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggtMAQ&usg=AFQjCNEDuLjZwImk1G1OnNEnRhtJMvr44g',
#   u'http://www.wikipedia.com/wiki/Jane_Randolph_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggyMAU&usg=AFQjCNHmXJMI0k4Bf6j3b7QdJffKk97tAw',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1800&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg3MAY&usg=AFQjCNEqsc9jDsDetf0reFep9L9CnlorBA',
#   u'http://en.wikipedia.com/wiki/Isaac_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg8MAc&usg=AFQjCNHKAAgylhRjxbxEva5IvDA_UnVrTQ',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1796&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghBMAg&usg=AFQjCNHviErFQEKbDlcnDZrqmxGuiBG9XA',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1804&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghGMAk&usg=AFQjCNEJZSxCuXE_Dzm_kw3U7hYkH7OtlQ']
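
The links above still carry Google's trailing `&sa=...&ved=...&usg=...` parameters, because `[7:]` only removes the `/url?q=` prefix. One possible way to get the bare destination and keep only the top 5 is to parse the query string instead of slicing. This is a sketch in Python 3; `extract_target` is a hypothetical helper, and the `ved`/`usg` values in the sample hrefs are shortened placeholders, not real values.

```python
from urllib.parse import urlsplit, parse_qs

def extract_target(href):
    """Return the destination URL from a '/url?q=...' style redirect href.

    Hrefs without a 'q' parameter are returned unchanged.
    """
    params = parse_qs(urlsplit(href).query)
    return params["q"][0] if "q" in params else href

# Sample hrefs in the shape shown above, with placeholder tracking values.
hrefs = [
    "/url?q=http://en.wikipedia.com/wiki/Thomas_Jefferson&sa=U&ved=abc&usg=xyz",
    "/url?q=http://en.wikipedia.com/wiki/Monticello&sa=U&ved=def&usg=uvw",
]

# Clean every href, then keep at most the first five results.
top_links = [extract_target(h) for h in hrefs][:5]
print(top_links)
```

Note that `parse_qs` also percent-decodes the value once, so a doubly encoded link such as the `Jefferson%25E2%2580%2593Hemings_controversy` entry comes back with single encoding.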

score 4 · Accepted Answer

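Actually, there's no need to disable JavaScript. It could be because you need to specify a user-agent so that the request looks like a visit from a "real" user.

When no user-agent is specified while using the requests library, it defaults to python-requests, so Google or other search engines can tell that it's a bot/script and might block the request. The HTML you receive will then contain some sort of error page with different elements, which is why you were getting an empty result.

Check what your user-agent is, or see a list of user-agents.

Code and full example in the online IDE: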
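
The same kind of default identification exists in the standard library, which can be inspected without sending anything. The sketch below uses `urllib.request` only so that it runs offline; it is an illustration of the idea, not the method used in this answer, and the variable names are made up for the example.

```python
import urllib.request

# The default opener identifies itself as "Python-urllib/<version>".
default_ua = dict(urllib.request.build_opener().addheaders)["User-agent"]
print(default_ua)

# Build (but do not send) a request that carries a browser-like User-Agent.
browser_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36"
)
req = urllib.request.Request(
    "https://www.google.com/search?q=site:wikipedia.com+thomas+edison",
    headers={"User-Agent": browser_ua},
)

# urllib stores header names in capitalized form, hence "User-agent".
sent_ua = req.get_header("User-agent")
print(sent_ua)
```

With requests, the equivalent is passing a `headers` dictionary to `requests.get`.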

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get( 'https://www.google.com/search?q=site:wikipedia.com thomas edison', headers=headers).text
soup = BeautifulSoup(response, 'lxml')

for links in soup.find_all('div', class_='yuRUbf'):
    link = links.a['href']
    print(link)

# or using select() method which accepts CSS selectors

for links in soup.select('.yuRUbf a'):
    link = links['href']
    print(link)

Output:

https://en.wikipedia.com/wiki/Edison,_New_Jersey
https://en.wikipedia.com/wiki/Motion_Picture_Patents_Company
https://www.wikipedia.com/wiki/Thomas_E._Murray
https://en.wikipedia.com/wiki/Incandescent_light_bulb
https://en.wikipedia.com/wiki/Phonograph_cylinder
https://en.wikipedia.com/wiki/Emile_Berliner
https://wikipedia.com/wiki/Consolidated_Edison
https://www.wikipedia.com/wiki/hello
https://www.wikipedia.com/wiki/Tom%20Alston
https://en.wikipedia.com/wiki/Edison_screw
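
Since the question asked for only the top 5 links, here is a small offline sketch of the same CSS selector with a limit applied. The HTML string is a hypothetical, heavily simplified stand-in for a results page; the `yuRUbf` class name is taken from the code above and may change whenever Google changes its markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; a real results page is much larger.
html = """
<div class="yuRUbf"><a href="https://en.wikipedia.com/wiki/Edison,_New_Jersey"><h3>1</h3></a></div>
<div class="yuRUbf"><a href="https://en.wikipedia.com/wiki/Motion_Picture_Patents_Company"><h3>2</h3></a></div>
<div class="yuRUbf"><a href="https://www.wikipedia.com/wiki/Thomas_E._Murray"><h3>3</h3></a></div>
<div class="yuRUbf"><a href="https://en.wikipedia.com/wiki/Incandescent_light_bulb"><h3>4</h3></a></div>
<div class="yuRUbf"><a href="https://en.wikipedia.com/wiki/Phonograph_cylinder"><h3>5</h3></a></div>
<div class="yuRUbf"><a href="https://en.wikipedia.com/wiki/Emile_Berliner"><h3>6</h3></a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select anchors inside the result containers, then keep at most five hrefs.
top_links = [a["href"] for a in soup.select(".yuRUbf a[href]")][:5]

for link in top_links:
    print(link)
```

If the selector matches nothing on a live page, the list is simply empty, which is usually a sign that the request was blocked or the class name has changed.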

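Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.

The difference is that you don't have to figure out which HTML elements to scrape to extract the data, work out how to bypass blocks from Google or other search engines, or maintain the scraper over time (if something in the HTML changes).

Example code to integrate: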

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "site:wikipedia.com thomas edison",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(f"Link: {result['link']}")

Output:

Link: https://en.wikipedia.com/wiki/Edison,_New_Jersey
Link: https://en.wikipedia.com/wiki/Motion_Picture_Patents_Company
Link: https://www.wikipedia.com/wiki/Thomas_E._Murray
Link: https://en.wikipedia.com/wiki/Incandescent_light_bulb
Link: https://en.wikipedia.com/wiki/Phonograph_cylinder
Link: https://en.wikipedia.com/wiki/Emile_Berliner
Link: https://wikipedia.com/wiki/Consolidated_Edison
Link: https://www.wikipedia.com/wiki/hello
Link: https://www.wikipedia.com/wiki/Tom%20Alston
Link: https://en.wikipedia.com/wiki/Edison_screw

Disclaimer: I work for SerpApi.

P.S. I have a dedicated web scraping blog.

score 0 · Accepted Answer

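This won't work with the hash search (#q=site:wikipedia.com, like you have) because that loads the data in via AJAX rather than serving you the full parseable HTML with the results. You should use this instead: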

soup = BeautifulSoup(urlopen("https://www.google.com/search?gbv=1&q=site:wikipedia.com+" + query), "html.parser")

For reference, I disabled JavaScript and performed a Google search to get this URL structure.
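
To see why the hash form carries no search terms to the server, it may help to split both URLs. Everything after `#` is the URL fragment, which the client keeps to itself and does not send in the HTTP request, so the server only ever sees `gws_rd=ssl`. This is a Python 3 sketch using `urllib.parse`; the variable names are illustrative.

```python
from urllib.parse import urlsplit

hash_url = "https://www.google.com/?gws_rd=ssl#q=site:wikipedia.com+Thomas+Jefferson"
search_url = "https://www.google.com/search?gbv=1&q=site:wikipedia.com+Thomas+Jefferson"

hash_parts = urlsplit(hash_url)
search_parts = urlsplit(search_url)

# In the hash form the search terms sit in the fragment, not the query string.
print("hash form   query:", hash_parts.query, "| fragment:", hash_parts.fragment)

# In the /search form the terms are part of the query string sent to the server.
print("search form query:", search_parts.query, "| fragment:", search_parts.fragment)
```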
