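Actually, there's no need to disable JavaScript. The likely cause is that you need to specify a user-agent so the request looks like a visit from a "real" user.
When no user-agent is specified while using the requests library, it defaults to python-requests, so Google or another search engine recognizes the request as coming from a bot/script and may block it. The HTML you receive then contains some kind of error page with different elements, which is why you get an empty result.
Check what your user-agent is, or see a list of user-agents.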
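To see the default for yourself, here is a minimal sketch that inspects the headers requests would send, without making any network call. It assumes only that the requests package is installed; the shortened browser-like string is a hypothetical value for illustration, not a recommended user-agent.

```python
import requests

# Default User-Agent that requests sends when none is specified.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # starts with "python-requests/", followed by the installed version

# Prepare requests without sending them, to inspect the outgoing headers.
session = requests.Session()
url = "https://www.google.com/search"

plain = session.prepare_request(requests.Request("GET", url))
print(plain.headers["User-Agent"])  # same as default_ua

# Hypothetical browser-like value, shortened here for illustration only.
custom_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
custom = session.prepare_request(requests.Request("GET", url, headers=custom_headers))
print(custom.headers["User-Agent"])  # the custom value replaces the default
```

Preparing a request instead of sending it keeps the check offline, so it can be run without touching the search engine at all.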
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests

# A browser-like User-Agent so the request is not sent as python-requests
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# Passing the query via params lets requests URL-encode the space and colon
params = {"q": "site:wikipedia.com thomas edison"}
response = requests.get("https://www.google.com/search", params=params, headers=headers).text
soup = BeautifulSoup(response, "lxml")  # the "lxml" parser needs the lxml package installed

for result in soup.find_all("div", class_="yuRUbf"):
    link = result.a["href"]
    print(link)

# or using select() method which accepts CSS selectors
for result in soup.select(".yuRUbf a"):
    link = result["href"]
    print(link)
Output:
https://en.wikipedia.com/wiki/Edison,_New_Jersey
https://en.wikipedia.com/wiki/Motion_Picture_Patents_Company
https://www.wikipedia.com/wiki/Thomas_E._Murray
https://en.wikipedia.com/wiki/Incandescent_light_bulb
https://en.wikipedia.com/wiki/Phonograph_cylinder
https://en.wikipedia.com/wiki/Emile_Berliner
https://wikipedia.com/wiki/Consolidated_Edison
https://www.wikipedia.com/wiki/hello
https://www.wikipedia.com/wiki/Tom%20Alston
https://en.wikipedia.com/wiki/Edison_screw
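If the request is blocked, the `.yuRUbf` selector simply matches nothing, and the loop prints nothing. The sketch below illustrates that on two hand-written HTML strings. They are stand-ins, not real Google responses, and `extract_result_links` is a hypothetical helper name. It uses the built-in `html.parser` so that only bs4 is required.

```python
from bs4 import BeautifulSoup


def extract_result_links(html):
    """Return hrefs found under the .yuRUbf containers, or an empty list."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select(".yuRUbf a[href]")]


# Hand-written stand-ins for a results page and a block/error page.
results_page = (
    '<div class="yuRUbf">'
    '<a href="https://en.wikipedia.com/wiki/Edison_screw">Edison screw</a>'
    "</div>"
)
blocked_page = '<div id="error"><p>Unusual traffic detected.</p></div>'

print(extract_result_links(results_page))  # ['https://en.wikipedia.com/wiki/Edison_screw']
print(extract_result_links(blocked_page))  # []
```

An empty list is therefore a useful signal to inspect the raw HTML before assuming the selector itself is wrong.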
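Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to figure out which HTML elements to scrape to extract the data, work out how to bypass blocks from Google or other search engines, or maintain the parser over time (if something in the HTML changes).
Example code to integrate: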
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "site:wikipedia.com thomas edison",
    "api_key": os.getenv("API_KEY"),  # API key read from the API_KEY environment variable
}

search = GoogleSearch(params)
results = search.get_dict()

# Print the link of every organic (non-ad) result
for result in results["organic_results"]:
    print(f"Link: {result['link']}")
Output:
Link: https://en.wikipedia.com/wiki/Edison,_New_Jersey
Link: https://en.wikipedia.com/wiki/Motion_Picture_Patents_Company
Link: https://www.wikipedia.com/wiki/Thomas_E._Murray
Link: https://en.wikipedia.com/wiki/Incandescent_light_bulb
Link: https://en.wikipedia.com/wiki/Phonograph_cylinder
Link: https://en.wikipedia.com/wiki/Emile_Berliner
Link: https://wikipedia.com/wiki/Consolidated_Edison
Link: https://www.wikipedia.com/wiki/hello
Link: https://www.wikipedia.com/wiki/Tom%20Alston
Link: https://en.wikipedia.com/wiki/Edison_screw
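Indexing `results["organic_results"]` directly raises a `KeyError` if that key is missing. A slightly more defensive sketch is shown below, run on a hand-written sample dict rather than a real API response. It assumes that a failed request is reported under an `"error"` key, which is worth confirming against the SerpApi documentation. `organic_links` is a hypothetical helper name.

```python
def organic_links(results):
    """Collect links from a SerpApi-style result dict, tolerating missing keys."""
    # Assumption: failures are reported under an "error" key.
    if "error" in results:
        raise RuntimeError(f"API returned an error: {results['error']}")
    # Skip entries without a "link" and treat a missing list as empty.
    return [
        item["link"]
        for item in results.get("organic_results", [])
        if "link" in item
    ]


# Hand-written sample data, not a real response.
sample = {
    "organic_results": [
        {"link": "https://en.wikipedia.com/wiki/Edison_screw"},
        {"title": "entry without a link"},
    ]
}

print(organic_links(sample))  # ['https://en.wikipedia.com/wiki/Edison_screw']
print(organic_links({}))      # []
```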
Disclaimer: I work for SerpApi.
P.S. I have a dedicated web scraping blog.