1

我是一名学习生物的非 cs 学生,但我正在研究 python 数据科学,目的是为了网页抓取 Google Scholar。我创建了一个最初可以工作的程序,但不知何故它随机停止工作并给了我一个错误值。我认为这可能与谷歌严格限制机器人搜索他们的网站有关。任何建议和补救措施都会有所帮助!我正在使用 Jupyter Notebook ipython 和 Python3。

代码:

import pip    
def install(package):
    pip.main(['install', package])

install('BeautifulSoup4')

from bs4 import BeautifulSoup
import urllib.request
from urllib.request import FancyURLopener

class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

def page_citations(x):
    #number of pages of google searches that you want to run

    query = input()
    query = str(query)
    opener = AppURLopener()
    m = 0
    q = 0
    l = make_array()
    while m < x:
        response = 
        opener.open('https://scholar.google.com/scholar?
        start='+str(q)+'&q=' + query + '&hl=en&as_sdt=0,5').read()
        soup = BeautifulSoup(response, 'html.parser')
        for word in str(soup.find_all(class_ = "gs_fl")).split():
            if word.endswith(''+ '</a>'): 
                l = np.append(l, word.strip('</a>'))
        q = q + 10
        m = m + 1
    n = make_array()

    for number in l:
        try:
            number = int(number)
            n = np.append(n, number)
        except: continue

    return n

错误:ValueError:读取已关闭文件

4

0 回答 0