
I'm scraping a series of URLs. The code runs, but Scrapy does not parse the URLs in order. For example, although I am trying to parse url1, url2, ..., url100, Scrapy parses url2, url10, url1, and so on.

It parses all of the URLs, but when a particular URL does not exist (e.g. example.com/unit.aspx?b_id=10), Firefox shows me the result of my previous request. Since I want to make sure there are no duplicates, I need the loop to parse the URLs in order rather than "at will".

I tried "for n in range(1,101)" and "while bID<100"; the result is the same. (see below)

Thanks in advance!

def check_login_response(self, response):
    """Check the response returned by a login request to see if we are
    successfully logged in.
    """
    if "Welcome!" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        print "Successfully logged in. Let's start crawling!"
        # Now the crawling can begin..
        self.initialized()
        bID=0
        #for n in range(1,100,1):
        while bID<100:
            bID=bID+1
            startURL='https://www.example.com/units.aspx?b_id=%d' % (bID)
            request=Request(url=startURL ,dont_filter=True,callback=self.parse_add_tables,meta={'bID':bID,'metaItems':[]})
            # print self.metabID
            yield request #Request(url=startURL ,dont_filter=True,callback=self.parse2)
    else:
        self.log("Something went wrong, we couldn't log in....Bad times :(")
        # Something went wrong, we couldn't log in, so nothing happens.

2 Answers


You can use the priority attribute on the Request object. Scrapy guarantees the URLs are crawled in DFO by default, but it does not ensure that the URLs are visited in the order they were yielded within your parse callback.
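
For illustration, here is a minimal sketch of that priority approach, assuming the same b_id URL pattern and parse_add_tables callback from the question (untested against the rest of that spider): inside check_login_response, yield all the requests with a decreasing priority value so the scheduler dispatches b_id=1 first.

# Higher priority values are scheduled earlier, so give the lowest b_id the highest priority.
# Note this controls scheduling order, not the strict completion order of the responses.
for bID in range(1, 101):
    startURL = 'https://www.example.com/units.aspx?b_id=%d' % bID
    yield Request(url=startURL,
                  dont_filter=True,
                  priority=100 - bID,   # b_id=1 gets the highest priority
                  callback=self.parse_add_tables,
                  meta={'bID': bID, 'metaItems': []})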

Alternatively, instead of yielding Request objects one by one, you can keep an array of Requests and pop objects off it until it is empty (this is essentially what the second answer below does).

For more information you can look here:

Crawl URLs in order

Answered 2013-02-02T06:45:40.647

You could try something like this. I can't be sure it fits the purpose, since I haven't seen the rest of the spider code, but here you go:

# create a list of urls to be parsed, in reverse order (so we can easily pop items off)
crawl_urls = ['https://www.example.com/units.aspx?b_id=%s' % n for n in xrange(100,1,-1)]

def check_login_response(self, response):
    """Check the response returned by a login request to see if we are successfully logged in.
    """
    if "Welcome!" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        print "Successfully logged in. Let's start crawling!"
        # Now the crawling can begin..
        self.initialized()
        return Request(url='https://www.example.com/units.aspx?b_id=1',dont_filter=True,callback=self.parse_add_tables,meta={'bID':1,'metaItems':[]})
    else:
        self.log("Something went wrong, we couldn't log in....Bad times :(")
        # Something went wrong, we couldn't log in, so nothing happens.

def parse_add_tables(self, response):
    # parsing code here
    if self.crawl_urls:
        next_url = self.crawl_urls.pop()
        # parse the b_id back out of the query string (next_url[-1:] would only grab the last digit)
        return Request(url=next_url,dont_filter=True,callback=self.parse_add_tables,meta={'bID':int(next_url.split('=')[-1]),'metaItems':[]})

    return items
Answered 2013-02-02T17:59:04.057