Can you tell me what's wrong with my Twisted ways? For a while now I've been trying to build a fast web scraper. Building a conventional threaded scraper with Queue was a piece of cake and, so far, it has been faster. Still, I want to compare it against Twisted! The web scraper's goal is to recursively find image (<a>) links from a gallery and connect to those links to scrape the images (<img>) and/or collect more image links for later parsing. The code is shown below. Most of the functions pass around a dictionary so that all of the information about each link is packaged together more conceptually. I thread the otherwise-blocking code (the parsePage function) and use "asynchronous code" (or so I believe) to retrieve the html pages, header information, and images.

My main problem so far is a slew of "User timeout caused connection failure" errors traced back from my getLinkHTML or getImgHeader errbacks. I've tried limiting the number of connections made with a semaphore, and even put some of my code to sleep, to no avail, thinking I was flooding the connections. I also thought the problem might stem from reactor.connectTCP, since the timeout errors show up about 30 seconds after starting the scraper and connectTCP has a 30 second timeout. However, after modifying the connectTCP code in the twisted module to 60 seconds, the timeout errors still appeared roughly 30 seconds in. Of course, scraping the same sites with my conventional threaded scraper works fine, and much faster.

So what am I doing wrong? Also, since I'm self-taught, feel free to criticize my code in general; I've also left a few random questions throughout the code. Any advice is greatly appreciated!

from twisted.internet import defer
from twisted.internet import reactor
from twisted.web import client
from lxml import html
from StringIO import StringIO
from os import path
import re

start_url = "http://www.thesupermodelsgallery.com/"
directory = "/home/z0e/Pictures/Pix/Twisted"
min_img_size = 100000

#maximum <a> links to get from main gallery
max_gallery_links = 500

#maximum <a> links to get from subsequent gallery/pages
max_picture_links = 35

def parsePage(info):

    def linkFilter(link):
        # filter unwanted <a> links
        if link is not None:
            trade_match = re.search(r'&trade=', link)
            href_split = link.split('=')
            for i in range(len(href_split)):
                if 'www' in href_split[i] and i > 0:
                    link = href_split[i]
            end_pattern = r'\.(com|com/|net|net/|pro|pro/)$'
            end_match = re.search(end_pattern, link)
            p_pattern = r'(.*)&p'
            p_match = re.search(p_pattern, link)
            if end_match or trade_match:
                return None
            elif p_match:
                link = p_match.group(1)
                return link
            else:
                return link
        else:
            return None
        
    # better to handle a link with a 'None' value through a TypeError
    # exception or through if/else statements?  Compare the linkFilter
    # vs. imgFilter functions

    def imgFilter(link):
        # filter <img> links to retain only .jpg
        try:
            jpg_match = re.search(r'\.jpg', link)
            if jpg_match is not None:
                return link
            else:
                return None
        except TypeError:
            return None
        
    link_num = 0
    gallery_flag = None
    info['level'] += 1
    if info['page'] == '':
        return None
    # use lxml to parse and get the document root
    tree = html.parse(StringIO(info['page']))
    root = tree.getroot()
    root.make_links_absolute(info['url'])
    # info['level'] == 1 corresponds to the first recursive layer (i.e. the main gallery page)
    # info['level'] > 1 covers all other <a> links reached from the main gallery page
    if info['level'] == 1:
        link_cap = max_gallery_links
        gallery_flag = True
    else:
        link_cap = max_picture_links
        gallery_flag = False
    if info['level'] > 4:
        return None

    # get <img> links if the page is not the main gallery (gallery_flag == False)
    # and put them back into the main event loop to extract header information,
    # so that pictures can be judged by size (i.e. content-length)
    if not gallery_flag:
        for elem in root.iter('img'):
            # copy info so that each link gets its own dictionary instead of
            # a reference to the previous one
            info = info.copy()
            info['url'] = imgFilter(elem.get('src'))
            if info['url'] is not None:
                reactor.callFromThread(getImgHeader, info)

    # get <a> links and put the work back into the main event loop (i.e. w/
    # reactor.callFromThread...) to getPage and then parse, continuing the
    # cycle of linking
    for elem in root.iter('a'):
        if link_num > link_cap:
            break
        img = elem.find('img')
        if img is not None:
            link_num += 1
            info = info.copy()
            info['url'] = linkFilter(elem.get('href'))
            if info['url'] is not None:
                reactor.callFromThread(getLinkHTML, info)
                    
def getLinkHTML(info):
    # get html from <a> link and then send page to be parsed in a thread
    d = client.getPage(info['url'])
    d.addCallback(parseThread, info)
    d.addErrback(failure, "getLink Failure: " + info['url'])
    
def parseThread(page, info):
    print 'parsethread:', info['url']
    info['page'] = page
    reactor.callInThread(parsePage, info)

def getImgHeader(info):
    # get <img> header information to filter images by image size
    agent = client.Agent(reactor)
    d = agent.request('HEAD', info['url'], None, None)
    d.addCallback(getImg, info)
    d.addErrback(failure, "getImgHeader Failure: " + info['url'])

def getImg(img_header, info):
    # download the image only if it is above a certain threshold size
    img_size = img_header.headers.getRawHeaders('Content-Length')
    if img_size is not None and int(img_size[0]) > min_img_size:
        img_name = ''.join(map(urlToName, info['url']))
        client.downloadPage(info['url'], path.join(directory, img_name))
    else:
        img_header, link = None, None  # Does this help garbage collection?
    
def urlToName(char):
    # convert all unwanted characters in the url to '-' for use as a file name
    if char in '/\\?|<>"':
        return '-'
    else:
        return char
    
def failure(error, url):
    print error
    print url

def main():
    info = dict()
    info['url'] = start_url
    info['level'] = 0
    
    reactor.callWhenRunning(getLinkHTML, info)    
    reactor.suggestThreadPoolSize(2)
    reactor.run()
    
if __name__ == "__main__":
    main()

1 Answer


First, consider not writing this code at all. Take a look at scrapy as a solution to your needs. People have already gone to the effort of making it perform well, and if it does need to be improved, then when you improve it everyone in the community will benefit.

Next, the indentation in your code listing as originally posted was unfortunately messed up, making it hard to see what your code is really doing. Hopefully the following makes sense anyway, but make sure to double-check code listings in future questions so they accurately reflect what you're running.

As far as what your code is doing that is preventing it from being fast, here are some ideas.

There's no limit to the number of outstanding HTTP requests in the program. Without knowing what HTML you're actually parsing, I don't know if this is actually a problem, but if you end up issuing more than 20 or 30 HTTP requests at a time, it's very likely that you'll overload your network. With TCP, this often means that connection setup will not succeed (certain setup packets get lost and there is a limit on how many times they will be retried). Since you mentioned a lot of connection timeout errors, I suspect this is happening.

Consider how many HTTP requests the threaded version of your program will issue at a time. Does the Twisted version potentially issue more? If so, try imposing a limit on this. Something like twisted.internet.defer.DeferredSemaphore might be an easy way to impose this limit (although it's far from the best way, so if it helps then you might want to start looking at better ways to impose this limit - but if the limit doesn't help then no point investing a lot of effort in a nicer limiting mechanism).
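For illustration, here is a minimal sketch of the asker's getLinkHTML routed through a DeferredSemaphore; the limit of 20 concurrent requests is an arbitrary assumption to tune, not a recommendation:

from twisted.internet import defer
from twisted.web import client

# shared semaphore capping the number of in-flight HTTP requests;
# the limit of 20 is an assumption - tune it for your network
request_limit = defer.DeferredSemaphore(20)

def getLinkHTML(info):
    # run() waits for a free slot, calls getPage, and releases the
    # slot when the returned Deferred fires (success or failure)
    d = request_limit.run(client.getPage, info['url'])
    d.addCallback(parseThread, info)
    d.addErrback(failure, "getLink Failure: " + info['url'])

The same semaphore (or a second one with its own limit) could wrap the agent.request and downloadPage calls so that image fetches are throttled as well.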

Next, by limiting the reactor threadpool to a maximum of 2 threads, you're severely hampering your ability to resolve names. By default, name resolution (i.e., DNS) is done using the reactor thread pool. You have a couple of options here; I'm assuming there's a good reason you want to limit parsing to two concurrent threads.

First, you could leave the reactor threadpool alone and create your own thread pool for parsing. See twisted.python.threadpool.ThreadPool. You can set the maximum on this other thread pool to 2 to get the parsing behavior you want, and the reactor is free to use as many threads as it wants for name resolution.
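A rough sketch of this first option, reusing the asker's parsePage and inventing the name parse_pool for the dedicated pool:

from twisted.python.threadpool import ThreadPool
from twisted.internet import reactor

# dedicated two-thread pool for parsing, leaving the reactor's own
# pool free for name resolution
parse_pool = ThreadPool(minthreads=1, maxthreads=2, name='parser')
parse_pool.start()
# tear the pool's threads down when the reactor shuts down
reactor.addSystemEventTrigger('before', 'shutdown', parse_pool.stop)

def parseThread(page, info):
    info['page'] = page
    # hand parsing to the dedicated pool instead of reactor.callInThread
    parse_pool.callInThread(parsePage, info)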

Second, you could keep the reactor thread pool size low and also configure the reactor not to use threads for name resolution. twisted.names.client.createResolver will give you a name resolver which does just that, and reactor.installResolver lets you tell the reactor to use it instead of its default.
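A minimal sketch of this second option (client is aliased here only to avoid clashing with the twisted.web.client import already used by the scraper):

from twisted.names import client as dns_client
from twisted.internet import reactor

# perform DNS lookups in the event loop itself rather than via
# blocking hostname lookups in the reactor's thread pool
reactor.installResolver(dns_client.createResolver())
reactor.suggestThreadPoolSize(2)

With this installed, suggestThreadPoolSize(2) no longer starves name resolution, because DNS traffic never touches the thread pool.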

Answered on 2012-12-10T13:00:56.997