python - Scraping many URLs behind a login in a reasonable time

Question

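I am trying to scrape some data from a website that requires a login to see the actual content. Everything works, but each request takes about 5 seconds, which is far too slow for my needs (>5000 URLs to scrape). There seem to be faster approaches, such as the asyncio and aiohttp modules. However, none of the examples I found online show how to log in to a website and then use these tools.

So I basically need an easy-to-understand example of how to do this.

I tried to rebuild this example with my code, but it did not work: https://realpython.com/python-concurrency/#what-is-concurrency
I also tried AsyncHTMLSession() from requests_html, which returned something but did not seem to remember the login.

This is my code so far: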

import requests
from bs4 import BeautifulSoup

payload = {
"name" : "username",
"password" : "example_pass",
"destination" : "MAS_Management_UserConsole",
"loginType" : ""
}

links = [several urls]

### stuff with requests
with requests.Session() as c:
    c.get('http://boldsystems.org/')
    c.post('http://boldsystems.org/index.php/Login', data = payload)

def return_id(link):
    page = c.get(link).content
    soup = BeautifulSoup(page, 'html.parser')
    return soup.find(id = 'processidLC').text

for link in links:
    print(return_id(link))

score 0 · Accepted Answer

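It looks like you are already using requests, so you could try requests-async. The example below should help with the "in a reasonable time" part of your question; just adjust the parse_html function to search for your HTML tag. By default it runs at most 50 requests concurrently (MAX_REQUESTS) so as not to exhaust resources on your system (file descriptors, etc.).

Example: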

import asyncio
import requests_async as requests
import time

from bs4 import BeautifulSoup
from requests_async.exceptions import HTTPError, RequestException, Timeout


MAX_REQUESTS = 50
URLS = [
    'http://envato.com',
    'http://amazon.co.uk',
    'http://amazon.com',
    'http://facebook.com',
    'http://google.com',
    'http://google.fr',
    'http://google.es',
    'http://google.co.uk',
    'http://internet.org',
    'http://gmail.com',
    'http://stackoverflow.com',
    'http://github.com',
    'http://heroku.com',
    'http://djangoproject.com',
    'http://rubyonrails.org',
    'http://basecamp.com',
    'http://trello.com',
    'http://yiiframework.com',
    'http://shopify.com',
    'http://airbnb.com',
    'http://instagram.com',
    'http://snapchat.com',
    'http://youtube.com',
    'http://baidu.com',
    'http://yahoo.com',
    'http://live.com',
    'http://linkedin.com',
    'http://yandex.ru',
    'http://netflix.com',
    'http://wordpress.com',
    'http://bing.com',
]


class ScraperError(Exception):
    # Named ScraperError so it does not shadow the built-in BaseException
    pass


class HTTPRequestFailed(ScraperError):
    pass


async def fetch(url, timeout=5):
    async with requests.Session() as session:
        try:
            resp = await session.get(url, timeout=timeout)
            resp.raise_for_status()
        except HTTPError:
            raise HTTPRequestFailed(f'Skipped: {resp.url} ({resp.status_code})')
        except Timeout:
            raise HTTPRequestFailed(f'Timeout: {url}')
        except RequestException as e:
            raise HTTPRequestFailed(e)
    return resp


async def parse_html(html):
    bs = BeautifulSoup(html, 'html.parser')
    # bs.title is None when the page has no <title> tag
    title = bs.title.text.strip() if bs.title else ''
    return title if title else "Unknown"


async def run(sem, url):
    async with sem:
        start_t = time.time()
        resp = await fetch(url)
        title = await parse_html(resp.text)
        end_t = time.time()
        elapsed_t = end_t - start_t
        r_time = resp.elapsed.total_seconds()
        print(f'{url}, title: "{title}" (total: {elapsed_t:.2f}s, request: {r_time:.2f}s)')
        return resp


async def main():
    sem = asyncio.Semaphore(MAX_REQUESTS)
    tasks = [asyncio.create_task(run(sem, url)) for url in URLS]
    for f in asyncio.as_completed(tasks):
        try:
            result = await f
        except Exception as e:
            print(e)


if __name__ == '__main__':
    asyncio.run(main())

Output:

# time python req.py 
http://google.com, title: "Google" (total: 0.69s, request: 0.58s)
http://yandex.ru, title: "Яндекс" (total: 2.01s, request: 1.65s)
http://github.com, title: "The world’s leading software development platform · GitHub" (total: 2.12s, request: 1.90s)
Timeout: http://yahoo.com
...

real    0m6.868s
user    0m3.723s
sys 0m0.524s

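Now, this may still not solve your login problem. The HTML tag you are looking for (or the whole page) may be generated by JavaScript, in which case you will need a tool such as requests-html, which uses a headless browser to read JavaScript-rendered content.

Your login form may also use CSRF protection, for example when logging in to the Django admin backend: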

>>> import requests
>>> s = requests.Session()
>>> get = s.get('http://localhost/admin/')
>>> csrftoken = get.cookies.get('csrftoken')
>>> payload = {'username': 'admin', 'password': 'abc123', 'csrfmiddlewaretoken': csrftoken, 'next': '/admin/'}
>>> post = s.post('http://localhost/admin/login/?next=/admin/', data=payload)
>>> post.status_code
200

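We first perform a GET request with the session to obtain the token from the csrftoken cookie, and then log in, sending the two hidden form fields (csrfmiddlewaretoken and next):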

<form action="/admin/login/?next=/admin/" method="post" id="login-form">
  <input type="hidden" name="csrfmiddlewaretoken" value="uqX4NIOkQRFkvQJ63oBr3oihhHwIEoCS9350fVRsQWyCrRub5llEqu1iMxIDWEem">
  <div class="form-row">
    <label class="required" for="id_username">Username:</label>
    <input type="text" name="username" autofocus="" required="" id="id_username">
  </div>
  <div class="form-row">
    <label class="required" for="id_password">Password:</label> <input type="password" name="password" required="" id="id_password">
    <input type="hidden" name="next" value="/admin/">
  </div>
    <div class="submit-row">
    <label>&nbsp;</label>
    <input type="submit" value="Log in">
  </div>
</form>
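
If a site puts the token only in the form and not in a cookie, the hidden fields can be read from the login page itself. A minimal sketch using only the standard library; HiddenInputParser and hidden_fields are names made up for this illustration, not part of requests or Django:

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attrs = dict(attrs)
        # Attribute values can be None when written without "=...".
        is_hidden = (attrs.get('type') or '').lower() == 'hidden'
        if is_hidden and attrs.get('name'):
            self.fields[attrs['name']] = attrs.get('value') or ''


def hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML text."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields
```

With the session from above, the payload could then be built as {'username': 'admin', 'password': 'abc123', **hidden_fields(get.text)}, assuming the login page contains only one form.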

Note: the examples use Python 3.7+.
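
To combine the two parts: fetch() in the example opens a new session for every URL, so cookies from a login would not carry over. A rough sketch of logging in once and reusing the same session for all requests. It assumes `session` is any object with awaitable get() and post() methods (for instance a requests_async.Session, which I have not tested against your site); fetch_text and scrape_logged_in are made-up names:

```python
import asyncio


async def fetch_text(session, sem, url):
    # The semaphore caps how many requests are in flight at once.
    async with sem:
        resp = await session.get(url)
        return resp.text


async def scrape_logged_in(session, login_url, payload, urls, max_requests=50):
    # Log in once; the session is expected to keep the cookies afterwards.
    await session.post(login_url, data=payload)
    sem = asyncio.Semaphore(max_requests)
    tasks = [fetch_text(session, sem, url) for url in urls]
    # gather() returns the results in the same order as `urls`.
    return await asyncio.gather(*tasks)
```

Inside `async with requests.Session() as session:` you would then await scrape_logged_in(session, 'http://boldsystems.org/index.php/Login', payload, links) and run parse_html on each returned page.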

score 0 · Accepted Answer

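Take a look at asyncio and use the asyncio.gather function.

Wrap everything below the "links = [several urls]" line in a method.

Note that this is not thread-safe, so do not change shared variables inside the method.

The calls also run concurrently, so it may be useful to delay some of them with asyncio.sleep(randint(0, 2)) so that they do not all fire at the same time.

Then use asyncio to call the method with each new URL, like so: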

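import asyncio

async def main(urls):
    # wrapped_method must be a coroutine function (async def)
    tasks = [wrapped_method(url) for url in urls]
    # gather() has to be awaited; results come back in the order of urls
    return await asyncio.gather(*tasks)

results = asyncio.run(main(urls))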
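
One caveat: asyncio.gather only overlaps work that actually awaits. A plain requests call blocks the event loop, so wrapping return_id in an async def as-is would still run the requests one after another. A minimal sketch, assuming Python 3.9+ for asyncio.to_thread, that pushes the blocking call into worker threads; blocking_fetch is a hypothetical stand-in for return_id and only sleeps here:

```python
import asyncio
import random
import time


def blocking_fetch(url):
    # Hypothetical stand-in for return_id(link); sleeps instead of doing I/O.
    time.sleep(0.1)
    return f'fetched {url}'


async def wrapped_method(url, sem, max_delay=0.0):
    # Optional random delay so that the requests do not all start at once.
    if max_delay:
        await asyncio.sleep(random.uniform(0, max_delay))
    async with sem:
        # Run the blocking call in a worker thread so the loop stays free.
        return await asyncio.to_thread(blocking_fetch, url)


async def scrape(urls, max_concurrent=10, max_delay=0.0):
    sem = asyncio.Semaphore(max_concurrent)
    tasks = [wrapped_method(url, sem, max_delay) for url in urls]
    return await asyncio.gather(*tasks)
```

Calling asyncio.run(scrape(links)) returns the results in the order of links. Sharing one requests.Session between the threads is commonly done, but I am not aware of a documented thread-safety guarantee, so treat that part as an assumption.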

Hope that helps.

Otherwise, take a look at https://github.com/jreese/aiomultiprocess
