I'm trying out trio with asks. I use a nursery to start several crawlers at once, and a memory channel to maintain the list of URLs to visit.
Each crawler receives clones of both ends of that channel, so it can grab a URL (via receive_channel), fetch and read it, find new URLs to visit, and add them to the list (via send_channel).
```python
import math

import trio


async def main():
    send_channel, receive_channel = trio.open_memory_channel(math.inf)
    async with trio.open_nursery() as nursery:
        async with send_channel, receive_channel:
            nursery.start_soon(crawler, send_channel.clone(), receive_channel.clone())
            nursery.start_soon(crawler, send_channel.clone(), receive_channel.clone())
            nursery.start_soon(crawler, send_channel.clone(), receive_channel.clone())


async def crawler(send_channel, receive_channel):
    async for url in receive_channel:  # I'm a consumer!
        content = await ...
        urls_found = ...
        for u in urls_found:
            await send_channel.send(u)  # I'm a producer too!
```
In this situation every consumer is also a producer. How do I stop everything gracefully?
The condition for shutting everything down (one way to detect it is sketched after this list) is:
- the channel is empty
- AND
- all crawlers are stuck in their first for loop, waiting for a URL to show up in receive_channel (which... is never going to happen again)
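To make that termination condition concrete, here is a minimal sketch of one way it might be detected: a shared count of in-flight URLs, incremented before every send and decremented once a URL has been fully processed. The names state, pending, done and the fetch_and_parse stub are hypothetical stand-ins, not part of the code above:

```python
import math

import trio


async def fetch_and_parse(url):
    # Hypothetical stand-in for the real asks.get(...) + link extraction.
    return []


async def crawler(send_channel, receive_channel, state):
    async for url in receive_channel:
        urls_found = await fetch_and_parse(url)
        for u in urls_found:
            state["pending"] += 1          # a new unit of work now exists
            await send_channel.send(u)
        state["pending"] -= 1              # this url is fully processed
        if state["pending"] == 0:          # nothing in flight anywhere,
            state["done"].set()            # so signal that we can shut down


async def main():
    send_channel, receive_channel = trio.open_memory_channel(math.inf)
    state = {"pending": 1, "done": trio.Event()}  # 1 accounts for the seed
    await send_channel.send("https://example.com")
    async with trio.open_nursery() as nursery:
        for _ in range(3):
            nursery.start_soon(crawler, send_channel.clone(),
                               receive_channel.clone(), state)
        await state["done"].wait()         # wait for pending to hit zero
        nursery.cancel_scope.cancel()      # unblock the idle crawlers


trio.run(main)
```

Since trio only switches tasks at await points and there is no await between the decrement and the zero check, the bare counter needs no lock; it reaches zero exactly when every URL ever sent has been fully processed and produced no further work.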
I tried async with send_channel inside crawler(), but couldn't find a good way to do it. I also tried to find some different approach (some worker pool bound to the memory channels, etc.), and had no luck there either.
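For reference, roughly the shape that in-crawler attempt takes, and why it cannot terminate (a sketch; the loop body stands in for the fetch/parse code above):

```python
async def crawler(send_channel, receive_channel):
    async with send_channel, receive_channel:  # close our clones on exit
        async for url in receive_channel:
            ...  # fetch, parse, and send new urls as before
    # The async-for only ends once *every* send clone has been closed,
    # but each clone is only closed when its crawler leaves this block,
    # which requires its loop to end first: a circular wait.
```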