I can get the list of Common Crawl files from the following URL:
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-09/wet.paths.gz
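
For reference, here is a minimal sketch of how that listing can be read in Python. It assumes the URL above serves a gzip-compressed text file with one path per line. The names `parse_paths` and `fetch_paths` are my own, for illustration only, and are not part of any Common Crawl tooling.

```python
import gzip
import urllib.request

# URL taken from the question; assumed to return a gzip-compressed text file.
WET_PATHS_URL = (
    "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-09/wet.paths.gz"
)


def parse_paths(gz_bytes):
    """Decompress a *.paths.gz payload and return its non-empty lines."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_paths(url=WET_PATHS_URL):
    """Download a path listing and return it as a list of strings."""
    with urllib.request.urlopen(url) as response:
        return parse_paths(response.read())


if __name__ == "__main__":
    paths = fetch_paths()
    print(len(paths), "paths")
    if paths:
        print(paths[0])
```
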
How can I do the same for the Common Crawl News Dataset?
I have tried different options, but I always get an error:
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS-2017-09/warc.paths.gz
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2017/09/warc.paths.gz