common-crawl - Getting the offset and length of a subset of a WAT archive from the Common Crawl index server

Question

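I want to download a subset of a WAT archive segment from Amazon S3.

Background:

Searching the Common Crawl index at http://index.commoncrawl.org yields results that include the location of WARC files on AWS S3. For example, searching for url=www.celebuzz.com/2017-01-04/*&output=json yields JSON-formatted results, one of which is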

{ "urlkey":"com,celebuzz)/2017-01-04/watch-james-corden-george-michael-tribute", ... "filename":"crawl-data/CC-MAIN-2017-34/segments/1502886104631.25/warc/CC-MAIN-20170818082911-20170818102911-00023.warc.gz", ... "offset":"504411150", "length":"14169", ... }

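The filename entry indicates which archive segment contains the WARC record for this particular page. This archive file is huge; fortunately, the entry also includes offset and length fields, which can be used to request just the byte range containing the relevant part of the archive segment (see, for example, lines 22-30 of this gist).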
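To make the offset/length arithmetic concrete, here is a minimal sketch that turns one JSON line from the index into an HTTP Range value. The function name byte_range_from_index_line is made up for illustration. Note that offset and length arrive as strings in the JSON, and that an HTTP byte range is inclusive at both ends, hence the "- 1".

```python
import json

def byte_range_from_index_line(line: str) -> str:
    """Build an HTTP Range value from one JSON line of index output.

    Hypothetical helper: only the "offset" and "length" keys are read.
    """
    entry = json.loads(line)
    offset = int(entry["offset"])   # the index returns these as strings
    length = int(entry["length"])
    # Byte ranges are inclusive on both ends, so the last byte is offset + length - 1.
    return "bytes={}-{}".format(offset, offset + length - 1)
```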

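My question:

Given the location of a WARC file segment, I know how to construct the name of the corresponding WAT archive segment (see, for example, this tutorial). I only need a subset of the WAT file, so I would like to request a byte range. But how do I find the corresponding offset and length for the WAT archive segment?

I have checked the API documentation for the Common Crawl index server, and it is not clear to me whether this is even possible. In case it is, I am posting this question.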
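For reference, the name construction mentioned above can be sketched as follows. This assumes the usual Common Crawl layout, where the WAT file sits in a sibling wat/ directory and ends in .warc.wat.gz; the helper name wat_key_for is hypothetical.

```python
def wat_key_for(warc_key: str) -> str:
    """Derive the WAT key from a WARC key.

    Assumes the layout .../segments/<id>/warc/<name>.warc.gz
    maps to           .../segments/<id>/wat/<name>.warc.wat.gz
    """
    if "/warc/" not in warc_key or not warc_key.endswith(".warc.gz"):
        raise ValueError("not a WARC key: " + warc_key)
    # Split on the last "/warc/" so only the directory component is swapped.
    prefix, name = warc_key.rsplit("/warc/", 1)
    return prefix + "/wat/" + name[: -len(".warc.gz")] + ".warc.wat.gz"
```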

score 4 · Accepted Answer

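The Common Crawl index does not contain offsets into WAT and WET files. So the only way is to search the whole WAT/WET file for the desired record/URL. Possibly, the offsets could be estimated, because the record order in the WARC and WAT/WET files is the same.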
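A rough sketch of such a search, assuming the WAT file is written as one gzip member per record (as the WARC files are) and that each WAT record carries a WARC-Target-URI header. The names iter_gzip_members and find_record are made up for illustration, and the whole file is held in memory here, so this shows the idea rather than something tuned for files of several hundred MB.

```python
import zlib

def iter_gzip_members(data: bytes):
    """Yield (offset, length, decompressed_bytes) for each gzip member in data."""
    pos = 0
    while pos < len(data):
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        payload = d.decompress(data[pos:]) + d.flush()
        # Whatever zlib did not consume belongs to the following members.
        consumed = len(data) - pos - len(d.unused_data)
        yield pos, consumed, payload
        pos += consumed

def find_record(data: bytes, target_uri: str):
    """Return (offset, length) of the first record for target_uri, or None."""
    needle = ("WARC-Target-URI: " + target_uri + "\r\n").encode("utf-8")
    for offset, length, payload in iter_gzip_members(data):
        # Only look at the WARC header block, not the record body.
        header = payload.split(b"\r\n\r\n", 1)[0] + b"\r\n"
        if needle in header:
            return offset, length
    return None
```

Once found this way, the (offset, length) pair can be cached and reused for later range requests against the same WAT file.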

score 1 · Accepted Answer

After a lot of trial and error, I managed to fetch a byte range from a WARC file with Python and boto3 as follows:

# You have this from the index
offset, length, filename = 2161478, 12350, "crawl-data/[...].warc.gz"

import boto3
from botocore import UNSIGNED
from botocore.client import Config

# Boto3 anonymous login to common crawl
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

# Compute the inclusive byte range
offset_end = offset + length - 1
byte_range = 'bytes={offset}-{end}'.format(offset=offset, end=offset_end)
gzipped_text = s3.get_object(Bucket='commoncrawl', Key=filename, Range=byte_range)['Body'].read()

# Write the requested record, still gzip-compressed (binary mode, since the body is bytes)
with open("file.gz", 'wb') as f:
  f.write(gzipped_text)

The rest is optimization... Hope it helps! :)
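As a possible follow-up, the fetched bytes can be decompressed in memory instead of being written to disk, since each record in a Common Crawl WARC file is its own gzip member. A minimal sketch; read_warc_record is a hypothetical helper, and the header parsing is deliberately simplistic (no continuation lines).

```python
import gzip

def read_warc_record(gzipped: bytes):
    """Decompress one gzipped WARC record and split it into version, headers, body."""
    raw = gzip.decompress(gzipped)
    # The WARC header block ends at the first blank line.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8", errors="replace").split("\r\n")
    version = lines[0]              # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, headers, body
```

With the snippet above, this would be called as read_warc_record(gzipped_text).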
