wget - Creating a WARC with wget using --mirror and --input-file

Translated from: https://stackoverflow.com/questions/68015758 2021-06-17T08:33:54.583

I have a lot of websites that must be saved as WARC files.

A simple way to do this is:

$ wget --no-verbose --delete-after --no-directories \
  --page-requisites --mirror \
  --warc-cdx --warc-file=example https://example.com

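once for each website.

But I also have a list of individual pages, and I need to be absolutely certain that each of them is visited.

For example: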

https://example.com/post1
https://example.com/post2
https://example.com/post3

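must be saved, but I am not sure that a spider starting its crawl at https://example.com will find these links.

So I thought of writing a file urls.txt with this content: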

https://example.com
https://example.com/post1
https://example.com/post2
https://example.com/post3

and running:

$ wget --no-verbose --delete-after --no-directories \
  --page-requisites --mirror \
  --warc-cdx --warc-file=example --input-file=urls.txt

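But the resulting example.warc.gz is much larger, because every page is visited several times. I think wget starts a new mirror from each link in the file, so this is like saving the website 4 times.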
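One way to check whether pages really are recorded several times is to count how often each URL appears in the CDX index that --warc-cdx writes alongside the archive. The following is only a minimal sketch under some assumptions: the index is named example.cdx, its header line starts with ` CDX`, and the first space-separated field of every other line is the original URL. The function name cdx_duplicates is made up for illustration.

```shell
#!/bin/sh
# Sketch: list URLs that occur more than once in a wget-generated CDX index.
# Assumption: the header line has "CDX" as its first field, and on every
# other line the first space-separated field is the original URL.
cdx_duplicates() {
    awk '$1 != "CDX" { count[$1]++ }
         END {
             for (url in count)
                 if (count[url] > 1)
                     print count[url], url
         }' "$1" |
        sort -k1,1nr -k2
}

# Example call (file name assumed from --warc-file=example):
# cdx_duplicates example.cdx
```

Each output line would be a count followed by a URL. If the suspicion above is right, most pages of the site should show up with a count greater than 1.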

How can I create a WARC using --input-file while mirroring, without the duplication?