I wrote this script to read some files from Common Crawl:

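import sys

import boto
import warc
from boto.s3.key import Key
from gzipstream import GzipStreamFile

for line in sys.stdin:
    # Each input line is the key of one WARC file
    line = line.strip()
    # Connect to AWS anonymously and open the Common Crawl bucket
    conn = boto.connect_s3(anon=True, host='s3.amazonaws.com')
    pds = conn.get_bucket('commoncrawl')
    k = Key(pds)
    k.key = line

    # Stream and decompress the file while iterating over its records
    f = warc.WARCFile(fileobj=GzipStreamFile(k))
    skipped_doc = 0
    for num, record in enumerate(f):
        pass  # analysis code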

Each line of standard input is the key of a WARC file. When I run this script to analyze 5 files, I get this exception:

Traceback (most recent call last):
  File "./warc_mapper_full.py", line 42, in <module>
    for num, record in enumerate(f):
  File "/usr/lib/python2.7/site-packages/warc/warc.py", line 393, in __iter__
    record = self.read_record()
  File "/usr/lib/python2.7/site-packages/warc/warc.py", line 364, in read_record
    self.finish_reading_current_record()
  File "/usr/lib/python2.7/site-packages/warc/warc.py", line 358, in finish_reading_current_record
    self.current_payload.read()
  File "/usr/lib/python2.7/site-packages/warc/utils.py", line 59, in read
    return self._read(self.length)
  File "/usr/lib/python2.7/site-packages/warc/utils.py", line 69, in _read
    content = self.buf + self.fileobj.read(size)
  File "/home/hpcnl/Documents/kics/current_work/aws/tasks/warc-analysis/src/gzipstream/gzipstream/gzipstreamfile.py", line 67, in read
    result = super(GzipStreamFile, self).read(*args, **kwargs)
  File "/home/hpcnl/Documents/kics/current_work/aws/tasks/warc-analysis/src/gzipstream/gzipstream/gzipstreamfile.py", line 48, in readinto
    data = self.read(len(b))
  File "/home/hpcnl/Documents/kics/current_work/aws/tasks/warc-analysis/src/gzipstream/gzipstream/gzipstreamfile.py", line 38, in read
    raw = self.stream.read(io.DEFAULT_BUFFER_SIZE)
  File "/usr/lib/python2.7/site-packages/boto/s3/key.py", line 400, in read
    data = self.resp.read(size)
  File "/usr/lib/python2.7/site-packages/boto/connection.py", line 413, in read
    return http_client.HTTPResponse.read(self, amt)
  File "/usr/lib64/python2.7/httplib.py", line 602, in read
    s = self.fp.read(amt)
  File "/usr/lib64/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/lib64/python2.7/ssl.py", line 736, in recv
    return self.read(buflen)
  File "/usr/lib64/python2.7/ssl.py", line 630, in read
    v = self._sslobj.read(len or 1024)
ssl.SSLError: ('The read operation timed out',)

I have run it many times, and the exception above occurs every time. What is causing it?
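
One workaround I am considering, assuming the timeout is transient rather than something wrong in the script itself, is to retry the whole file when a read times out. This is only a minimal sketch: `retry_on_timeout` is a hypothetical helper of my own, not part of boto, warc, or gzipstream, and the attempt count and delays are arbitrary.

```python
import socket
import ssl
import time


def retry_on_timeout(func, attempts=3, delay=1.0, sleep=time.sleep):
    """Call func(); if a read times out, wait and call it again.

    Hypothetical helper: func must redo the whole unit of work
    (reopen the key and re-read the file), because a stream that
    timed out partway through is not resumed here.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except (ssl.SSLError, socket.timeout):
            if attempt == attempts:
                raise  # out of attempts, surface the original error
            # Exponential backoff: delay, 2 * delay, 4 * delay, ...
            sleep(delay * 2 ** (attempt - 1))
```

In the script above this would mean moving the body of the outer loop into a function (say `process_warc(line)`, also a hypothetical name) and calling `retry_on_timeout(lambda: process_warc(line))`. It would not explain why the timeout happens every time, so I would still like to understand the underlying cause.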
