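I'm trying to scrape the source images from Google's Earth View pages while renaming them to something meaningful, so that each file name includes the city or country/region in addition to the file number.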

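I do this through the JSON files. I found that I can look up each file by number and get redirected to the appropriate file. For example, /_api/1003.json redirects to /_api/australia-1003.json.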

However, my tool breaks on files 1354, 1355, 2071, 2090, 2297, 2299, 5597, and 6058, raising the following error:

/user/.rvm/rubies/ruby-2.2.1/lib/ruby/2.2.0/open-uri.rb:354:in `rescue in open_http':
301 Moved Permanently (Invalid Location URI) (OpenURI::HTTPError)

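Checking the files manually, I found that these eight do not redirect automatically in the browser (Chrome). Each one shows a "Moved Permanently" message that points to a new location. Interestingly, every one of them has non-Latin characters in its place name, such as عندل-yemen and vegaøyan-norway. A quick skim of the roughly 1,500 images that scraped successfully turned up no similar special characters.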
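The "(Invalid Location URI)" part of the error is what open-uri raises when URI.parse rejects the redirect's Location header. A minimal sketch of that check, using only the standard library: `parseable_uri?` is a hypothetical helper name, and the second URL is an assumed form of the redirect target, built from the slug rather than copied from a real response.

```ruby
require 'uri'

# Returns true if Ruby's URI parser accepts the string, false otherwise.
def parseable_uri?(str)
  URI.parse(str)
  true
rescue URI::InvalidURIError
  false
end

# ASCII-only redirect target, as in the working 1003 case.
ascii_location = 'http://earthview.withgoogle.com/_api/australia-1003.json'

# Assumed redirect target containing a raw non-ASCII character.
non_ascii_location = 'http://earthview.withgoogle.com/_api/vegaøyan-norway-2297.json'

puts "ASCII location parseable?     #{parseable_uri?(ascii_location)}"
puts "non-ASCII location parseable? #{parseable_uri?(non_ascii_location)}"
```

If the second check fails while the first passes, the parser is objecting to the raw characters, which would suggest the server is still sending the redirect and the failure happens on the client side.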

Do non-Latin names stop the redirect? Or does Earth View stop redirecting because of the special characters?

How can I include these eight skipped files?

Code:

require 'open-uri'
require 'json'
require 'net/http'

@earthview_range = 1003..7023

def scrape_away
  @earthview_range.each do |json_id|
    response = Net::HTTP.get_response(URI.parse("http://earthview.withgoogle.com/_api/#{json_id.to_s}.json"))
    if response.code.to_i < 400   # Filter 404s etc but allow redirects

      # Collect the necessary information
      data_hash = JSON.parse(open("http://earthview.withgoogle.com/_api/#{json_id.to_s}.json").read)
      new_file_name = data_hash["slug"]  
      photo_url = data_hash["photoUrl"]  
      cleanUrl = photo_url[0..3] + photo_url[5..-1]

      # Create the properly named files
      File.open(new_file_name + '.jpg', 'wb') do |f|
        f.write open(cleanUrl).read
      end
      puts "[✓]"
    else
      puts "--not found--"
    end
  end
  puts ''
  puts "Task complete. Have fun!"
  exit
end

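Edit: Thanks to this post, I cobbled together a working workaround. I supply the URL paths for the eight special cases by hand and use the Addressable gem to normalize the non-Latin characters.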

The updated code, as it stands now:

require 'open-uri'
require 'json'
require 'net/http'
require 'addressable/uri'

# Range starts at 0 allowing simple 1-to-1 ID matching for exceptions.
@earthview_ary = (0..7200).to_a

# Special Exception forwarding addresses:
@earthview_ary[1354] = "sanlıurfa-merkez-turkey-1354"
@earthview_ary[1355] = "asagıkaravaiz-turkey-1355"
@earthview_ary[2071] = "عندل-yemen-2071"
@earthview_ary[2090] = "weißwasser-germany-2090"
@earthview_ary[2297] = "vegaøyan-norway-2297"
@earthview_ary[2299] = "herøy-nordland-norway-2299"
@earthview_ary[5597] = "زابل-iran-5597"
@earthview_ary[6058] = "mysove-мисове-crimea-6058"

def scrape_away
  # To save time, we'll skip past the gaps I know to be empty.
  # Earth View's numbering starts at 1000 and skips from 2450 to 5000.
  scrape_ary = @earthview_ary.reject { |n| n.to_i.between?(1, 1000) || n.to_i.between?(2450, 5000) }

  # Begin
  scrape_ary.each do |json_id|
    host = "http://earthview.withgoogle.com"
    path = "/_api/#{json_id.to_s}.json"
    uri = Addressable::URI.parse(host + path)
    # The Addressable gem lets us normalize the non-Latin characters.
    # Very important!
    response = Net::HTTP.get_response(URI.parse(host + uri.normalized_path))
    if response.code.to_i < 400   # Filter 404s etc, but allow redirects

      # Collect the necessary information
      data_hash = JSON.parse(open(host + uri.normalized_path).read)
      new_file_name = data_hash["slug"]
      print "#{new_file_name}.jpg... "
      photo_url = data_hash["photoUrl"]
      cleanUrl = photo_url[0..3] + photo_url[5..-1]

      # Create the properly named files
      File.open(new_file_name + '.jpg', 'wb') do |f|
        f.write open(cleanUrl).read
      end
      puts "[✓]"
    else
      puts "--not found--"
    end
  end
  puts ''
  puts "Task complete. Have fun!"
  exit
end
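
The hard-coded list of eight slugs could probably be avoided by following the redirect by hand and percent-encoding the Location header before parsing it. A minimal sketch using only the standard library: `escape_non_ascii` and `fetch_following_redirects` are hypothetical helper names, and the sketch assumes the server sends the new location as raw UTF-8 bytes in the Location header, which is not confirmed here.

```ruby
require 'net/http'
require 'uri'

# Percent-encode every non-ASCII byte, leaving ASCII characters untouched.
def escape_non_ascii(str)
  str.bytes.map { |b| b < 0x80 ? b.chr : format('%%%02X', b) }.join
end

# Hypothetical helper: GET a URL, following up to `limit` redirects by hand
# so that the Location header can be escaped before it is parsed.
def fetch_following_redirects(url, limit = 5)
  raise ArgumentError, 'too many redirects' if limit.zero?

  uri = URI.parse(escape_non_ascii(url))
  response = Net::HTTP.get_response(uri)
  return response unless response.is_a?(Net::HTTPRedirection)

  # The Location value may be relative, so resolve it against the current URI.
  target = uri.merge(escape_non_ascii(response['location']))
  fetch_following_redirects(target.to_s, limit - 1)
end
```

With something like this, `JSON.parse(fetch_following_redirects("http://earthview.withgoogle.com/_api/2297.json").body)` would stand in for the `open(...).read` call, and the numeric range could be iterated directly without the exception table.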