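I'm trying to scrape the source images from Google's Earth View page and rename them to something meaningful, so that each file name contains the city or country/region in addition to the file number.
I do this through the site's JSON files. I found that I can look up each file by its number and get redirected to the appropriate file. For example, /_api/1003.json redirects to /_api/australia-1003.json.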
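To see where a numbered URL points without following the redirect, the Location header can be read directly from the response. The following is only a minimal sketch; `redirect_target` is a hypothetical helper name of my own, not part of Net::HTTP, and the commented-out request assumes the endpoint still behaves as described above.

```ruby
require 'net/http'

# Returns the raw Location header if the response is a redirect (3xx),
# otherwise nil. `redirect_target` is a hypothetical helper, not a
# Net::HTTP method.
def redirect_target(response)
  return nil unless response.is_a?(Net::HTTPRedirection)
  response['location']
end

# Usage (performs a real request, so it is left commented out):
# res = Net::HTTP.get_response(URI.parse('http://earthview.withgoogle.com/_api/1003.json'))
# puts redirect_target(res)
```

Net::HTTP.get_response does not follow redirects on its own, which is why the 301 response is still available for inspection here.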
However, my tool breaks on files 1354, 1355, 2071, 2090, 2297, 2299, 5597 and 6058, raising the following error:
/user/.rvm/rubies/ruby-2.2.1/lib/ruby/2.2.0/open-uri.rb:354:in `rescue in
open_http': 301 Moved Permanently (Invalid Location URI)
(OpenURI::HTTPError)
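Checking the files manually, I found that these eight are not redirected automatically in the browser (using Chrome). Each shows a "Moved Permanently" message and points to a new location. Interestingly, every one of them has non-Latin characters in its place name, such as عندل-yemen and vegaøyan-norway. A quick scan of the roughly 1,500 images that were scraped successfully turned up no similar special characters.
Do non-Latin names stop the redirect? Or does Earth View stop redirecting because of the special characters?
How can I include these eight skipped files?
The code: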
require 'open-uri'
require 'json'
require 'net/http'

@earthview_range = 1003..7023

def scrape_away
  @earthview_range.each do |json_id|
    response = Net::HTTP.get_response(URI.parse("http://earthview.withgoogle.com/_api/#{json_id.to_s}.json"))
    if response.code.to_i < 400 # Filter 404s etc but allow redirects
      # Collect the necessary information
      data_hash = JSON.parse(open("http://earthview.withgoogle.com/_api/#{json_id.to_s}.json").read)
      new_file_name = data_hash["slug"]
      photo_url = data_hash["photoUrl"]
      cleanUrl = photo_url[0..3] + photo_url[5..-1]
      # Create the properly named files
      File.open(new_file_name + '.jpg', 'wb') do |f|
        f.write open(cleanUrl).read
      end
      puts "[✓]"
    else
      puts "--not found--"
    end
  end
  puts ''
  puts "Task complete. Have fun!"
  exit
end
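As far as I can tell, the "Invalid Location URI" message is raised when open-uri hands the Location header to URI.parse and that call fails, which it does for any string containing non-ASCII characters. One possible way around it is to percent-encode the non-ASCII bytes before parsing. The sketch below assumes the header arrives as raw UTF-8; `escape_non_ascii` is a hypothetical helper, not something open-uri or the URI library provides.

```ruby
require 'uri'

# Percent-encodes every byte outside the ASCII range and leaves all
# other bytes untouched, so an already valid URI is returned unchanged.
# `escape_non_ascii` is a hypothetical helper name.
def escape_non_ascii(location)
  location.each_byte.map { |b| b < 0x80 ? b.chr : format('%%%02X', b) }.join
end

# Usage, with a path taken from one of the failing IDs:
# URI.parse(escape_non_ascii('/_api/vegaøyan-norway-2297.json'))
```

Only the bytes that URI.parse objects to are touched, so the helper can be applied to every Location value rather than just the eight problem cases.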
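Edit: Thanks to this post, I pieced together a workable workaround. I supplied the URL paths for the eight special cases by hand and used the Addressable gem to normalize the non-Latin characters.
The updated code, as it stands now: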
require 'open-uri'
require 'json'
require 'net/http'
require 'addressable/uri'

# Range starts at 0 allowing simple 1-to-1 ID matching for exceptions.
@earthview_ary = (0..7200).to_a

# Special Exception forwarding addresses:
@earthview_ary[1354] = "sanlıurfa-merkez-turkey-1354"
@earthview_ary[1355] = "asagıkaravaiz-turkey-1355"
@earthview_ary[2071] = "عندل-yemen-2071"
@earthview_ary[2090] = "weißwasser-germany-2090"
@earthview_ary[2297] = "vegaøyan-norway-2297"
@earthview_ary[2299] = "herøy-nordland-norway-2299"
@earthview_ary[5597] = "زابل-iran-5597"
@earthview_ary[6058] = "mysove-мисове-crimea-6058"

def scrape_away
  # To save time, we'll skip past the gaps I know to be empty.
  # Earth View's numbering starts at 1000 and skips from 2450 to 5000.
  # The string entries (and index 0) have to_i == 0, so they are kept.
  scrape_ary = @earthview_ary.reject { |n| n.to_i.between?(1, 1000) || n.to_i.between?(2450, 5000) }

  # Begin
  scrape_ary.each do |json_id|
    host = "http://earthview.withgoogle.com"
    path = "/_api/#{json_id.to_s}.json"
    uri = Addressable::URI.parse(host + path)
    # The Addressable gem lets us normalize the non-Latin characters.
    # Very important!
    response = Net::HTTP.get_response(URI.parse(host + uri.normalized_path))
    if response.code.to_i < 400 # Filter 404s etc, but allow redirects
      # Collect the necessary information
      data_hash = JSON.parse(open(host + uri.normalized_path).read)
      new_file_name = data_hash["slug"]
      print "#{new_file_name}.jpg... "
      photo_url = data_hash["photoUrl"]
      # Drop the 5th character (the "s" in "https") to fetch over plain HTTP
      cleanUrl = photo_url[0..3] + photo_url[5..-1]
      # Create the properly named files
      File.open(new_file_name + '.jpg', 'wb') do |f|
        f.write open(cleanUrl).read
      end
      puts "[✓]"
    else
      puts "--not found--"
    end
  end
  puts ''
  puts "Task complete. Have fun!"
  exit
end
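A possible tidy-up, not something I have run against the site: the 7,201-element array could be replaced by a small Hash that maps only the exceptional IDs to their slugs, plus an explicit list of the ID ranges worth requesting. The names `SLUG_OVERRIDES`, `SCRAPE_IDS` and `api_path` are hypothetical, only two of the eight exceptions are shown, and the ranges assume the same gaps as in the code above.

```ruby
# Hypothetical override table: ID => slug. Only two of the eight
# exceptions are listed here to keep the sketch short.
SLUG_OVERRIDES = {
  2090 => 'weißwasser-germany-2090',
  2297 => 'vegaøyan-norway-2297'
}.freeze

# IDs to request, assuming the gaps described above
# (nothing up to 1000, nothing from 2450 to 5000).
SCRAPE_IDS = [*1001..2449, *5001..7200].freeze

# Builds the API path for an ID, using the override slug when one exists.
# The result still needs the same normalization step before it is requested.
def api_path(id, overrides = SLUG_OVERRIDES)
  "/_api/#{overrides.fetch(id, id)}.json"
end

# Usage:
# SCRAPE_IDS.each { |id| puts api_path(id) }
```

This keeps the exceptions in one place and avoids relying on to_i returning 0 for the string entries.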