
I'm using the following Ruby script in a Dashing widget. It retrieves an RSS feed, parses it, and sends the parsed titles and descriptions to the widget.

require 'net/http'
require 'uri'
require 'nokogiri'
require 'htmlentities'

news_feeds = {
  "seattle-times" => "http://seattletimes.com/rss/home.xml",
}

Decoder = HTMLEntities.new

class News
  def initialize(widget_id, feed)
    @widget_id = widget_id
    # split the feed URL into host and request path (query string included)
    uri = URI.parse(feed)
    @path = uri.request_uri
    @http = Net::HTTP.new(uri.host, uri.port)
  end

  def widget_id()
    @widget_id
  end

  def latest_headlines()
    response = @http.request(Net::HTTP::Get.new(@path))
    doc = Nokogiri::XML(response.body)
    news_headlines = []
    doc.xpath('//channel/item').each do |news_item|
      title = clean_html( news_item.xpath('title').text )
      summary = clean_html( news_item.xpath('description').text )
      news_headlines.push({ title: title, description: summary })
    end
    news_headlines
  end

  def clean_html( html )
    html = html.gsub(/<\/?[^>]*>/, "")
    html = Decoder.decode( html )
    return html
  end

end

@News = []
news_feeds.each do |widget_id, feed|
  begin
    @News.push(News.new(widget_id, feed))
  rescue StandardError => e
    puts e.to_s
  end
end

SCHEDULER.every '60m', :first_in => 0 do |job|
  @News.each do |news|
    headlines = news.latest_headlines()
    send_event(news.widget_id, { :headlines => headlines })
  end
end

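The sample RSS feed works fine because its URL points to an XML file. However, I want to use this script with a different RSS feed that doesn't serve an actual XML file. The feed I want is at http://www.ttc.ca/RSS/Service_Alerts/index.rss, and it doesn't seem to display anything on the widget. Besides "http://www.ttc.ca/RSS/Service_Alerts/index.rss", I also tried "http://www.ttc.ca/RSS/Service_Alerts/index.rss?format=xml" and "view-source:http://www.ttc.ca/RSS/Service_Alerts/index.rss", but no luck. Does anyone know how I can get the actual XML data for this RSS feed so that I can use it with this Ruby script?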


1 Answer


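You're right, that link doesn't serve the plain RSS XML the script expects, so the script can't parse it; it was written specifically for the layout of the sample feed. The RSS feed you're trying to parse serves RDF/XML, which you can parse with the RDFXML Rubygem (rdf-rdfxml).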
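One way to see the difference is to look at the root element of each feed. Here is a minimal sketch, assuming the TTC feed is RSS 1.0 (which is what an RDF/XML feed usually means). The inline sample documents and the feed_flavor helper are made up for illustration, and REXML is used instead of Nokogiri only so the sketch runs without extra gems. In RSS 1.0 the item elements sit next to channel under rdf:RDF rather than inside it, which would explain why the script's //channel/item XPath finds nothing.

```ruby
require 'rexml/document'

# Hypothetical, trimmed-down samples of the two feed layouts.
RSS2_SAMPLE = <<~XML
  <rss version="2.0">
    <channel>
      <title>Sample</title>
      <item><title>Headline</title><description>Text</description></item>
    </channel>
  </rss>
XML

RSS1_SAMPLE = <<~XML
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns="http://purl.org/rss/1.0/">
    <channel rdf:about="http://example.com/"><title>Sample</title></channel>
    <item rdf:about="http://example.com/1">
      <title>Headline</title><description>Text</description>
    </item>
  </rdf:RDF>
XML

# Guess the feed flavor from the local name of the root element.
def feed_flavor(xml)
  root = REXML::Document.new(xml).root
  case root.name
  when 'rss' then :rss2   # RSS 2.0: <rss><channel><item>...
  when 'RDF' then :rss1   # RSS 1.0: <rdf:RDF><channel/><item/>...
  else :unknown
  end
end

puts feed_flavor(RSS2_SAMPLE)
puts feed_flavor(RSS1_SAMPLE)
```

For a live feed you would pass response.body from the script's Net::HTTP request to feed_flavor instead of the sample strings.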

Something like:

require 'nokogiri'
require 'rdf/rdfxml'

rss_feed = 'http://www.ttc.ca/RSS/Service_Alerts/index.rss'

RDF::RDFXML::Reader.open(rss_feed) do |reader|
  # use reader to iterate over elements within the document
end

From here you can try to learn how to use RDFXML to extract what you want. I'd start by checking whether the reader object has methods I could use:

puts reader.methods.sort - Object.methods

This prints out the reader's own methods. Look for one you might be able to use for your purposes, such as reader.each_entry.
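As a small illustration of that introspection trick on a toy object (the FeedReader class below is hypothetical, not part of RDFXML): subtracting Object.instance_methods, rather than Object.methods, removes the methods every object inherits, which tends to give a shorter list.

```ruby
# Hypothetical stand-in for a reader object; this is not the RDFXML API.
class FeedReader
  def each_entry; end
  def rewind; end
end

reader = FeedReader.new

# Methods the reader responds to beyond those inherited by every Object.
own_methods = reader.methods.sort - Object.instance_methods
puts own_methods.inspect
```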

To dig further, you can inspect what each entry looks like:

reader.each_entry do |entry|
  puts "----here's an entry----" 
  puts entry.inspect
end

Or see which methods you can call on an entry:

reader.each_entry do |entry|
  puts "----here's an entry's methods----" 
  puts entry.methods.sort - Object.methods
  break
end

I was able to roughly find some titles and descriptions with this hack:

RDF::RDFXML::Reader.open('http://www.ttc.ca/RSS/Service_Alerts/index.rss') do |reader|
  reader.each_object do |object|
    puts object.to_s if object.is_a? RDF::Literal
  end
end

# returns:

# TTC Service Alerts
# http://www.ttc.ca/Service_Advisories/index.jsp

#      TTC Service Alerts.

# TTC.ca
# http://www.ttc.ca
# http://www.ttc.ca/images/ttc-main-logo.gif
# Service Advisory
# http://www.ttc.ca/Service_Advisories/all_service_alerts.jsp#Service+Advisory

# 196 York University Rocket route diverting northbound via Sentinel, Finch due to a collision that has closed the York U Bus way.
# - Affecting: Bus Routes: 196 York University Rocket
# 2013-12-17T13:49:03.800-05:00
# Service Advisory (2)
# http://www.ttc.ca/Service_Advisories/all_service_alerts.jsp#Service+Advisory+(2)

# 107B Keele North route diverting northbound via Keele, Lepage due to a collision that has closed the York U Bus way.
# - Affecting: Bus Routes: 107 Keele North
# 2013-12-17T13:51:08.347-05:00

But I couldn't quickly find a way to tell which one is the title and which one is the description :/
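One possible way around this, sketched under the assumption that the feed is RSS 1.0 (items in the http://purl.org/rss/1.0/ namespace, each with its own title and description children), is to skip the RDF triples and read the elements by name. The sample document and the rss1_headlines helper are made up for illustration, and REXML is used only so the sketch runs without extra gems; the same namespaced XPath should carry over to Nokogiri.

```ruby
require 'rexml/document'

# Namespace prefix mapping used by the XPath queries below.
RSS1_NS = { 'rss' => 'http://purl.org/rss/1.0/' }

# Hypothetical sample shaped like an RSS 1.0 (RDF) feed.
SAMPLE_FEED = <<~XML
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns="http://purl.org/rss/1.0/">
    <channel rdf:about="http://example.com/">
      <title>Sample Alerts</title>
    </channel>
    <item rdf:about="http://example.com/alerts#1">
      <title>Service Advisory</title>
      <description>Route 1 diverting via Main St.</description>
    </item>
    <item rdf:about="http://example.com/alerts#2">
      <title>Service Advisory (2)</title>
      <description>Route 2 delayed.</description>
    </item>
  </rdf:RDF>
XML

# Return one { title:, description: } hash per <item>, the same shape
# the widget script builds in latest_headlines.
def rss1_headlines(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//rss:item', RSS1_NS).map do |item|
    title = REXML::XPath.first(item, 'rss:title', RSS1_NS)
    summary = REXML::XPath.first(item, 'rss:description', RSS1_NS)
    # A missing or empty element becomes an empty string.
    { title: title&.text.to_s.strip, description: summary&.text.to_s.strip }
  end
end

rss1_headlines(SAMPLE_FEED).each do |headline|
  puts "#{headline[:title]}: #{headline[:description]}"
end
```

Because each hash comes from a single item element, the title and description stay paired, which the flat list of RDF literals above does not guarantee.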

Finally, if you still can't figure out how to extract what you want, start a new question with this information.

Good luck!

answered 2013-12-17T20:51:03.230