r - Parsing non-nested XML tags in R

Question

I am trying to parse a number of documents using the excellent xml2 R library. As an example, consider the following XML file:

pg <- read_xml("https://www.theyworkforyou.com/pwdata/scrapedxml/westminhall/westminster2001-01-24a.xml")

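It contains a number of <speech> tags. These are separated by a number of <minor-heading> and <major-heading> tags, but they are not nested inside them. I would like to process this document into a data.frame with the following structure: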

     major_heading_id  speech_text
     heading_id_1       text1
     heading_id_1       text2
     heading_id_2       text3
     heading_id_2       text4

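Unfortunately, because the tags are not nested, I cannot figure out how to do this. I have code that successfully recovers the relevant information (see below), but matching each speech tag to its respective major heading is beyond me.

My intuition is that it would be best to split the XML document at the heading tags and then treat each piece as a separate document, but I cannot find a function in the xml2 package that lets me do this.

Any help would be great.

Where I have got to so far: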

speech_recs <- xml_find_all(pg, "//speech")
speech_text <- trimws(xml_text(speech_recs))

heading_recs <- xml_find_all(pg, "//major-heading")
major_heading_id <- xml_attr(heading_recs, "id")

score 1 · Accepted Answer

You can do this as follows:

library(xml2)
library(tidyverse)
doc <- read_xml("https://www.theyworkforyou.com/pwdata/scrapedxml/westminhall/westminster2001-01-24a.xml")

# Get the headings
heading_recs <- xml_find_all(doc, "//major-heading")

# Build one XPath expression per heading: the n-th expression selects
# the speech nodes that have exactly n major-heading siblings before them.
path <- sprintf("//speech[count(preceding-sibling::major-heading)=%d]", 
                seq_along(heading_recs))

# Get the text of the speech nodes
map(path, ~xml_text(xml_find_all(doc, .x))) %>% 
# Combine it with the id of the headings
  map2_df(xml_attr(heading_recs, "id"), 
          ~tibble(major_heading_id = .y, speech_text = .x))

This yields a tibble with the columns major_heading_id and speech_text, with one row per speech.
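
To try the same XPath idea without downloading the real file, here is a minimal sketch on a made-up miniature document. The element names mirror the question, but the ids (h1, h2, ...) and texts are hypothetical, and base R is used instead of the tidyverse only to keep the example small. The last step is a cross-check with a different technique, not part of the answer above: it asks each speech for its nearest preceding major-heading sibling.

```r
library(xml2)

# Hypothetical miniature document: headings and speeches are siblings, not nested.
toy <- read_xml('<publicwhip>
  <major-heading id="h1">First debate</major-heading>
  <speech id="s1">text1</speech>
  <minor-heading id="m1">Sub-topic</minor-heading>
  <speech id="s2">text2</speech>
  <major-heading id="h2">Second debate</major-heading>
  <speech id="s3">text3</speech>
  <speech id="s4">text4</speech>
</publicwhip>')

headings    <- xml_find_all(toy, "//major-heading")
heading_ids <- xml_attr(headings, "id")

# For the n-th heading, select the speeches with exactly n major headings
# among their preceding siblings (the same XPath as in the answer).
rows <- lapply(seq_along(headings), function(n) {
  xpath    <- sprintf("//speech[count(preceding-sibling::major-heading)=%d]", n)
  speeches <- xml_find_all(toy, xpath)
  data.frame(
    major_heading_id = rep(heading_ids[n], length(speeches)),
    speech_text      = trimws(xml_text(speeches)),
    stringsAsFactors = FALSE
  )
})
result <- do.call(rbind, rows)
print(result)

# Cross-check (alternative technique): preceding-sibling is a reverse axis,
# so [1] picks the closest major-heading before each speech.
all_speeches <- xml_find_all(toy, "//speech")
nearest_ids  <- xml_attr(
  xml_find_first(all_speeches, "preceding-sibling::major-heading[1]"),
  "id"
)
```

Both routes should pair s1 and s2 with h1, and s3 and s4 with h2. The nearest-heading variant needs only one XPath evaluation per speech set, which may matter on large files, but that has not been measured here.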
