r - R: Filtering nodes with XPath

Question

I am trying to parse an HTML document with R. I want to grab one node, but that node contains some information I don't need.

For example:

<div class="content">
 <h3>Titel</h3>
 <p>content</p>
 <p>content</p>
 <ul>
  <li>List</li>
  <li>List</li>
 </ul>
</div>

I want all of the content, including the list. I don't need the title. Normally I would grab it with this code:

grabIt <- xml_text(xml_find_all(html, xpath="//div[@class='content']//text()
                       [not(ancestor-or-self::div[@class='content']//h3)]"))

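This usually works fine. But here the "[not(ancestor-or-self" line filters out all of the content. I think that is because I am filtering on something inside the node I want to grab. The code does work in cases where the title, or any other information I don't need, sits in a separate node, like this: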

<div class="content">
 <div class="Titel">Title</div>  <!-- difference -->
 <p>content</p>
 <p>content</p>
 <ul>
  <li>List</li>
  <li>List</li>
 </ul>
</div>

My other idea was:

grabIt <- xml_text(xml_find_all(html, xpath="//div[@class='content']//p//text()"))

But the problem is that this way I cannot grab the paragraphs and the list at the same time.

score 1 · Accepted Answer

Try this XPath:

//div[@class='content']/*[not(name()='h3')][name()='p']/text() | //div[@class='content']/*[not(name()='h3')]/*[name()='li']/text()

It gives:

'content'
'content'
'List'
'List'
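
For reference, here is a minimal sketch of how the expression above could be run with xml2 against the sample from the question. It also tries a shorter variant that is not part of the answer: the predicate in the question, not(ancestor-or-self::div[@class='content']//h3), is false for every text node because the enclosing div always contains an h3, so testing the text node's own ancestors with not(ancestor::h3) may be closer to what was intended. The names xp_answer, xp_alt, from_answer and from_alt are made up for this illustration.

```r
library(xml2)

# Sample document from the question
html <- read_html('<div class="content">
 <h3>Titel</h3>
 <p>content</p>
 <p>content</p>
 <ul>
  <li>List</li>
  <li>List</li>
 </ul>
</div>')

# XPath from the answer: text of the <p> children plus text of the <li>
# elements one level further down, joined with the XPath union operator
xp_answer <- paste(
  "//div[@class='content']/*[not(name()='h3')][name()='p']/text()",
  "//div[@class='content']/*[not(name()='h3')]/*[name()='li']/text()",
  sep = " | "
)
from_answer <- xml_text(xml_find_all(html, xp_answer))

# Alternative (an assumption, not from the answer): keep every text node
# under the div unless it sits inside an <h3>; [normalize-space()] drops
# whitespace-only text nodes between elements
xp_alt <- "//div[@class='content']//text()[not(ancestor::h3)][normalize-space()]"
from_alt <- xml_text(xml_find_all(html, xp_alt))

print(from_answer)
print(from_alt)
```

The alternative does not need to list the wanted element names, so it should keep working if other elements are added inside the div, as long as the unwanted text stays inside an h3.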
