python - How to find a specific tag in an XML file and then access its parent tag using Python and minidom

Question

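I am trying to write some code that will search an XML file of articles for a specific DOI contained in a tag. When it finds the right DOI, I would like it to access the <title> and <abstract> text of the article associated with that DOI.

My XML file has this format: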

<root>
 <article>
  <number>
   0 
  </number>
  <DOI>
   10.1016/B978-0-12-381015-1.00004-6 
  </DOI>
  <title>
   The patagonian toothfish biology, ecology and fishery. 
  </title>
  <abstract>
   lots of abstract text
  </abstract>
 </article>
 <article>
  ...All the article tags as shown above...
 </article>
</root>

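I would like the script to find the article with the DOI 10.1016/B978-0-12-381015-1.00004-6 (for example) and then let me access the <title> and <abstract> tags within the corresponding <article> tag.

So far I have tried to adapt the code from this question: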

from xml.dom import minidom

datasource = open('/Users/philgw/Dropbox/PW-Honours-Project/Code/processed.xml')
xmldoc = minidom.parse(datasource)   

#looking for: 10.1016/B978-0-12-381015-1.00004-6

matchingNodes = [node for node in xmldoc.getElementsByTagName("DOI") if node.firstChild.nodeValue == '10.1016/B978-0-12-381015-1.00004-6']

for i in range(len(matchingNodes)):
    DOI = str(matchingNodes[i])
    print DOI

But I'm not entirely sure what I'm doing!

Thanks for your help.

score 1 · Accepted Answer

Is minidom a requirement? This would be quite easy to parse with lxml and XPath.

from lxml import etree
datasource = open('/Users/philgw/Dropbox/PW-Honours-Project/Code/processed.xml').read()
tree = etree.fromstring(datasource)
# normalize-space() trims the whitespace around the DOI text before comparing
path = tree.xpath("//article[normalize-space(DOI)='10.1016/B978-0-12-381015-1.00004-6']")

This gives you a list of the <article> elements with the specified DOI.

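Also, there appears to be whitespace between the tags and their text. I don't know whether that is just because of Stack Overflow's formatting. It may be the reason you can't get a match with minidom.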
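If minidom is a requirement, the whitespace point can be checked directly. The following is only a sketch, assuming the file really looks like the sample in the question. `SAMPLE_XML`, `node_text` and `child_text` are hypothetical names introduced for illustration, and the second article's DOI is a made-up placeholder. It starts from each <DOI> node, trims its text before comparing, and then uses `parentNode` to move up to the enclosing <article>.

```python
from xml.dom import minidom

# Inline copy of the question's sample so the sketch runs without the real file.
# The second article and its DOI are made up purely for illustration.
SAMPLE_XML = """<root>
 <article>
  <DOI>
   10.1016/B978-0-12-381015-1.00004-6
  </DOI>
  <title>
   The patagonian toothfish biology, ecology and fishery.
  </title>
  <abstract>
   lots of abstract text
  </abstract>
 </article>
 <article>
  <DOI>
   10.0000/made-up-placeholder
  </DOI>
  <title>
   Some other article
  </title>
  <abstract>
   other abstract text
  </abstract>
 </article>
</root>"""


def node_text(node):
    """Join the direct text children of node and trim surrounding whitespace."""
    return "".join(
        child.data for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()


def child_text(parent, tag):
    """Trimmed text of the first <tag> element under parent, or None if absent."""
    nodes = parent.getElementsByTagName(tag)
    return node_text(nodes[0]) if nodes else None


doc = minidom.parseString(SAMPLE_XML)
target = "10.1016/B978-0-12-381015-1.00004-6"

for doi_node in doc.getElementsByTagName("DOI"):
    # Without the trimming in node_text, the raw value still carries
    # the newlines and spaces around the DOI and never equals target.
    if node_text(doi_node) == target:
        article = doi_node.parentNode  # the enclosing <article> element
        print("title:", child_text(article, "title"))
        print("abstract:", child_text(article, "abstract"))
```

For the real file, `minidom.parse(path)` would take the place of `minidom.parseString(SAMPLE_XML)`.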

score 0 · Accepted Answer

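IMHO, just look it up in the Python docs! Try this (untested):

from xml.dom import minidom

# Path taken from the question
datasource = open('/Users/philgw/Dropbox/PW-Honours-Project/Code/processed.xml')
xmldoc = minidom.parse(datasource)

def get_xmltext(parent, subnode_name):
    # Serialize the children of the first matching subnode and trim the
    # whitespace that surrounds the text in the sample file
    node = parent.getElementsByTagName(subnode_name)[0]
    return "".join([ch.toxml() for ch in node.childNodes]).strip()

# Keep only the <article> elements whose DOI matches
matchingNodes = [node for node in xmldoc.getElementsByTagName("article")
           if get_xmltext(node, "DOI") == '10.1016/B978-0-12-381015-1.00004-6']

for node in matchingNodes:
    print("title:", get_xmltext(node, "title"))
    print("abstract:", get_xmltext(node, "abstract"))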
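The same article-first loop can also be sketched with `xml.etree.ElementTree`, which is a different standard-library module from the minidom the question asks about, so treat it as an alternative rather than as either answer's method. The sketch assumes the structure shown in the question. `SAMPLE_XML` and `find_article` are hypothetical names introduced here.

```python
import xml.etree.ElementTree as ET

# Inline copy of the question's sample so the sketch runs without the real file.
SAMPLE_XML = """<root>
 <article>
  <number>
   0
  </number>
  <DOI>
   10.1016/B978-0-12-381015-1.00004-6
  </DOI>
  <title>
   The patagonian toothfish biology, ecology and fishery.
  </title>
  <abstract>
   lots of abstract text
  </abstract>
 </article>
</root>"""


def find_article(root, doi):
    """Return the first <article> whose trimmed <DOI> text equals doi, else None."""
    for article in root.iter("article"):
        # findtext returns the raw text, including surrounding newlines and
        # spaces, so it is stripped before the comparison.
        if article.findtext("DOI", default="").strip() == doi:
            return article
    return None


root = ET.fromstring(SAMPLE_XML)
article = find_article(root, "10.1016/B978-0-12-381015-1.00004-6")
if article is not None:
    print("title:", article.findtext("title", default="").strip())
    print("abstract:", article.findtext("abstract", default="").strip())
```

Because the loop starts from each <article>, there is no need to navigate back up to a parent. Articles that lack a <DOI> child are skipped rather than raising an error.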
