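I want to extract the text "Catholic Blended Margaritas" from the portion of an HTML page pasted below.
I am using the following XPath expression for this:
xPath = "//div[@class='recipeBox']/div[@class='detailBox']/h3/text()";
I pass it to HtmlCleaner; the relevant part of my code is: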
//use the cleaner to "clean" the HTML and return it as a TagNode object i.e. HTML page root node
TagNode rootNode = htmlCleaner.clean(new InputStreamReader(conn.getInputStream()));
// query XPath
Object[] nodes = rootNode.evaluateXPath(xPath);
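However, the expression above returns zero nodes.
I have pasted only part of the HTML. In fact, I want the text of every such node on the page, not just this one. For reference, the full page is at: http://www.foodfood.com/category/recipes/by-course/beverages/
The relevant part of the HTML from that page is: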
<div class="recipeBox ">
<a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/" rel="bookmark" title="Permanent Link to Catholic Blended Margaritas">
<div class="pic">
<img width="230" height="150" src="http://www.foodfood.com/wp-content/uploads/2012/07/230x150xCatholic-Blended-Margaritas-230x150.jpg.pagespeed.ic.p_7Vr37LwJ.jpg" class="post_img_thumb wp-post-image" alt="Catholic-Blended-Margaritas" title="Catholic-Blended-Margaritas"/> </div>
<div class="detailBox">
<h3>Catholic Blended Margaritas</h3>
<p><p>Blended Margaritas is a delicious drink which can be enjoyed on any festive</p>
</p>
<div class="timer">5 Mins</div>
<a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/?comments=1#comments_det"><span class="comments">No Comments</span> </a>
</div>
</a>
</div>
Note that the text "Catholic Blended Margaritas" (the text I want) is nested inside two <div> tags, which seems to be what is causing the problem.
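
To narrow the problem down, here is a minimal sketch that runs the same expression against a trimmed, well-formed copy of the snippet. It uses the JDK's javax.xml.xpath API rather than HtmlCleaner, so it only illustrates how standard XPath 1.0 treats this markup; the class name XPathSketch, the constants, and the looser expression are hypothetical names I made up for illustration. Two things stand out in the markup: the class attribute is "recipeBox " with a trailing space, so an exact @class='recipeBox' comparison does not match, and the detailBox div sits inside an <a> element, so it is not a direct child of the recipeBox div. Whether HtmlCleaner's own XPath evaluator accepts contains(), or keeps the <a> wrapper in place after cleaning, is something I have not verified.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {
    // Trimmed, well-formed copy of the markup from the question.
    // Note the trailing space in class="recipeBox " and the <a> wrapper.
    static final String SAMPLE =
        "<div class=\"recipeBox \">"
      + "<a href=\"http://www.foodfood.com/recipes/catholic-blended-margaritas/\">"
      + "<div class=\"detailBox\"><h3>Catholic Blended Margaritas</h3></div>"
      + "</a></div>";

    // The expression from the question: exact class match, direct child steps.
    static final String STRICT =
        "//div[@class='recipeBox']/div[@class='detailBox']/h3/text()";

    // Hypothetical looser variant: substring class match, descendant step.
    static final String LOOSE =
        "//div[contains(@class,'recipeBox')]//div[@class='detailBox']/h3/text()";

    // Parses the XML string and returns the string value of every matched text node.
    static List<String> evaluate(String xml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList hits = (NodeList) XPathFactory.newInstance()
            .newXPath()
            .evaluate(expr, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            out.add(hits.item(i).getNodeValue());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("strict: " + evaluate(SAMPLE, STRICT));
        System.out.println("loose:  " + evaluate(SAMPLE, LOOSE));
    }
}
```

With the standard evaluator, the strict expression matches nothing on this sample while the looser one finds the heading text, which suggests the trailing space and the intervening <a> are worth checking first in the HtmlCleaner case as well.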