
I want to extract the text "Catholic Blended Margaritas" from the section of the HTML page pasted below.

For this I am using the following XPath expression:

xPath = "//div[@class='recipeBox']/div[@class='detailBox']/h3/text()";

I pass it to HtmlCleaner; the relevant part of my code is:

    // Use the cleaner to "clean" the HTML and return it as a TagNode object, i.e. the HTML page root node
    TagNode rootNode = htmlCleaner.clean(new InputStreamReader(conn.getInputStream()));

    // Evaluate the XPath expression against the root node
    Object[] nodes = rootNode.evaluateXPath(xPath);

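However, the expression above returns zero nodes.

I have pasted only part of the HTML. In fact, I want the text of every such node on the page. For reference, the page is: http://www.foodfood.com/category/recipes/by-course/beverages/

The relevant part of the HTML at that link is: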

<div class="recipeBox ">
        <a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/" rel="bookmark" title="Permanent Link to Catholic Blended Margaritas">
            <div class="pic">
                <img width="230" height="150" src="http://www.foodfood.com/wp-content/uploads/2012/07/230x150xCatholic-Blended-Margaritas-230x150.jpg.pagespeed.ic.p_7Vr37LwJ.jpg" class="post_img_thumb wp-post-image" alt="Catholic-Blended-Margaritas" title="Catholic-Blended-Margaritas"/>             </div>
            <div class="detailBox">
                <h3>Catholic Blended Margaritas</h3>
                <p><p>Blended Margaritas is a delicious drink which can be enjoyed on any festive</p>
</p>
                <div class="timer">5 Mins</div>
                <a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/?comments=1#comments_det"><span class="comments">No Comments</span> </a>
            </div>
        </a>
    </div>

Note that the text "Catholic Blended Margaritas" (the text I want) is nested inside two <div> tags, which is what is giving me trouble.


1 Answer


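I see 2 problems with //div[@class='recipeBox']//div[@class='detailBox']/h3/text() against your sample page:

  • The trailing space in the "class" attribute of <div class="recipeBox ">, so an exact comparison with 'recipeBox' fails
  • The target elements are nested inside the <a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/" rel="bookmark" title="Permanent Link to Catholic Blended Margaritas"> link, so the inner <div> is not a direct child of the outer one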

So I suggest you try //div[normalize-space(@class)='recipeBox']//div[@class='detailBox']/h3/text()
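
To see both problems in isolation, here is a minimal sketch that runs the question's expression and the suggested one against a simplified copy of the fragment. It deliberately uses the JDK's built-in javax.xml.xpath evaluator rather than HtmlCleaner, so it only checks the XPath logic. Whether HtmlCleaner's own evaluateXPath supports normalize-space() is something I have not verified and would test separately. The class name RecipeXPathDemo and the trimmed SAMPLE markup are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RecipeXPathDemo {
    // Simplified, well-formed stand-in for the fragment in the question (assumption:
    // only the parts that matter are kept: the trailing space and the <a> wrapper).
    static final String SAMPLE =
          "<div class=\"recipeBox \">"
        + "<a href=\"http://www.foodfood.com/recipes/catholic-blended-margaritas/\">"
        + "<div class=\"detailBox\">"
        + "<h3>Catholic Blended Margaritas</h3>"
        + "</div></a></div>";

    // Expression from the question: exact class match, direct-child step.
    static final String ORIGINAL =
        "//div[@class='recipeBox']/div[@class='detailBox']/h3/text()";

    // Expression from the answer: whitespace-normalized class, descendant step.
    static final String SUGGESTED =
        "//div[normalize-space(@class)='recipeBox']//div[@class='detailBox']/h3/text()";

    // Evaluates an XPath 1.0 expression against SAMPLE and returns the matching nodes.
    static NodeList select(String expr) {
        try {
            return (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                expr, new InputSource(new StringReader(SAMPLE)), XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expr, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("original:  " + select(ORIGINAL).getLength() + " node(s)");
        NodeList hits = select(SUGGESTED);
        for (int i = 0; i < hits.getLength(); i++) {
            System.out.println("suggested: " + hits.item(i).getNodeValue());
        }
    }
}
```

Fixing only one of the two problems is not enough: keeping the single / after normalizing the class still finds nothing, because the <a> element sits between the two <div> elements.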

answered 2014-01-22T09:01:08.443