我正在尝试使用 Java 学习 Xpath 表达式的用法。我正在使用 Jtidy 将 HTML 页面转换为 XHTML,以便我可以使用 XPath 表达式轻松解析它。我有以下代码:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = ConvertXHTML("https://twitter.com/?lang=fr");
//Create XPath
XPathFactory xpathfactory = XPathFactory.newInstance();
XPath Inst= xpathfactory.newXPath();
NodeList nodes = (NodeList)Inst.evaluate("//p/@align",doc,XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); ++i)
{
Element e = (Element) nodes.item(i);
System.out.println(e);
}
public Document ConvertXHTML(String link){
try{
URL u = new URL(link);
BufferedInputStream instream=new BufferedInputStream(u.openStream());
FileOutputStream outstream=new FileOutputStream("out.xhtml");
Tidy c=new Tidy();
c.setShowWarnings(false);
c.setInputEncoding("UTF-8");
c.setOutputEncoding("UTF-8");
c.setXHTML(true);
return c.parseDOM(instream,outstream);
}
它适用于大多数 URL,但这个:
因为它,我得到了这个例外:
javax.xml.transform.TransformerException:索引 -1 越界.....
下面是我得到的堆栈跟踪的一部分:
javax.xml.transform.TransformerException: Index -1 out of bounds for length 128
at java.xml/com.sun.org.apache.xpath.internal.XPath.execute(XPath.java:366)
at java.xml/com.sun.org.apache.xpath.internal.XPath.execute(XPath.java:303)
at java.xml/com.sun.org.apache.xpath.internal.jaxp.XPathImplUtil.eval(XPathImplUtil.java:101)
at java.xml/com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.eval(XPathExpressionImpl.java:80)
at java.xml/com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:89)
at files.ExampleCode.GetThoselinks(ExampleCode.java:50)
at files.ExampleCode.DoSomething(ExampleCode.java:113)
at files.ExampleCode.GetThoselinks(ExampleCode.java:81)
at files.ExampleCode.DoSomething(ExampleCode.java:113)
我不确定问题是出在网站转换后的 xhtml 还是其他问题上。谁能说出代码中有什么问题?任何编辑都会有所帮助。