xml - Extracting text from an XML file using UIMA

Question

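I am building a text extractor for XML using UIMA. Since I am a beginner with the UIMA framework, I would like to know how to go about it.

I know that UIMA can annotate specific parts of a file, but how do I extract the information efficiently? Any help is appreciated.

Thanks, Jatin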

score 3 · Accepted Answer

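From the limited perspective of a UIMA Ruta developer: I use UIMA Ruta's HtmlAnnotator for these use cases. It is certainly not the most efficient approach. The analysis engine does not use a separate type for each element, because it only knows the most common HTML tags, but where needed I do the transformation to a predefined type system in UIMA Ruta. In the backend, htmlparser is used.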

score 1 · Accepted Answer

Here is a collection reader to get you started:

import static com.google.common.base.Preconditions.checkArgument;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.List;

import org.apache.uima.UimaContext;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.fit.component.JCasCollectionReader_ImplBase;
import org.apache.uima.fit.descriptor.TypeCapability;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.JDOMException;
import org.jdom2.Text;
import org.jdom2.input.SAXBuilder;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;
import org.jdom2.xpath.XPathExpression;
import org.jdom2.xpath.XPathFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;



@TypeCapability(outputs = "xxx")
public class XmlCollectionReader extends JCasCollectionReader_ImplBase {
    private static final Logger LOG = LoggerFactory.getLogger(XmlCollectionReader.class);

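    // Note: inputDir, directoryIterator, fileIterator and the DirectoryIterator
    // helper are used below but not declared in this snippet; they are assumed
    // to be defined elsewhere (for example as configuration parameters).
    private SAXBuilder builder;
    private XMLOutputter xo;
    private XPathExpression<Object> sentenceXPath;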

    @Override
    public void initialize(UimaContext context) throws ResourceInitializationException {
        super.initialize(context);
        try {
            File corpusDir = new File(inputDir);
            checkArgument(corpusDir.exists());
            fileIterator = DirectoryIterator.get(directoryIterator, corpusDir, "xml", false);
            builder = new SAXBuilder();
            xo = new XMLOutputter();
            xo.setFormat(Format.getRawFormat());
            sentenceXPath = XPathFactory.instance().compile("//S");
        } catch (Exception e) {
            // Pass the cause along so the underlying failure is not lost
            throw new ResourceInitializationException(
                    ResourceInitializationException.NO_RESOURCE_FOR_PARAMETERS,
                    new Object[] { inputDir }, e);
        }
    }

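    // Note: the base class also requires hasNext() and getProgress();
    // they are omitted from this snippet.
    @Override
    public void getNext(JCas jcas) throws IOException, CollectionException {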

        File file = fileIterator.next();
        try {
            LOG.debug("reading {}", file.getName());
            // build(File) lets JDOM open and close the file itself,
            // so no input stream is left open
            Document doc = builder.build(file);
            Element rootNode = doc.getRootElement();

            String title = xo.outputString(rootNode.getChild("Title").getContent());

            for (Object sentence : sentenceXPath.evaluate(rootNode)) {
                Element sentenceE = (Element) sentence;
                ...
            }

            jcas.setDocumentText(...);

        } catch (JDOMException e) {
            throw new CollectionException(e);
        }
    }
}
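
The reader above depends on JDOM2 and uimaFIT and leaves the actual text assembly elided. To illustrate only the extraction step, here is a minimal sketch that uses the XPath API shipped with the JDK instead of JDOM2. The element name S comes from the answer; the class name XmlSentenceExtractor, the Span type, and the sample XML are hypothetical and exist only for this illustration. It concatenates the text of every S element and records each sentence's begin/end offsets in the resulting string.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper: joins the text of all S elements and tracks their offsets. */
final class XmlSentenceExtractor {

    /** Character span of one sentence inside the concatenated text. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    final StringBuilder text = new StringBuilder();
    final List<Span> sentences = new ArrayList<>();

    static XmlSentenceExtractor parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Same XPath as in the answer: every S element in the document
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//S", doc, XPathConstants.NODESET);

            XmlSentenceExtractor result = new XmlSentenceExtractor();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() drops nested markup and keeps only the text
                String sentence = nodes.item(i).getTextContent().trim();
                if (result.text.length() > 0) {
                    result.text.append(' '); // separator between sentences
                }
                int begin = result.text.length();
                result.text.append(sentence);
                result.sentences.add(new Span(begin, result.text.length()));
            }
            return result;
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not extract text from XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Doc><Title>Demo</Title>"
                + "<S>First <b>sentence</b>.</S><S>Second one.</S></Doc>";
        XmlSentenceExtractor r = parse(xml);
        System.out.println(r.text);
        for (Span s : r.sentences) {
            System.out.println(s.begin + "-" + s.end + ": "
                    + r.text.substring(s.begin, s.end));
        }
    }
}
```

In a collection reader like the one above, the concatenated string is what one would presumably pass to jcas.setDocumentText, and the recorded offsets could then be used to create one annotation per sentence. Which annotation type to use depends on your type system and is not specified in the answer.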
