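I'm working on a side project that applies NLP to clinical data, and I'm using Java's BreakIterator to split text into sentences for further analysis. The problem is that BreakIterator does not recognize sentences that begin with a number.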

Example:

String text = "1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level. This is another sentence.";

Expected output:

1) No acute osseous abnormality.
2) Mild to moderate disc space narrowing at the L4-5 level.
This is another sentence.

Actual output:

1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level.
This is another sentence.

Code:

import java.text.BreakIterator;
import java.util.Locale;

public class Test {
    public static void main(String[] args) {
        String text = "1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level. This is another sentence.";
        Locale locale = Locale.US;
        BreakIterator splitIntoSentences = BreakIterator.getSentenceInstance(locale);
        splitIntoSentences.setText(text);
        // Start offset of the sentence currently being collected
        int index = 0;
        // Each call to next() advances to the following sentence boundary
        while (splitIntoSentences.next() != BreakIterator.DONE) {
            String sentence = text.substring(index, splitIntoSentences.current());
            System.out.println(sentence);
            index = splitIntoSentences.current();
        }
    }
}

Any help would be greatly appreciated. I tried to find an answer online, but without success.

1 Answer

I'm now using Apache OpenNLP instead of BreakIterator, and it works well!
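For anyone who would rather stay with the JDK, here is a minimal sketch of a different workaround (not the OpenNLP approach above): keep BreakIterator, then split each segment again wherever sentence-ending punctuation and whitespace are followed by a numbered-list marker. The class name NumberedSentenceSplitter, the split helper, and the regular expression are my own illustrative choices, and the code assumes list items always look like digits followed by ")". Real clinical text will likely need a broader pattern.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK or of OpenNLP.
public class NumberedSentenceSplitter {

    // Assumption: a list item is "digits + ')'" and directly follows
    // sentence-ending punctuation plus whitespace, e.g. ". 2) ".
    private static final Pattern BEFORE_LIST_MARKER =
            Pattern.compile("(?<=[.!?])\\s+(?=\\d+\\)\\s)");

    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);

        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Segment as BreakIterator sees it; it may still hold several list items
            String chunk = text.substring(start, end);
            // Second pass: cut the segment in front of every numbered-list marker
            for (String part : BEFORE_LIST_MARKER.split(chunk)) {
                String trimmed = part.trim();
                if (!trimmed.isEmpty()) {
                    sentences.add(trimmed);
                }
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "1) No acute osseous abnormality. 2) Mild to moderate disc space "
                + "narrowing at the L4-5 level. This is another sentence.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

With the sample string from the question, this should print the three lines listed under "Expected output". The second pass only adds boundaries and never merges segments, so it should be harmless on text that BreakIterator already splits correctly.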

answered 2020-11-24T04:20:09.023