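I am working on a side project that applies NLP to clinical data, and I am using Java's BreakIterator to split text into sentences for further analysis. The problem is that BreakIterator does not recognize a sentence boundary when the next sentence begins with a number.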
Example:
String text = "1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level. This is another sentence.";
Expected output:
1) No acute osseous abnormality.
2) Mild to moderate disc space narrowing at the L4-5 level.
This is another sentence.
Actual output:
1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level.
This is another sentence.
Code:
import java.text.BreakIterator;
import java.util.Locale;

public class Test {
    public static void main(String[] args) {
        String text = "1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level. This is another sentence.";
        Locale locale = Locale.US;

        // Sentence-level iterator for the chosen locale
        BreakIterator splitIntoSentences = BreakIterator.getSentenceInstance(locale);
        splitIntoSentences.setText(text);

        // index marks the start of the sentence currently being extracted
        int index = 0;
        while (splitIntoSentences.next() != BreakIterator.DONE) {
            String sentence = text.substring(index, splitIntoSentences.current());
            System.out.println(sentence);
            index = splitIntoSentences.current();
        }
    }
}
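A possible stopgap, offered only as a sketch and not as a proper fix: keep BreakIterator for the general case, then split each returned chunk again wherever sentence-ending punctuation is followed by a list marker such as "2)". The class name NumberedSentenceSplitter and the regular expression are hypothetical, and the approach assumes every numbered item starts with digits followed by a closing parenthesis.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class NumberedSentenceSplitter {
    // Assumption: a list item looks like "<digits>)". The pattern matches the
    // whitespace between sentence-ending punctuation and such a marker.
    private static final Pattern NUMBERED_ITEM_BOUNDARY =
            Pattern.compile("(?<=[.!?])\\s+(?=\\d+\\))");

    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);

        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String chunk = text.substring(start, end);
            // Second pass: break the chunk at any numbered-item boundary
            // that BreakIterator left unsplit.
            for (String piece : NUMBERED_ITEM_BOUNDARY.split(chunk)) {
                String trimmed = piece.trim();
                if (!trimmed.isEmpty()) {
                    sentences.add(trimmed);
                }
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "1) No acute osseous abnormality. 2) Mild to moderate disc space "
                + "narrowing at the L4-5 level. This is another sentence.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

On the example text this should yield the three sentences listed under the expected output, with surrounding whitespace trimmed. It would still miss other numbering styles (for instance "2." or "(2)"), so I would prefer a solution that works at the BreakIterator level.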
Any help would be greatly appreciated. I have searched online for an answer without success.