java - BreakIterator does not handle Chinese text correctly

Question

I am using BreakIterator.getWordInstance to split Chinese text into words. Here is my example:

import java.text.BreakIterator;
import java.util.Locale;

public class Sample {
    public static void main(String[] args) {
        String stringToExamine = "I like to eat apples. 我喜欢吃苹果。";

        //print each word in order
        BreakIterator boundary = BreakIterator.getWordInstance(new Locale("zh", "CN"));
        boundary.setText(stringToExamine);

        printEachForward(boundary, stringToExamine);
    }

    public static void printEachForward(BreakIterator boundary, String source) {
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
            System.out.println(start + ": " + source.substring(start, end));
        }
    }
}

My sample text comes from https://stackoverflow.com/a/42219474/954439

The output I get is:

0: I
1:  
2: like
6:  
7: to
9:  
10: eat
13:  
14: apples
20: .
21:  
22: 我喜欢吃苹果
28: 。

whereas the expected output is:

0 I
1  
2 like
6  
7 to
9  
10 eat
13  
14 apples
20 .
21  
22 我
23 喜欢
25 吃
26 苹果
28 。

I have even tried text that is purely Chinese, but the words are broken only at spaces and punctuation characters.

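I am programming for the server side, so the size of the jar file is not a big concern. I am trying to count how many words in a given piece of content differ from a sample piece of content, using the longest common subsequence (computed over words rather than characters).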

What am I doing wrong?
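For reference, the word-level comparison mentioned in the question can be sketched roughly as follows. This is only a minimal illustration, not the asker's code: the class name WordDiff, both method names, and the definition of "differing words" (words of the given content that fall outside the longest common subsequence) are assumptions. The input is taken as already segmented word lists, which in practice would come from a word BreakIterator.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not from the question): word-level diff count via LCS.
class WordDiff {

    // Length of the longest common subsequence of two word lists,
    // computed with the classic O(n * m) dynamic-programming table.
    static int lcsLength(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.size()][b.size()];
    }

    // Assumed definition: words of `given` that are not part of the
    // longest common subsequence shared with `sample`.
    static int differingWords(List<String> sample, List<String> given) {
        return given.size() - lcsLength(sample, given);
    }

    public static void main(String[] args) {
        // Pre-segmented input; real input would come from a word segmenter.
        List<String> sample = Arrays.asList("我", "喜欢", "吃", "苹果");
        List<String> given = Arrays.asList("我", "喜欢", "吃", "香蕉");
        System.out.println(differingWords(sample, given)); // prints 1
    }
}
```

The result depends entirely on how the text is split: if the Chinese sentence arrives as a single token, any change inside it counts as one differing word.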

score 6 · Accepted Answer

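The standard BreakIterator does not support detecting "word" boundaries within an unbroken string of CJK ideographs. There is a bug report on this topic, but it was closed as "Won't Fix" in 2006.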

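Instead, you need to use the ICU implementation. If you are developing on Android, you already have it as android.icu.text.BreakIterator. Otherwise, you need to download the ICU4J library from http://site.icu-project.org/download, which provides it as com.ibm.icu.text.BreakIterator.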
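A minimal sketch of that switch, assuming the ICU4J jar is on the classpath. The class name IcuWordSplit and the method segments are hypothetical; the loop is the one from the question, with only the BreakIterator import changed. The exact split of the Chinese part depends on the dictionary shipped with the ICU version in use, so no output is shown here.

```java
import com.ibm.icu.text.BreakIterator;

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; needs ICU4J (com.ibm.icu) on the classpath.
class IcuWordSplit {

    // Returns every segment between consecutive word boundaries,
    // including whitespace and punctuation segments.
    static List<String> segments(String text) {
        BreakIterator boundary = BreakIterator.getWordInstance(Locale.SIMPLIFIED_CHINESE);
        boundary.setText(text);
        List<String> out = new ArrayList<>();
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "I like to eat apples. 我喜欢吃苹果。";
        int offset = 0;
        // Print each segment with its start offset, as in the question.
        for (String segment : segments(text)) {
            System.out.println(offset + ": " + segment);
            offset += segment.length();
        }
    }
}
```

On Android the same code should work with the import changed to android.icu.text.BreakIterator.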
