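Sometimes the data you get is not clean, and the same word shows up in different forms: variations, misspellings, or deliberately distorted spellings. Can we find the occurrences in a sentence that are most similar to a given word?
For example, if I am looking for the word "Awesome", it may appear in sentences as variants such as: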
"We had an awwweesssommmeeee dinner at sea resort"
"We had an awesomeeee dinner at sea resort"
"We had an awwesooomee dinner at sea resort"
etc.
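Do you want to do this purely in SQL?
Otherwise, you will need a fuzzy-matching string comparison function that you can call from SQL. Such a function would use some combination of edit-distance and similarity algorithms such as Jaro-Winkler, Levenshtein, or n-grams, or phonetic matching such as Metaphone, Double Metaphone, Metaphone 3, or Soundex.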
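To make the n-gram option concrete, here is a minimal sketch in Java (not SQL) that scores two words by the Jaccard overlap of their character bigrams. The class and method names are made up for illustration, and Jaccard over bigrams is only one of several reasonable n-gram measures.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, for illustration only.
final class BigramSimilarity {

    // Collects the distinct character bigrams of the lowercased text.
    static Set<String> bigrams(final String text) {
        final String s = text.toLowerCase(Locale.ROOT);
        final Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i += 1) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    // Jaccard similarity |A ∩ B| / |A ∪ B| of the two bigram sets, in [0, 1].
    static double similarity(final String a, final String b) {
        final Set<String> gramsA = bigrams(a);
        final Set<String> gramsB = bigrams(b);
        if (gramsA.isEmpty() && gramsB.isEmpty()) {
            // Both strings are shorter than two characters: fall back to equality.
            return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
        }
        final Set<String> intersection = new HashSet<>(gramsA);
        intersection.retainAll(gramsB);
        final Set<String> union = new HashSet<>(gramsA);
        union.addAll(gramsB);
        return (double) intersection.size() / union.size();
    }

    public static void main(final String... args) {
        System.out.println(similarity("awesome", "awesomeeee"));
        System.out.println(similarity("awesome", "awwweesssommmeeee"));
        System.out.println(similarity("awesome", "dinner"));
    }
}
```

Because repeated letters only add a few extra bigrams ("ee", "ww", "ss", "mm"), the stretched variants keep a high overlap with "awesome" (6 of 7 and 6 of 10 bigrams shared), while "dinner" shares none.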
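Depending on which version of SQL Server you use, you can install and use the data quality components, which include custom CLR implementations of some of these algorithms. Or the SSIS fuzzy matching components (Fuzzy Lookup, Fuzzy Grouping). Or...
Personally, I have written C# .NET CLR functions to do this for me, but I only deal with names. Sentences are more complicated: you will probably want to split them into words/tokens, compare those as parts, and then compare the sentence as a whole...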
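The "split into tokens and compare the parts" idea can be sketched as follows, in Java rather than as a C# CLR function. This is only an illustration under simple assumptions: tokens are split on whitespace, the comparison is plain Levenshtein distance, and the names `FuzzyTokenMatch` and `closestToken` are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class FuzzyTokenMatch {

    // Dynamic-programming Levenshtein distance; insert, delete and substitute each cost 1.
    static int levenshtein(final String a, final String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j += 1) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i += 1) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j += 1) {
                final int substitution =
                    prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(substitution, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            final int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    // Returns the lowercased whitespace-delimited token closest to target.
    // On ties the earliest token wins; an empty sentence yields "".
    static String closestToken(final String sentence, final String target) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (final String token : sentence.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            final int distance = levenshtein(token, target);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = token;
            }
        }
        return best;
    }

    public static void main(final String... args) {
        System.out.println(
            closestToken("We had an awesomeeee dinner at sea resort", "awesome"));
    }
}
```

Raw edit distance has a clear limit here: "awesomeeee" is 3 edits from "awesome", but "awwweesssommmeeee" is 10 edits away, farther than the unrelated "resort" (4 edits). For heavily stretched words it helps to normalize the tokens first, or to combine the distance with another measure.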
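As a quick-and-dirty solution, you can lowercase the documents, tokenize them on whitespace, and collapse the runs of consecutive repeated characters in each term: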
import java.util.Map;
import java.util.Scanner;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class CollapseConsecutiveCharsDemo {

  // Collapses each run of identical consecutive characters into a single
  // character, e.g. "awwesooomee" becomes "awesome".
  public static String collapse(final String term) {
    final StringBuilder buffer = new StringBuilder();
    if (!term.isEmpty()) {
      char prev = term.charAt(0);
      buffer.append(prev);
      for (int i = 1; i < term.length(); i += 1) {
        final char curr = term.charAt(i);
        if (curr != prev) {
          buffer.append(curr);
          prev = curr;
        }
      }
    }
    return buffer.toString();
  }

  // Each command-line argument is treated as one document.
  public static void main(final String... documents) {
    // Maps each collapsed term to the sorted set of original spellings seen.
    final Map<String, Set<String>> termVariations = new TreeMap<>();
    for (final String document : documents) {
      // Scanner splits on whitespace by default.
      final Scanner scanner = new Scanner(document.toLowerCase());
      while (scanner.hasNext()) {
        final String expandedTerm = scanner.next();
        final String collapsedTerm = collapse(expandedTerm);
        Set<String> variations = termVariations.get(collapsedTerm);
        if (null == variations) {
          variations = new TreeSet<String>();
          termVariations.put(collapsedTerm, variations);
        }
        variations.add(expandedTerm);
      }
    }
    // Print every collapsed term with the variations that map to it.
    for (final Map.Entry<String, Set<String>> entry : termVariations.entrySet()) {
      final String term = entry.getKey();
      final Set<String> variations = entry.getValue();
      System.out.printf("variations(\"%s\") = {%s}%n",
          term,
          variations.stream()
              .map((variation) -> String.format("\"%s\"", variation))
              .collect(Collectors.joining(", ")));
    }
  }
}
Example run:
% java CollapseConsecutiveCharsDemo "We had an awwweesssommmeeee dinner at sea resort" "We had an awesomeeee dinner at sea resort" "We had an awwesooomee dinner at sea resort"
variations("an") = {"an"}
variations("at") = {"at"}
variations("awesome") = {"awesomeeee", "awwesooomee", "awwweesssommmeeee"}
variations("diner") = {"dinner"}
variations("had") = {"had"}
variations("resort") = {"resort"}
variations("sea") = {"sea"}
variations("we") = {"we"}
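For a more elaborate solution, you could tokenize your documents with the Stanford CoreNLP tokenizer, which handles punctuation properly, and combine that with spelling correction (e.g. liblevenshtein).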
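As a rough illustration of that direction, the sketch below uses plain Java stand-ins rather than the libraries named above: a regular-expression split replaces the CoreNLP tokenizer, and a lookup keyed on the collapsed form of each dictionary word replaces a real spelling corrector. The class name, the `normalize` method, and the tiny dictionary are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class NormalizeWithDictionary {

    // Collapses each run of identical consecutive characters into one.
    static String collapse(final String term) {
        final StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < term.length(); i += 1) {
            if (i == 0 || term.charAt(i) != term.charAt(i - 1)) {
                buffer.append(term.charAt(i));
            }
        }
        return buffer.toString();
    }

    // Lowercases the document, splits it on anything that is not a letter or
    // digit, and replaces each token whose collapsed form matches the
    // collapsed form of a dictionary word with that word. Unknown tokens are
    // kept unchanged.
    static List<String> normalize(final String document, final List<String> dictionary) {
        final Map<String, String> byCollapsedForm = new HashMap<>();
        for (final String word : dictionary) {
            byCollapsedForm.put(collapse(word), word);
        }
        final List<String> result = new ArrayList<>();
        final String lowered = document.toLowerCase(Locale.ROOT);
        for (final String token : lowered.split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue; // produced when the document starts with punctuation
            }
            result.add(byCollapsedForm.getOrDefault(collapse(token), token));
        }
        return result;
    }

    public static void main(final String... args) {
        final List<String> dictionary = Arrays.asList("awesome", "dinner", "resort");
        System.out.println(
            normalize("We had an awwweesssommmeeee dinner, at sea resort!!!", dictionary));
    }
}
```

Keying the dictionary on the collapsed form also repairs the over-collapsing visible in the example run, where "dinner" became "diner". The sketch still has obvious gaps: two dictionary words with the same collapsed form overwrite each other, contractions such as "don't" are split at the apostrophe, and genuine misspellings (as opposed to stretched letters) still need an edit-distance based corrector.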