r - R: Correcting strings via distance measurement (stringdistmatrix)

Question

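I am working on a problem where I need to count the unique names of people in a string, while allowing for slight misspellings. My idea is to treat strings whose distance falls below some threshold (e.g. a Levenshtein distance below 2) as equal. So far I have managed to compute the string distances, but I have not been able to change my input strings in a way that gives me the correct number of unique names.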

library(stringdist);library(stringr)
names<-"Michael, Liz, Miichael, Maria"
names_split<-strsplit(names, ", ")[[1]]
stringdistmatrix(names_split,names_split)
     [,1] [,2] [,3] [,4]
[1,]    0    6    1    5
[2,]    6    0    7    4
[3,]    1    7    0    6
[4,]    5    4    6    0
(number_of_people<-str_count(names, ",")+1)
[1] 4

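The correct value of number_of_people should of course be 3.

Since I am only interested in the number of unique names, I do not care whether "Michael" is replaced by "Miichael" or the other way around.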

score 0 · Accepted Answer

One option is to cluster the names based on the distance matrix:

library(stringdist)
# create a 'dist' object (=lower triangular part of distance matrix)
d <- stringdistmatrix(names_split,method="osa")
# use hierarchical clustering to group nearest neighbors
hc <- hclust(d)
# visual inspection: y-axis labels the distance value
plot(hc)
# decide what distance value you find acceptable for grouping.
cutree(hc, h=3)
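
To get from the cluster assignment to the count asked for in the question, one possible sketch is to count the distinct cluster ids that cutree returns. The choice of single linkage and the cut height h = 1 are assumptions made here to mirror the "distance below 2" rule from the question; they are not part of the original answer, and the variable names groups and number_of_people are only illustrative.

```r
library(stringdist)

names <- "Michael, Liz, Miichael, Maria"
names_split <- strsplit(names, ", ")[[1]]

# Pairwise OSA distances as a 'dist' object (lower triangle only)
d <- stringdistmatrix(names_split, method = "osa")

# Single linkage (an assumption) merges two groups as soon as any pair of
# their members is close, so cutting at h = 1 joins names that are linked
# by distances below 2
hc <- hclust(d, method = "single")
groups <- cutree(hc, h = 1)

# One cluster id per input name; the number of distinct ids is the count
number_of_people <- length(unique(groups))
number_of_people

# Show which spellings ended up in the same group
split(names_split, groups)
```

Note that single linkage can chain: if A is close to B and B is close to C, all three end up in one group even when A and C are far apart. With the default complete linkage, every pair inside a group stays within the cut height instead.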

Depending on your actual data, you will need to experiment with the distance type (qgrams/cosine may be useful, or the jaro-winkler distance in the case of names).
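
As a rough illustration of swapping the distance type, the sketch below repeats the clustering with the Jaro-Winkler method ("jw") of stringdist. The prefix weight p = 0.1 and the cut height 0.15 are assumptions chosen for this toy input, not recommended values; Jaro-Winkler distances lie between 0 and 1, so the cut height has to be chosen on that scale.

```r
library(stringdist)

names_split <- c("Michael", "Liz", "Miichael", "Maria")

# Jaro-Winkler distance; p is the prefix weight (p = 0 gives plain Jaro)
d_jw <- stringdistmatrix(names_split, method = "jw", p = 0.1)

# Same clustering step as before, with hclust's default complete linkage
hc_jw <- hclust(d_jw)

# The cut height 0.15 is an illustrative guess; inspect plot(hc_jw) first
groups_jw <- cutree(hc_jw, h = 0.15)

# Names grouped by cluster id, and the resulting count of distinct names
split(names_split, groups_jw)
length(unique(groups_jw))
```

Because the scale differs between metrics, a threshold that works for OSA or Levenshtein distances cannot be reused as is; it is worth re-checking the dendrogram whenever the method changes.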
