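The R script below computes the percentage similarity between two names, as shown. There are two columns, "names1" and "names2", with their respective ids in id1 and id2. My requirement is that when the script runs, every name in "names1" is compared with every name in the "names2" column, except that I do not want an entry compared with its identical counterpart, i.e. an (id1, names1) entry with the matching (id2, names2) entry. For illustration, the first (id1, names1) entry (1, Prabhudev Ramanujam) should be compared with all (id2, names2) entries except the first (id2, names2) entry. The same applies to all pairs. Also, can the formula
percent(sapply(names1, function(i)RecordLinkage::levenshteinSim(i,names2)))
be tweaked to produce similar but faster results here? It slows down considerably on large data. A snapshot is attached. Please help.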
library(stringdist)
library(RecordLinkage)
library(dplyr)
library(scales)
id1 <- 1:8
names1 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "SriramKishore Sharma",
            "Deepak Subramaniam", "Sangamer Mahapatra")
id2 <- c(1,2,3,4,11,13,9,10)
names2 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam")
Name_Data <- data.frame(id1,names1,id2,names2)
Percent <- percent(sapply(names1, function(i)
  RecordLinkage::levenshteinSim(i, names2)))
Total_Value <- data.frame(id2, names2, Percent)
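One possible way to get the behaviour described above is sketched below. It is a sketch under assumptions, not a definitive answer. It assumes "the same entry" means a pair where both the id and the name match, and it swaps the per-name `sapply` loop for a single call to `stringdist::stringdistmatrix`, normalising each distance by the longer string's length, which should give the same kind of Levenshtein similarity. The names `dist_mat`, `max_len`, `sim_mat`, `pairs` and `result` are made up for illustration.

```r
library(stringdist)

id1 <- 1:8
names1 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "SriramKishore Sharma",
            "Deepak Subramaniam", "Sangamer Mahapatra")
id2 <- c(1, 2, 3, 4, 11, 13, 9, 10)
names2 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam")

# All pairwise Levenshtein distances in one vectorised call:
# rows follow names1, columns follow names2.
dist_mat <- stringdistmatrix(names1, names2, method = "lv")

# Normalise by the longer string of each pair to get a similarity in [0, 1].
max_len <- outer(nchar(names1), nchar(names2), pmax)
sim_mat <- 1 - dist_mat / max_len

# Long format: one row per (names1, names2) combination.
pairs <- expand.grid(i = seq_along(names1), j = seq_along(names2))
result <- data.frame(
  id1 = id1[pairs$i], names1 = names1[pairs$i],
  id2 = id2[pairs$j], names2 = names2[pairs$j],
  similarity = sim_mat[cbind(pairs$i, pairs$j)],
  stringsAsFactors = FALSE
)

# Assumption: an entry is "the same" when both id and name are equal,
# so those self-comparisons are dropped.
result <- result[!(result$id1 == result$id2 &
                   result$names1 == result$names2), ]

# Format as a percentage string, similar in spirit to scales::percent().
result$percent <- sprintf("%.1f%%", 100 * result$similarity)
```

With very large inputs the full distance matrix may not fit in memory. In that case, processing `names1` in chunks would be one option, though that is not shown here.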