
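The R script below computes the percentage similarity between two names, as shown in the snapshot. There are two columns, "names1" and "names2", with their respective ids in id1 and id2. My requirement is that when the script runs, every name in "names1" is compared with every name in "names2", except that an entry must not be compared with itself, i.e. an (id1, names1) pair must not be compared with an identical (id2, names2) pair. For illustration, the first (id1, names1) entry (1, Prabhudev Ramanujam) should be compared with all (id2, names2) entries except the first one, which has the same id and name. The same applies to every other pair. Also, if the formula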

percent(sapply(names1, function(i)RecordLinkage::levenshteinSim(i,names2))) 

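can be tweaked to give a similar but faster result, that would help, because it slows down on large data. A snapshot is attached. Please help.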

library(stringdist)
library(RecordLinkage)
library(dplyr)
library(scales)
id1    <- 1:8
names1 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "SriramKishore Sharma",
            "Deepak Subramaniam", "Sangamer Mahapatra")
id2    <- c(1, 2, 3, 4, 11, 13, 9, 10)
names2 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam")
Name_Data <- data.frame(id1, names1, id2, names2)
Percent <- percent(sapply(names1, function(i)
  RecordLinkage::levenshteinSim(i, names2)))
Total_Value <- data.frame(id2, names2, Percent)

[Image: snapshot visual]


1 Answer


It won't be much faster, but my suggestion is:

percent(unlist(lapply(1:length(names1), function(x) {
  levenshteinSim(names1[x], names2[!(names2==names1[x] & id2==id1[x])])})))

Edit:

Alternatively, this might be faster, though I imagine it will vary with the data:

sim  <- 1 - stringdistmatrix(names1, names2, method = "lv") /
  outer(nchar(names1), nchar(names2), pmax)
keep <- unlist(lapply(seq_along(names1), function(x) !(names2 == names1[x] & id2 == id1[x])))
as.vector(t(sim))[keep]
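For what it's worth, here is a rough base-R sketch of the same matrix idea that also keeps the ids next to each score. It is not the original approach: it swaps in `utils::adist` for the Levenshtein distance so no extra packages are needed, and it assumes the similarity is 1 minus the distance divided by the longer name's length. The names `dist_mat`, `sim_mat`, `same_entry` and `result` are made up for illustration.

```r
# Sample data from the question
id1    <- 1:8
names1 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "SriramKishore Sharma",
            "Deepak Subramaniam", "Sangamer Mahapatra")
id2    <- c(1, 2, 3, 4, 11, 13, 9, 10)
names2 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam")

# Levenshtein distance matrix: rows follow names1, columns follow names2
dist_mat <- adist(names1, names2)

# Assumed similarity definition: 1 - distance / length of the longer string
sim_mat <- 1 - dist_mat / outer(nchar(names1), nchar(names2), pmax)

# TRUE where a pair is the same entry (same id AND same name)
same_entry <- outer(id1, id2, "==") & outer(names1, names2, "==")

# Flatten to long format; matrices are stored column-major, so the
# names1 index varies fastest and the names2 index varies slowest
result <- data.frame(
  id1        = rep(id1, times = length(id2)),
  names1     = rep(names1, times = length(names2)),
  id2        = rep(id2, each = length(id1)),
  names2     = rep(names2, each = length(names1)),
  similarity = as.vector(sim_mat)
)

# Drop the self-comparisons
result <- result[!as.vector(same_entry), ]
```

Wrapping `result$similarity` in `scales::percent()` should give the same kind of formatted output as in the question. Whether this is actually faster than `stringdistmatrix` on large data is something you would need to benchmark.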
answered 2018-01-06T13:35:09.000