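I have two sample data frames, df1 and df2, shown below.

df1 has a list of selected tennis match fixtures, with the player names (player1_name, player2_name) and the match date. Full names are used for the players here.

df2 has a list of all tennis match results (winner, loser) for each date. Here, the first letter of the first name and the full surname are used.

The player names for fixtures and results were scraped from different websites, so in some cases the surnames may not match exactly (for example, "Roberto Carballes" in df1 versus "R Carballes Baena" in df2). With that in mind, I would like to add a new column to df1 that says whether player1 or player2 won. Basically, I want to map player1_name and player2_name from df1 to winner and loser in df2 by some kind of partial match, given the same date.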
dput(df1)
structure(list(date = structure(c(18534, 18534, 18534, 18534,
18534, 18534, 18534), class = "Date"), player1_name = c("Laslo Djere",
"Hugo Dellien", "Quentin Halys", "Steve Johnson", "Henri Laaksonen",
"Thiago Monteiro", "Andrej Martin"), player2_name = c("Kevin Anderson",
"Ricardas Berankis", "Marcos Giron", "Roberto Carballes", "Pablo Cuevas",
"Nikoloz Basilashvili", "Joao Sousa")), row.names = c(NA, -7L
), class = "data.frame")
dput(df2)
structure(list(date = structure(c(18534, 18534, 18534, 18534,
18534, 18534, 18534, 18534, 18534, 18534, 18534, 18534, 18534,
18534, 18534, 18534, 18534, 18534, 18534, 18534), class = "Date"),
winner = c("L Harris", "M Berrettini", "M Polmans", "C Garin",
"A Davidovich Fokina", "D Lajovic", "K Anderson", "R Berankis",
"M Giron", "A Rublev", "N Djokovic", "R Carballes Baena",
"A Balazs", "P Cuevas", "T Monteiro", "S Tsitsipas", "D Shapovalov",
"G Dimitrov", "R Bautista Agut", "A Martin"), loser = c("A Popyrin",
"V Pospisil", "U Humbert", "P Kohlschreiber", "H Mayot",
"G Mager", "L Djere", "H Dellien", "Q Halys", "S Querrey",
"M Ymer", "S Johnson", "Y Uchiyama", "H Laaksonen", "N Basilashvili",
"J Munar", "G Simon", "G Barrere", "R Gasquet", "J Sousa"
)), row.names = c(NA, -20L), class = "data.frame")
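I created a custom function that uses the RecordLinkage package to match a string to its closest match in a vector of strings. I could write some super inefficient code with this function, but before going there, I wanted to see whether I can do it in a more efficient way.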
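library(RecordLinkage)  # provides levenshteinSim()

# For each element of `string`, return the closest entry of `stringVector`,
# or NA when the best similarity is below `max_threshold`.
ClosestMatch <- function(string, stringVector, max_threshold = 0.5) {
  matches <- rep(NA_character_, length(string))
  # seq_along() is safe for zero-length input, unlike 1:length(string)
  for (i in seq_along(string)) {
    # Similarity between string[i] and every candidate (higher is closer)
    similarity <- levenshteinSim(string[i], stringVector)
    if (max(similarity) >= max_threshold) {
      matches[i] <- stringVector[which.max(similarity)]
    }
  }
  matches
}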
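One possible direction, as a minimal sketch rather than a definitive solution: first reduce the full names in df1 to the "initial + surname" form used in df2, then score each same-date result row in both orientations (player1 as winner, player2 as winner) and keep the better one. The helper names abbreviate_name, name_similarity and find_winner are hypothetical, and the sketch swaps in base R's adist() for RecordLinkage's levenshteinSim() so that it runs without extra packages. The small data frames are a subset of df1 and df2 above. It still loops over the fixtures, so it illustrates the matching logic, not a faster implementation.

```r
# Hypothetical helper: "Laslo Djere" -> "L Djere" (first initial + rest of name)
abbreviate_name <- function(full_name) {
  sub("^(\\S)\\S*\\s+", "\\1 ", full_name)
}

# Hypothetical helper: normalized Levenshtein similarity, built on base R's
# adist() instead of RecordLinkage. Returns a length(a) x length(b) matrix.
name_similarity <- function(a, b) {
  d <- utils::adist(a, b)
  1 - d / outer(nchar(a), nchar(b), pmax)
}

# For each fixture, look only at results on the same date and decide which
# orientation (player1 won / player2 won) fits best. A result row counts as
# a match only if BOTH names reach `min_similarity`.
find_winner <- function(fixtures, results, min_similarity = 0.5) {
  out <- rep(NA_character_, nrow(fixtures))
  for (i in seq_len(nrow(fixtures))) {
    same_day <- results[results$date == fixtures$date[i], , drop = FALSE]
    if (nrow(same_day) == 0) next
    p1 <- abbreviate_name(fixtures$player1_name[i])
    p2 <- abbreviate_name(fixtures$player2_name[i])
    # Per-row score = the weaker of the two name similarities
    p1_wins <- pmin(as.vector(name_similarity(p1, same_day$winner)),
                    as.vector(name_similarity(p2, same_day$loser)))
    p2_wins <- pmin(as.vector(name_similarity(p2, same_day$winner)),
                    as.vector(name_similarity(p1, same_day$loser)))
    best1 <- max(p1_wins)
    best2 <- max(p2_wins)
    if (max(best1, best2) >= min_similarity) {
      # Ties go to player1; adjust if that is not acceptable
      out[i] <- if (best1 >= best2) "player1" else "player2"
    }
  }
  out
}

# Subset of df1 / df2 from above, for illustration
match_date <- as.Date(18534, origin = "1970-01-01")
fixtures <- data.frame(
  date = match_date,
  player1_name = c("Laslo Djere", "Steve Johnson", "Thiago Monteiro"),
  player2_name = c("Kevin Anderson", "Roberto Carballes", "Nikoloz Basilashvili"),
  stringsAsFactors = FALSE
)
results <- data.frame(
  date = match_date,
  winner = c("K Anderson", "R Carballes Baena", "T Monteiro", "N Djokovic"),
  loser = c("L Djere", "S Johnson", "N Basilashvili", "M Ymer"),
  stringsAsFactors = FALSE
)

fixtures$winner_side <- find_winner(fixtures, results)
print(fixtures)
```

Taking the weaker of the two similarities per row means a result only counts when both players match, which should reduce false positives when two different players share a similar surname on the same day. The 0.5 threshold simply mirrors the default in ClosestMatch and would likely need tuning on real data.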