
Here is sample data of people's full names from two tables that need to be left-joined, where df1 is the left table and df2 is the right:

library(dplyr)  # provides the %>% pipe used below

df1 <- data.frame(fullName = 'Michael Gadson', age = 53) %>%
  rbind(data.frame(fullName = 'Mike Gardnero', age = 43)) %>%
  rbind(data.frame(fullName = 'Nicholas Richards', age = 13)) %>%
  rbind(data.frame(fullName = 'Mikey Richards', age = 53)) %>%
  rbind(data.frame(fullName = 'DeAndre Jamison', age = 28)) %>%
  rbind(data.frame(fullName = 'Anthony Allison', age = 21)) %>%
  rbind(data.frame(fullName = 'Ricky Smith', age = 82)) %>%
  rbind(data.frame(fullName = 'Smith Rickie', age = 60)) %>%
  rbind(data.frame(fullName = 'Johnny Williams', age = 60))

df2 <- data.frame(playerName = 'Mike Gadson', color = 'red') %>%
  rbind(data.frame(playerName = 'Anthony Allison', color = 'green')) %>%
  rbind(data.frame(playerName = 'Mike Gardnero', color = 'purple')) %>%
  rbind(data.frame(playerName = "De Andre' Jamison", color = 'orange')) %>%
  rbind(data.frame(playerName = 'Nicholas Richards III', color = 'yellow')) %>%
  rbind(data.frame(playerName = 'John Kind', color = 'grey')) %>%
  rbind(data.frame(playerName = 'Mike Richards', color = 'white')) %>%
  rbind(data.frame(playerName = 'Rick Smith', color = 'blue')) %>%
  rbind(data.frame(playerName = 'Smith Rickie', color = 'black')) %>%
  rbind(data.frame(playerName = 'Anthony Albados', color = 'violet'))

output_df <- data.frame(fullName = 'Michael Gadson', age = 53, playerName = 'Mike Gadson', color = 'red') %>%
  rbind(data.frame(fullName = 'Mike Gardnero', age = 43, playerName = 'Mike Gardnero', color = 'purple')) %>%
  rbind(data.frame(fullName = 'Nicholas Richards', age = 13, playerName = 'Nicholas Richards III', color = 'yellow')) %>%
  rbind(data.frame(fullName = 'Mikey Richards', age = 53, playerName = 'Mike Richards', color = 'white')) %>%
  rbind(data.frame(fullName = 'DeAndre Jamison', age = 28, playerName = "De Andre' Jamison", color = 'orange')) %>%
  rbind(data.frame(fullName = 'Anthony Allison', age = 21, playerName = 'Anthony Allison', color = 'green')) %>%
  rbind(data.frame(fullName = 'Ricky Smith', age = 82, playerName = 'Rick Smith', color = 'blue')) %>%
  rbind(data.frame(fullName = 'Smith Rickie', age = 60, playerName = 'Smith Rickie', color = 'black')) %>%
  rbind(data.frame(fullName = 'Johnny Williams', age = 60, playerName = NA, color = NA))

> output_df
           fullName age            playerName  color
1    Michael Gadson  53           Mike Gadson    red
2     Mike Gardnero  43         Mike Gardnero purple
3 Nicholas Richards  13 Nicholas Richards III yellow
4    Mikey Richards  53         Mike Richards  white
5   DeAndre Jamison  28     De Andre' Jamison orange
6   Anthony Allison  21       Anthony Allison  green
7       Ricky Smith  82            Rick Smith   blue
8      Smith Rickie  60          Smith Rickie  black
9   Johnny Williams  60                  <NA>   <NA>

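A few comments on the tricky cases / edge cases here:

  • This is a left join, so output_df should have the same number of rows as the left data frame, df1.
  • The left join should not be confused by similar names: Michael Gadson --> Mike Gadson, not one of the other Mike names.
  • The left join should not be confused by reversed names: Ricky Smith --> Rick Smith, not Smith Rickie.
  • The left join should not be confused by suffixes such as III, or by extra spaces or symbols (De Andre' vs DeAndre).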
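One way to take some pressure off the string distance for the suffix and symbol cases would be to normalize both name columns before joining. The following is only a sketch: normalize_name is a hypothetical helper (it is not part of fuzzyjoin), the suffix list is an assumption, and it assumes plain ASCII names. It does not address reversed names or nicknames.

```r
# Hypothetical helper: put names into a rough canonical form before matching.
# Assumes ASCII names; accented letters would be dropped by the second step.
normalize_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", "", x)                    # drop apostrophes and other symbols
  x <- trimws(gsub("\\s+", " ", x))              # collapse repeated spaces, trim the ends
  x <- gsub("\\s+(jr|sr|ii|iii|iv)$", "", x)     # drop a trailing suffix (assumed list)
  x
}

normalize_name("Nicholas Richards III")  # "nicholas richards"
normalize_name("De Andre' Jamison")      # "de andre jamison"
```

The normalized values could go into new columns on df1 and df2 that are then used as the join keys, leaving the original names untouched for the output.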

Edit: I tried the following, with this output:

zed <- fuzzyjoin::stringdist_left_join(x=df1, y=df2, max_dist = 0.3, by=c('fullName'='playerName'), method = 'jaccard')

> zed
            fullName age            playerName  color
1     Michael Gadson  53           Mike Gadson    red
2      Mike Gardnero  43           Mike Gadson    red
3      Mike Gardnero  43         Mike Gardnero purple
4  Nicholas Richards  13 Nicholas Richards III yellow
5     Mikey Richards  53         Mike Richards  white
6    DeAndre Jamison  28     De Andre' Jamison orange
7    Anthony Allison  21       Anthony Allison  green
8        Ricky Smith  82            Rick Smith   blue
9       Smith Rickie  60            Rick Smith   blue
10      Smith Rickie  60          Smith Rickie  black
11   Johnny Williams  60                  <NA>   <NA>

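It does a good job, but it is still not perfect. Most notably, Mike Gardnero and Smith Rickie are duplicated when using jaccard with a max_dist of 0.3, because multiple rows on the right-hand side satisfy the similarity criterion. However, our output should not create these duplicates (perhaps keep only the right-hand row with the highest similarity).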
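The "keep the most similar right-hand row" idea could be sketched roughly as follows. It uses the distance_col argument of stringdist_left_join to expose the distance, then keeps the smallest distance per left-hand row. best_match_left_join and the .row_id column are names made up for this illustration, and the small left/right frames are just a subset of the sample data above.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical helper: fuzzy left join that keeps at most one right-hand row
# per left-hand row, namely the one with the smallest string distance.
best_match_left_join <- function(left, right, by, max_dist = 0.3, method = "jaccard") {
  left %>%
    mutate(.row_id = row_number()) %>%           # row id, so duplicate names in `left` stay separate
    stringdist_left_join(right, by = by, method = method,
                         max_dist = max_dist, distance_col = "dist") %>%
    arrange(.row_id, dist) %>%                   # best match first; NA (no match) sorts last
    distinct(.row_id, .keep_all = TRUE) %>%      # keep the first row per left-hand row
    select(-.row_id)
}

# Subset of the sample data containing the two duplicated cases
left  <- data.frame(fullName = c("Mike Gardnero", "Smith Rickie", "Johnny Williams"),
                    age = c(43, 60, 60))
right <- data.frame(playerName = c("Mike Gadson", "Mike Gardnero", "Rick Smith", "Smith Rickie"),
                    color = c("red", "purple", "blue", "black"))

result <- best_match_left_join(left, right, by = c(fullName = "playerName"))
result
```

This guarantees one output row per left-hand row, but it does not stop two left-hand rows from picking the same right-hand row, and ties in distance are broken arbitrarily by row order.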
