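Here is sample data with people's full names from two tables that need to be left-joined, where `df1` is the left table and `df2` is the right table: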
library(dplyr)  # provides the %>% pipe used below
df1 <- data.frame(fullName = 'Michael Gadson', age = 53) %>%
rbind(data.frame(fullName = 'Mike Gardnero', age = 43)) %>%
rbind(data.frame(fullName = 'Nicholas Richards', age = 13)) %>%
rbind(data.frame(fullName = 'Mikey Richards', age = 53)) %>%
rbind(data.frame(fullName = 'DeAndre Jamison', age = 28)) %>%
rbind(data.frame(fullName = 'Anthony Allison', age = 21)) %>%
rbind(data.frame(fullName = 'Ricky Smith', age = 82)) %>%
rbind(data.frame(fullName = 'Smith Rickie', age = 60)) %>%
rbind(data.frame(fullName = 'Johnny Williams', age = 60))
df2 <- data.frame(playerName = 'Mike Gadson', color = 'red') %>%
rbind(data.frame(playerName = 'Anthony Allison', color = 'green')) %>%
rbind(data.frame(playerName = 'Mike Gardnero', color = 'purple')) %>%
rbind(data.frame(playerName = "De Andre' Jamison", color = 'orange')) %>%
rbind(data.frame(playerName = 'Nicholas Richards III', color = 'yellow')) %>%
rbind(data.frame(playerName = 'John Kind', color = 'grey')) %>%
rbind(data.frame(playerName = 'Mike Richards', color = 'white')) %>%
rbind(data.frame(playerName = 'Rick Smith', color = 'blue')) %>%
rbind(data.frame(playerName = 'Smith Rickie', color = 'black')) %>%
rbind(data.frame(playerName = 'Anthony Albados', color = 'violet'))
# Desired result of the left join
output_df <- data.frame(fullName = 'Michael Gadson', age = 53, playerName = 'Mike Gadson', color = 'red') %>%
rbind(data.frame(fullName = 'Mike Gardnero', age = 43, playerName = 'Mike Gardnero', color = 'purple')) %>%
rbind(data.frame(fullName = 'Nicholas Richards', age = 13, playerName = 'Nicholas Richards III', color = 'yellow')) %>%
rbind(data.frame(fullName = 'Mikey Richards', age = 53, playerName = 'Mike Richards', color = 'white')) %>%
rbind(data.frame(fullName = 'DeAndre Jamison', age = 28, playerName = "De Andre' Jamison", color = 'orange')) %>%
rbind(data.frame(fullName = 'Anthony Allison', age = 21, playerName = 'Anthony Allison', color = 'green')) %>%
rbind(data.frame(fullName = 'Ricky Smith', age = 82, playerName = 'Rick Smith', color = 'blue')) %>%
rbind(data.frame(fullName = 'Smith Rickie', age = 60, playerName = 'Smith Rickie', color = 'black')) %>%
rbind(data.frame(fullName = 'Johnny Williams', age = 60, playerName = NA, color = NA))
> output_df
           fullName age            playerName  color
1    Michael Gadson  53           Mike Gadson    red
2     Mike Gardnero  43         Mike Gardnero purple
3 Nicholas Richards  13 Nicholas Richards III yellow
4    Mikey Richards  53         Mike Richards  white
5   DeAndre Jamison  28     De Andre' Jamison orange
6   Anthony Allison  21       Anthony Allison  green
7       Ricky Smith  82            Rick Smith   blue
8      Smith Rickie  60          Smith Rickie  black
9   Johnny Williams  60                  <NA>   <NA>
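Some comments on the tricky cases / edge cases here:
- This is a left join, so `output_df` should have the same number of rows as the left data frame `df1`.
- The left join should not be confused by similar names: `Michael Gadson` --> `Mike Gadson`, not one of the other Mike names.
- The left join should not be confused by reversed names (`Ricky Smith` --> `Rick Smith`, not `Smith Rickie`).
- The left join should not be confused by a `III` name suffix or by extra spaces or symbols (`De Andre'` vs `DeAndre`).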
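
The suffix and symbol cases in the last bullet could be reduced before any fuzzy matching by normalizing both name columns. The following is only a minimal base-R sketch: `normalize_name` is a hypothetical helper, the suffix list is an assumption, and the character filter assumes plain ASCII names. It does not remove the internal space in `De Andre`, so that difference would still be left to the fuzzy matcher.

```r
# Hypothetical helper (not part of any package): build a comparable key from a name.
normalize_name <- function(x) {
  x <- tolower(x)
  # Drop apostrophes, digits and other symbols (assumes ASCII letters only).
  x <- gsub("[^a-z ]", "", x)
  # Drop a few common generational suffixes (assumed list, extend as needed).
  x <- gsub("\\b(jr|sr|ii|iii|iv)\\b", "", x, perl = TRUE)
  # Trim the ends and collapse repeated spaces.
  gsub("\\s+", " ", trimws(x))
}

normalize_name(c("De Andre' Jamison", "Nicholas Richards III"))
# [1] "de andre jamison"  "nicholas richards"
```

The normalized keys would be stored in extra columns and used only for matching, so the original `fullName` and `playerName` values stay intact in the output.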
Edit: I tried the following, which gives this output:
zed <- fuzzyjoin::stringdist_left_join(x=df1, y=df2, max_dist = 0.3, by=c('fullName'='playerName'), method = 'jaccard')
> zed
            fullName age            playerName  color
1     Michael Gadson  53           Mike Gadson    red
2      Mike Gardnero  43           Mike Gadson    red
3      Mike Gardnero  43         Mike Gardnero purple
4  Nicholas Richards  13 Nicholas Richards III yellow
5     Mikey Richards  53         Mike Richards  white
6    DeAndre Jamison  28     De Andre' Jamison orange
7    Anthony Allison  21       Anthony Allison  green
8      Richard Smith  82            Rich Smith   blue
9       Smith Rickie  60            Rich Smith   blue
10      Smith Rickie  60          Smith Rickie  black
11   Johnny Williams  60                  <NA>   <NA>
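It does a decent job, but it is still not perfect. Most notably, `Mike Gardnero` and `Smith Rickie` are duplicated when using `jaccard` with a `max_dist` of 0.3, because more than one row on the right-hand side meets the similarity threshold. The output should not contain these duplicates; one option might be to keep only the right-hand row with the highest similarity.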
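
The "keep only the most similar right-hand row" idea could be sketched roughly as follows. It uses the `distance_col` argument of `stringdist_left_join` to keep the computed distance, then keeps the closest candidate for each left-hand row. The column names `row_id` and `dist` are introduced here purely for illustration, and the data is a three-row subset of the tables above. This is an untuned sketch, not a claim that `jaccard` with 0.3 resolves every edge case listed earlier.

```r
library(dplyr)
library(fuzzyjoin)

# Small subset of the question's data, enough to reproduce the duplicate.
df1 <- data.frame(
  fullName = c("Michael Gadson", "Mike Gardnero", "Johnny Williams"),
  age      = c(53, 43, 60)
)
df2 <- data.frame(
  playerName = c("Mike Gadson", "Mike Gardnero", "John Kind"),
  color      = c("red", "purple", "grey")
)

best <- df1 %>%
  mutate(row_id = row_number()) %>%           # remember each left-hand row
  stringdist_left_join(
    df2,
    by = c("fullName" = "playerName"),
    method = "jaccard",
    max_dist = 0.3,
    distance_col = "dist"                     # keep the distance for ranking
  ) %>%
  arrange(row_id, dist) %>%                   # closest candidate first, NA last
  distinct(row_id, .keep_all = TRUE) %>%      # one output row per left-hand row
  select(-row_id, -dist)

best
```

Because unmatched left-hand rows carry `NA` in `dist` and `arrange()` sorts `NA` last, they survive the `distinct()` step, so `best` keeps the same number of rows as `df1`. Ties in distance are broken by whichever candidate happens to come first, which may or may not be the desired rule.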