1

I need to match two datasets on three variables. Two of the three variables do not present misspellings (by design). The fuzzy match is required only for the third variable.

The standard fuyyzmerge generate some issues by fuzzy-joining all three variables.

Is there a way to specify which of the three should be fuzzy matched and which exact-matched?

reproducible example:

dataset_1 <- setNames(data.frame(c(1995,1996,1995,1996),c("AA","AA","BB","BB"),c("AAAA","AAAA","BBBB","BBBB")), c("var_1", "var_2", "var_3"))
dataset_2 <- setNames(data.frame(c(1995,1996,1995,1996),c("AA","AA","BB","BB"),c("AAAA","AAAA","BBBB","BBBC"),c("A","B","C","D")), c("var_1", "var_2", "var_3","var_4"))


merged <- stringdist_join(dataset_1, dataset_2, 
                                  by=c("var_1","var_2","var_3"),  
                                  max_dist = 2, 
                                  method = c("soundex"),
                                  mode = "full", 
                                  ignore_case = FALSE)

Ideal result:

merged <- setNames(data.frame(rep(1995,4),c("AA","AA","BB","BB"),c("AAAA","AAAA","BBBB","BBBB"),c("A","B","C","D")), c("var_1", "var_2", "var_3","var_4"))
4

1 回答 1

0

stringdist_join is a wrapper around fuzzy_join, and fuzzy_join has a match_fun argument that can either be a unique function or a list of functions as long as your by argument, so we can use fuzzy_full_join (which is just fuzzy_join with mode= "full"):

library(fuzzyjoin)
res <- fuzzy_full_join(dataset_1, dataset_2, 
                by=c("var_1","var_2","var_3"),
                list(`==`, `==`, function(x,y) stringdist::stringdist(x,y, "soundex") <= 2))
res
#   var_1.x var_2.x var_3.x var_1.y var_2.y var_3.y var_4
# 1    1995      AA    AAAA    1995      AA    AAAA     A
# 2    1996      AA    AAAA    1996      AA    AAAA     B
# 3    1995      BB    BBBB    1995      BB    BBBB     C
# 4    1996      BB    BBBB    1996      BB    BBBC     D

Because of the nature of fuzzy matching, values are not generally the same on the lhs and rhs, so we end up with two sets of by columns, if you want to preserve only the lhs we can do :

library(dplyr)
res %>% 
  select(-ends_with(".y")) %>%
  rename_all(~sub("\\.x$","",.))

#   var_1 var_2 var_3 var_4
# 1  1995    AA  AAAA     A
# 2  1996    AA  AAAA     B
# 3  1995    BB  BBBB     C
# 4  1996    BB  BBBB     D
于 2019-03-20T13:47:05.827 回答