I need to match two datasets on three variables. Two of the three variables contain no misspellings (by design), so fuzzy matching is required only for the third.
The standard fuzzy join causes problems because it applies fuzzy matching to all three variables.
Is there a way to specify which of the three should be fuzzy matched and which exact-matched?
Reproducible example:
library(fuzzyjoin)  # provides stringdist_join()

dataset_1 <- setNames(data.frame(c(1995,1996,1995,1996),c("AA","AA","BB","BB"),c("AAAA","AAAA","BBBB","BBBB")), c("var_1", "var_2", "var_3"))
dataset_2 <- setNames(data.frame(c(1995,1996,1995,1996),c("AA","AA","BB","BB"),c("AAAA","AAAA","BBBB","BBBC"),c("A","B","C","D")), c("var_1", "var_2", "var_3","var_4"))

# Current attempt: the same fuzzy method is applied to every column listed in `by`
merged <- stringdist_join(dataset_1, dataset_2,
                          by = c("var_1", "var_2", "var_3"),
                          max_dist = 2,
                          method = "soundex",
                          mode = "full",
                          ignore_case = FALSE)
Ideal result:
merged <- setNames(data.frame(c(1995,1996,1995,1996),c("AA","AA","BB","BB"),c("AAAA","AAAA","BBBB","BBBB"),c("A","B","C","D")), c("var_1", "var_2", "var_3","var_4"))
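
One possible workaround, sketched in base R only: join exactly on var_1 and var_2 with merge(), then keep only the candidate pairs whose var_3 values lie within an edit-distance threshold computed with utils::adist(). The threshold max_dist = 1 and the suffixed column names var_3_1 / var_3_2 are assumptions made for illustration, and this behaves like an inner join, so unmatched rows are dropped (unlike mode = "full").

```r
# Sample data, same values as in the example above
dataset_1 <- data.frame(
  var_1 = c(1995, 1996, 1995, 1996),
  var_2 = c("AA", "AA", "BB", "BB"),
  var_3 = c("AAAA", "AAAA", "BBBB", "BBBB"),
  stringsAsFactors = FALSE
)
dataset_2 <- data.frame(
  var_1 = c(1995, 1996, 1995, 1996),
  var_2 = c("AA", "AA", "BB", "BB"),
  var_3 = c("AAAA", "AAAA", "BBBB", "BBBC"),
  var_4 = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

max_dist <- 1  # assumed tolerance for var_3 (Levenshtein distance)

# Step 1: exact join on var_1 and var_2; both var_3 columns are kept with suffixes
candidates <- merge(dataset_1, dataset_2,
                    by = c("var_1", "var_2"),
                    suffixes = c("_1", "_2"))

# Step 2: row-wise edit distance between the two var_3 columns
d <- mapply(function(a, b) utils::adist(a, b)[1, 1],
            candidates$var_3_1, candidates$var_3_2)

# Step 3: keep pairs within the tolerance and retain dataset_1's spelling of var_3
merged <- candidates[d <= max_dist, c("var_1", "var_2", "var_3_1", "var_4")]
names(merged)[3] <- "var_3"
rownames(merged) <- NULL
```

If staying within fuzzyjoin is preferred, the documentation of fuzzy_join() describes a match_fun argument that can be a list with one function per column in by, which may allow an exact comparison for var_1 and var_2 and a string-distance comparison for var_3 in a single call.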