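I have a list of university names that contains typos and inconsistencies. I need to match them against an official list of university names so that I can link my data together.

I know fuzzy matching/joining is the way to go, but I'm a bit lost on the correct method. Any help would be greatly appreciated.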

# As posted, some names contain a line break inside the string, written here as "\n".
d <- data.frame(name = c("University of New Yorkk", "The University of South\n Carolina",
                         "Syracuuse University", "University of South Texas",
                         "The University of No Carolina"),
                score = c(1, 3, 6, 10, 4))

y <- data.frame(name = c("University of South Texas", "The University of North\n Carolina",
                         "University of South Carolina", "Syracuse\n University",
                         "University of New York"),
                distance = c(100, 400, 200, 20, 70))

I would like an output that matches them up as closely as possible:

matched<-data.frame(name=c("University of New Yorkk", "The University of South Carolina", 
"Syracuuse University","University of South Texas","The University of No Carolina"), 
correctmatch = c("University of New York", "University of South Carolina", 
"Syracuse University","University of South Texas", "The University of North Carolina"))

1 Answer

I use adist() for things like this, with a small wrapper function called closest_match() that compares a value against a set of "good/allowed" values.

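library(magrittr) # for the %>% pipe

# Return the name(s) in good_values with the smallest edit distance to bad_value.
closest_match <- function(bad_value, good_values) {
  # adist() gives a 1 x n matrix; flatten it and label each distance with its candidate
  distances <- adist(bad_value, good_values, ignore.case = TRUE) %>%
    as.numeric() %>%
    setNames(good_values)

  # Keep every candidate tied for the minimum distance
  distances[distances == min(distances)] %>%
    names()
}

# Look up the closest official name for each name in d
sapply(d$name, function(x) closest_match(x, y$name)) %>%
  setNames(d$name)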

University of New Yorkk The University of South\n Carolina               Syracuuse University 
"University of New York"     "University of South Carolina"           "University of New York" 
University of South Texas      The University of No Carolina 
"University of South Texas"     "University of South Carolina" 
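
To get from this named vector to the kind of data frame the question asks for, something like the following might work. It is only a sketch under a few assumptions: the names are the question's data with the line breaks inside the strings removed, stringsAsFactors = FALSE is set explicitly, and dist_matrix, best, edit_distance and linked are names made up for illustration.

```r
# Sketch only: the question's data with line breaks removed from the names (an assumption).
d <- data.frame(
  name = c("University of New Yorkk", "The University of South Carolina",
           "Syracuuse University", "University of South Texas",
           "The University of No Carolina"),
  score = c(1, 3, 6, 10, 4),
  stringsAsFactors = FALSE
)
y <- data.frame(
  name = c("University of South Texas", "The University of North Carolina",
           "University of South Carolina", "Syracuse University",
           "University of New York"),
  distance = c(100, 400, 200, 20, 70),
  stringsAsFactors = FALSE
)

# Distance matrix: one row per name in d, one column per name in y.
dist_matrix <- adist(d$name, y$name, ignore.case = TRUE)

# For each row, take the index of the smallest distance (ties go to the first candidate).
best <- apply(dist_matrix, 1, which.min)

# Hypothetical output table: original name, closest official name, and how far apart they are.
matched <- data.frame(
  name = d$name,
  correctmatch = y$name[best],
  edit_distance = dist_matrix[cbind(seq_along(best), best)],
  stringsAsFactors = FALSE
)

# Link the two data sets through the matched official name.
linked <- merge(merge(d, matched, by = "name"), y,
                by.x = "correctmatch", by.y = "name")
```

Keeping edit_distance in the result makes it easier to review doubtful matches by hand. The nearest name is not always the intended one: "The University of South Carolina" is only two substitutions away from "The University of North Carolina", but four deletions away from "University of South Carolina".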

adist() uses the Levenshtein distance to measure how similar two strings are.
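
As a small illustration of what those distances look like, the sketch below calls adist() on a few names taken from the question. The variable names are hypothetical, and the values in the comments come from counting single-character edits by hand.

```r
# adist() returns a matrix of edit distances: rows correspond to x, columns to y.
extra_u <- adist("Syracuuse University", "Syracuse University")
# one extra "u" to delete, so the distance is 1

extra_k <- adist("University of New Yorkk", "University of New York")
# one extra "k" to delete, so the distance is 1

# ignore.case = TRUE treats upper- and lower-case letters as equal.
same_but_case <- adist("SYRACUSE UNIVERSITY", "Syracuse University",
                       ignore.case = TRUE)
# only the case differs, so the distance is 0
```

A smaller number means the two strings are closer, which is why closest_match() keeps the candidate with the minimum distance.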

answered 2018-10-30T20:09:07.380