r - R - String distance with weighted words

Question

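Is there a way to weight specific words using the stringdist package or another string distance package?

My strings often share a common word such as "city" or "university". As a result they get relatively close string distance matches even though they are very different (e.g. "University of Utah" and "University of Ohio", or "City of XYZ" and "City of ABC").

I know that the weights of the operations (deletion, insertion, substitution) can differ depending on the algorithm, but I have not seen a way to include a list of words paired with weights. Any ideas?

One option, of course, is to use str_remove on those common words before matching, but that has the problem that "County of XYZ" and "City of XYZ" would then look identical.

Example:

"University of Utah" and "University of Ohio"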

stringdist("University of Utah", "University of Ohio") / max(nchar("University of Utah"), nchar("University of Ohio"))

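The normalized string distance is 0.22222, which is relatively low. In reality, though, the normalized OSA string distance between "Utah" and "Ohio" alone is 1 (4 / 4). The low score comes from dividing the same 4 edits by the full string length:

4 / 18 = 0.222222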
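The arithmetic above can be checked with a small sketch. It uses base R's utils::adist (Levenshtein distance) rather than stringdist so that it runs without extra packages; for these particular inputs no transpositions are involved, so the OSA result should be the same. The helper name norm_dist is hypothetical.

```r
# Normalized edit distance: raw distance divided by the longer string's length.
# norm_dist is a hypothetical helper name, not part of any package.
norm_dist <- function(a, b) {
  d <- as.numeric(utils::adist(a, b))  # base R Levenshtein distance
  d / max(nchar(a), nchar(b))
}

full <- norm_dist("University of Utah", "University of Ohio")  # 4 edits over 18 chars
core <- norm_dist("Utah", "Ohio")                              # 4 edits over 4 chars
print(c(full = full, core = core))
```

The shared "University of " prefix contributes nothing to the numerator but 14 characters to the denominator, which is why the full-string score looks deceptively small.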

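However, removing "University of" and other common strings such as "State" beforehand would lead to a match between "University of Ohio" and "Ohio State".

Weighting strings such as "University of" in the calculation, for example by counting only 0.25 of the actual number of characters in the normalizing denominator, would reduce the influence of these common substrings, i.e.:

4 / (18 * 0.25) = 0.888888.

It gets fuzzy when we consider doing the same for the State vs. University example: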

stringdist("University of Ohio", "Ohio State")

yields 16. But taking 0.25 of the denominator:

16 / (18 * .25) = 3.55555.
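The two ratios can be reproduced with plain arithmetic. This is only a sketch of the scaling described in the question: scaled_ratio is a hypothetical helper, w = 0.25 is the question's example weight, and the raw distances 4 and 16 are taken as reported above rather than recomputed.

```r
# Ratio from the question: raw edit distance over a down-weighted length.
# scaled_ratio is a hypothetical helper; w = 0.25 is the question's example weight.
scaled_ratio <- function(d, len, w = 0.25) d / (len * w)

utah_ohio  <- scaled_ratio(4, 18)    # "University of Utah" vs "University of Ohio"
ohio_state <- scaled_ratio(16, 18)   # "University of Ohio" vs "Ohio State"
print(round(c(utah_ohio = utah_ohio, ohio_state = ohio_state), 4))
```

Because the whole denominator is scaled, the second ratio exceeds 1 and can no longer be read as a normalized distance. A per-word or per-phrase weight would be needed to keep the score bounded.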

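Perhaps a better option is to use LCS but down-weight substrings that match a list of common strings. So even though "University of Utah" and "University of Ohio" share a 14-character common substring, if "University of" appears in this list, its LCS value would be reduced.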
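A rough sketch of that idea in base R follows. It searches for the longest contiguous common substring with a simple dynamic-programming loop written for illustration (it is not the stringdist implementation), then counts characters covered by a listed common phrase at a reduced weight. The helper names, the phrase list, and the 0.25 weight are all assumptions.

```r
# Longest contiguous common substring of two strings (illustrative DP, base R only).
longest_common_substring <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  best <- 0L
  end <- 0L
  prev <- integer(length(y) + 1)       # run lengths ending at the previous char of a
  for (i in seq_along(x)) {
    cur <- integer(length(y) + 1)
    for (j in seq_along(y)) {
      if (x[i] == y[j]) {
        cur[j + 1] <- prev[j] + 1L     # extend the run ending at x[i-1], y[j-1]
        if (cur[j + 1] > best) {
          best <- cur[j + 1]
          end <- i
        }
      }
    }
    prev <- cur
  }
  if (best == 0L) "" else substr(a, end - best + 1L, end)
}

# Length of s where characters covered by a common phrase count only w each.
discounted_length <- function(s, common, w = 0.25) {
  rest <- s
  for (p in common) rest <- gsub(p, "", rest, fixed = TRUE)
  removed <- nchar(s) - nchar(rest)
  nchar(rest) + w * removed
}

shared <- longest_common_substring("University of Utah", "University of Ohio")
discounted <- discounted_length(shared, common = c("University of"))
print(list(shared = shared, raw = nchar(shared), discounted = discounted))
```

With this weighting the 14 shared characters count as 4.25, so the common prefix contributes far less to any similarity score built on it.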

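Edit: another idea

I have another idea: using the tidytext package and unnest_tokens, one can generate a list of the most common words across all the strings being matched. It might be interesting to down-weight these words relative to how common they are in the dataset, since the more common they are, the less discriminating power they have...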
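One way to sketch this frequency-based weighting is with inverse document frequency (IDF) over word tokens. The example below uses base R's strsplit as a stand-in for tidytext::unnest_tokens, and combines the weights with a weighted Jaccard distance over words; that distance is a swapped-in technique chosen for illustration, not an edit distance. The sample names and helper names are hypothetical.

```r
# Hypothetical sample data, loosely based on the names in the question.
names_vec <- c("University of Utah", "University of Ohio", "Ohio State",
               "Utah State", "City of XYZ", "City of ABC")

# Lower-case word tokens per name (base R stand-in for tidytext::unnest_tokens).
tokens <- lapply(strsplit(tolower(names_vec), "[^a-z0-9]+"), unique)

# Document frequency: in how many names does each word occur?
doc_freq <- table(unlist(tokens))

# Inverse document frequency: the more names a word appears in, the lower its weight.
idf <- setNames(as.numeric(log(length(names_vec) / doc_freq)), names(doc_freq))

# Weighted Jaccard distance between two token sets, using weights w.
weighted_jaccard_dist <- function(a, b, w) {
  shared <- intersect(a, b)
  all_words <- union(a, b)
  1 - sum(w[shared]) / sum(w[all_words])
}

d_uni <- weighted_jaccard_dist(tokens[[1]], tokens[[2]], idf)  # Utah vs Ohio
print(round(idf, 3))
print(d_uni)
```

In this tiny sample "of" receives the lowest weight because it appears in most names, so sharing it moves two names less close together than sharing a rarer word would. On a real dataset the weights would come from the full list of strings being matched.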

score 2 · Accepted Answer

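One idea is to regroup similar terms before computing the string distance, so that "Ohio State" and "University of Ohio" are never compared with each other at all.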

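# Packages (tm must also be installed for tm::stopwords())
library(purrr)       # map(), map_chr(), modify_at(), and the %>% pipe
library(stringdist)  # stringdistmatrix()

# Strings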
v1 <- c("University of Ohio", "University of Utah", "Ohio State", "Utah State",
        "University Of North Alabama", "University of South Alabama", "Alabama State",
        "Arizona State University Polytechnic", "Arizona State University Tempe", 
        "Arizona State", "Metropolitan State University of Denver", 
        "Metropolitan University Of The State Of Denver", "University Of Colorado", 
        "Western State Colorado University", "The Dalton College", "The Colorado State", 
        "The Dalton State College", "Columbus State University", "Dalton College")

# Remove stop words
v2 <- strsplit(v1, " ") %>% 
  map_chr(~ paste(.x[!tolower(.x) %in% tm::stopwords()], collapse = " "))

# Define groups
groups <- c(Group1 = "state", 
            Group2 = "university", 
            Group3 = "college",
            # Groups 4-5 must contain BOTH terms
            Group4 = ".*(state.*university|university.*state).*", 
            Group5 = ".*(state.*college|college.*state).*")

# Iterate over the list and assign groups
dat <- list(words = v2, pattern = groups)
lst <- dat$pattern %>% map(~ grepl(.x, dat$words, ignore.case = TRUE))

lst %>%
  # Make sure groups 1 to 3 and 4-5 are mutually exclusive
  # i.e: if a string contains "state" AND "university" (Group4), it must not be in Group1
  modify_at(c("Group1", "Group2", "Group3"), 
            ~ ifelse(lst$Group4 & .x | lst$Group5 & .x, !.x, .x)) %>%
  # Return matches from strings 
  map(~ v2[.x]) %>%
  # Compute the stringdistance for each group
  map(~ stringdistmatrix(.x, .x)) ## Maybe using method = "jw" ?
