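Ultimately, I think the approach you're looking for is to let max_dist be a vector of distances, so that you could write stringdist_inner_join(..., max_dist=c(0,2)). Unfortunately, although this has been requested (in 2017: https://github.com/dgrtwo/fuzzyjoin/issues/36 and https://github.com/dgrtwo/fuzzyjoin/issues/21), it does not appear to have been implemented yet.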
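If you can afford a larger intermediate join product, one workaround is to allow the fuzzy match on every column and then filter out the rows where decade did not join exactly.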
Lacking your data, I'll demonstrate using ggplot2::diamonds. Here, I want the normal stringdist behavior on cut and an exact match on clarity.
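library(fuzzyjoin)  # provides stringdist_inner_join()

# Lookup table with deliberately misspelled cut values
d <- data.frame(cut = c("Idea", "Premiums", "Premioom", "VeryGood", "VeryGood", "Faiir"),
                clarity = rep(c("SI1", "SI2"), 3),
                type = 1:6)
data("diamonds", package = "ggplot2")
diamonds <- diamonds[1:10, ]
# Fuzzy-join on both columns (default max_dist = 2); exact matching is enforced afterwards
joined <- stringdist_inner_join(diamonds, d, by = c("cut", "clarity"))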
joined
# # A tibble: 8 x 13
#   carat cut.x     color clarity.x depth table price     x     y     z cut.y    clarity.y  type
#   <dbl> <ord>     <ord> <ord>     <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>    <chr>     <int>
# 1 0.23  Ideal     E     SI2        61.5    55   326  3.95  3.98  2.43 Idea     SI1           1
# 2 0.21  Premium   E     SI1        59.8    61   326  3.89  3.84  2.31 Premiums SI2           2
# 3 0.21  Premium   E     SI1        59.8    61   326  3.89  3.84  2.31 Premioom SI1           3
# 4 0.290 Premium   I     VS2        62.4    58   334  4.2   4.23  2.63 Premiums SI2           2
# 5 0.26  Very Good H     SI1        61.9    55   337  4.07  4.11  2.53 VeryGood SI2           4
# 6 0.26  Very Good H     SI1        61.9    55   337  4.07  4.11  2.53 VeryGood SI1           5
# 7 0.22  Fair      E     VS2        65.1    61   337  3.87  3.78  2.49 Faiir    SI2           6
# 8 0.23  Very Good H     VS1        59.4    61   338  4     4.05  2.39 VeryGood SI1           5
subset(joined, clarity.x == clarity.y)
# # A tibble: 2 x 13
#   carat cut.x     color clarity.x depth table price     x     y     z cut.y    clarity.y  type
#   <dbl> <ord>     <ord> <ord>     <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>    <chr>     <int>
# 1 0.21  Premium   E     SI1        59.8    61   326  3.89  3.84  2.31 Premioom SI1           3
# 2 0.26  Very Good H     SI1        61.9    55   337  4.07  4.11  2.53 VeryGood SI1           5
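
If the intermediate product from the fuzzy join is a concern, the same idea can be sketched in the opposite order: join exactly on clarity first, then keep only the rows whose cut values are close enough. This is a rough base-R sketch, not what fuzzyjoin does internally. It assumes utils::adist (plain Levenshtein distance) is an acceptable stand-in for the "osa" method that stringdist uses by default, and x is a small hypothetical stand-in for diamonds[1:10, ] so that the snippet runs without extra packages.

```r
# Lookup table from the example above
d <- data.frame(cut = c("Idea", "Premiums", "Premioom", "VeryGood", "VeryGood", "Faiir"),
                clarity = rep(c("SI1", "SI2"), 3),
                type = 1:6,
                stringsAsFactors = FALSE)

# Hypothetical stand-in for the left-hand table (not the real diamonds data)
x <- data.frame(cut = c("Ideal", "Premium", "Very Good", "Fair"),
                clarity = c("SI2", "SI1", "SI1", "VS2"),
                price = c(326, 326, 337, 337),
                stringsAsFactors = FALSE)

# Step 1: exact join on clarity only; the clashing cut columns get .x/.y suffixes
m <- merge(x, d, by = "clarity", suffixes = c(".x", ".y"))

# Step 2: row-wise Levenshtein distance between the two cut columns
dists <- mapply(function(a, b) utils::adist(a, b)[1, 1], m$cut.x, m$cut.y)

# Step 3: keep pairs within the same tolerance as the default max_dist = 2
result <- m[dists <= 2, ]
result
```

With this toy input, only the Premium/Premioom and Very Good/VeryGood pairs survive (type 3 and 5), which agrees with the filtered output above. The exact join still produces every pair that shares a clarity value, so whether this is actually cheaper depends on how selective the exact column is in your data.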