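My question is essentially the same as Lyngbakr's: I have two very large datasets that I need to join on an exact match in some columns and a fuzzy match in others. I want the match to be exact on the date-of-birth column DOB and the gender column gender, but only "similar" on the names column.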
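By "similar" I mean being able to apply a specific set of criteria, for example:
- OSA distance <= 2 & JW distance <= 0.2 & ...
If that is not possible, requiring only OSA distance <= 2 would already be a big step in the right direction.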
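To make these criteria concrete, here is a minimal sketch of how I would compute them for pairs of names, assuming the stringdist package with method = "osa" and method = "jw". The name pairs are taken from the example data further down, and the 0.2 threshold is only the illustrative value from the list above.

```r
library(stringdist)

# Name pairs taken from the example data (left table vs. right table)
a <- c("Borys_Holland", "Tonisha_Moran", "Leslie_Davis", "John_Moreno")
b <- c("Boris_Holland", "Tonisha_Morann", "Lesli_David", "Sally_Kemper")

# Element-wise distances between a[i] and b[i]
osa <- stringdist(a, b, method = "osa")  # optimal string alignment distance
jw  <- stringdist(a, b, method = "jw")   # Jaro-Winkler distance in [0, 1]

# A pair counts as "similar" only if every criterion holds
similar <- osa <= 2 & jw <= 0.2
data.frame(a, b, osa, jw, similar)
```

The first three pairs differ by at most two edits and should pass; the last pair is an unrelated name and should fail.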
When I try to run Lyngbakr's answer on my own data, I get this error:
Error in bmerge(i, x, leftcols, rightcols, roll, rollends, nomatch, mult, :
roll='nearest' can't be applied to a character column, yet.
Here is how I tried to implement Lyngbakr's answer:
# copy left data
df <- base
# rename columns
names(df)[c(1, 3)] <- c("ID", "loc")
# copy right data
df_alt <- name_unique
# rename columns
names(df_alt)[c(1, 3)] <- c("ID", "loc")
# implement Lyngbakr's answer with stringdist() instead of abs()
df_alt[df
, on = .(ID, loc)
, roll = "nearest"
, .(ID, loc.x = i.loc, loc.y = x.loc, value, delta = stringdist(i.loc, x.loc))]
So here I am just attempting a left join with an exact match on DOB and a fuzzy match on names, which I renamed to ID and loc, respectively, in both datasets.
Data
Here is a small example of my data:
library(data.table)
library(tidyverse)
base <- data.table(DOB = c("1956-01-01", "1994-05-13", "2001-07-03",
"1998-04-02", "1991-05-28", "2001-09-15",
"1999-04-05", "2001-04-10", "1996-01-14",
"2000-01-19") %>% as.Date,
gender = c("F", "F", "M", "F", "M", "F", "M", "F",
"F", "F"),
names = c("Regina_Douglas", "Tamar_Hurley", "John_Moreno",
"Josephine_Bone_O' Brian", "Borys_Holland",
"Tonisha_Moran", "Jarrad_Kaur", "Abbi_Kane",
"Leslie_Davis", "Blossom_Povey"),
row = 1:10)
name_unique <-
data.table(s_DOB = c("1941-01-09", "1976-09-22", "1996-08-07",
"1993-09-24", "1991-05-28", "2001-09-15",
"1969-03-21", "1939-06-25", "1996-01-14",
"1978-07-27") %>% as.Date,
s_gen = c("M", "M", "F", "M", "M", "F", "M", "F", "F",
"F"),
s_name = c("Brandon_Hampton", "John_Moreno", "Sally_Kemper",
"Nickolas_Bolden", "Boris_Holland", "Tonisha_Morann",
"Bryant_Lopez", "Kathryn_Krebs", "Lesli_David",
"Kelley__Owens"),
s_identif = c(178, 184, 136, 188, 198, 133, 197,
143, 200, 132))
The desired output is as follows:
DOB         gender  names                    row  s_identif
1956-01-01  F       Regina_Douglas             1         NA
1994-05-13  F       Tamar_Hurley               2         NA
2001-07-03  M       John_Moreno                3         NA
1998-04-02  F       Josephine_Bone_O' Brian    4         NA
1991-05-28  M       Borys_Holland              5        198
2001-09-15  F       Tonisha_Moran              6        133
1999-04-05  M       Jarrad_Kaur                7         NA
2001-04-10  F       Abbi_Kane                  8         NA
1996-01-14  F       Leslie_Davis               9        200
2000-01-19  F       Blossom_Povey             10         NA
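To pin down the rule behind this output, here is a minimal sketch on a four-row subset of the example data. It does an exact left join on DOB and gender with merge(), then blanks s_identif wherever the OSA distance between the names exceeds 2. This is an assumption about one way to express the rule, not a tested solution: I have not checked how it scales, and a base row with several candidates sharing the same DOB and gender would appear more than once.

```r
library(data.table)
library(stringdist)

# Four-row subset of the example data, enough to show a match and a non-match
base <- data.table(
  DOB    = as.Date(c("1991-05-28", "2001-09-15", "1996-01-14", "2001-07-03")),
  gender = c("M", "F", "F", "M"),
  names  = c("Borys_Holland", "Tonisha_Moran", "Leslie_Davis", "John_Moreno"),
  row    = c(5L, 6L, 9L, 3L)
)
name_unique <- data.table(
  s_DOB     = as.Date(c("1991-05-28", "2001-09-15", "1996-01-14", "1976-09-22")),
  s_gen     = c("M", "F", "F", "M"),
  s_name    = c("Boris_Holland", "Tonisha_Morann", "Lesli_David", "John_Moreno"),
  s_identif = c(198, 133, 200, 184)
)

# Exact left join: only rows sharing DOB and gender become candidate pairs
cand <- merge(base, name_unique,
              by.x = c("DOB", "gender"), by.y = c("s_DOB", "s_gen"),
              all.x = TRUE)

# Fuzzy step: keep the identifier only when the names are within OSA distance 2
cand[, osa := stringdist(names, s_name, method = "osa")]
cand[is.na(osa) | osa > 2, s_identif := NA_real_]

result <- cand[order(row), .(DOB, gender, names, row, s_identif)]
result
```

Because the distance is computed only for pairs that already agree on DOB and gender, the number of string comparisons is bounded by the size of the exact join rather than by the full cross product of the two tables.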
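I also tried chameau13's function, but I could not get it to work correctly, and since the function is undocumented I do not know how to use it. As he mentions in his post, the fuzzy_join() and fuzzy_left_join() functions are inefficient and would need more than 100 TB of RAM to run on the full datasets, so a different solution is needed.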