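I have two databases that I want to merge. From this link: Doing a "fuzzy" and non-fuzzy, many-to-one merge with data.table, I know that I can merge these data.tables on the nearest available year when there is no direct match, as follows: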

  library(data.table)
  dfA <- fread("
  A   B   C   D   E   F   G   Z   iso   year   matchcode
  1   0   1   1   1   0   1   0   NLD   2010   NLD2010
  2   1   0   0   0   1   0   1   NLD   2014   NLD2014
  3   0   0   0   1   1   0   0   AUS   2010   AUS2010
  4   1   0   1   0   0   1   0   AUS   2006   AUS2006
  5   0   1   0   1   0   1   1   USA   2008   USA2008
  6   0   0   1   0   0   0   1   USA   2010   USA2010
  7   0   1   0   1   0   0   0   USA   2012   USA2012
  8   1   0   1   0   0   1   0   BLG   2008   BLG2008
  9   0   1   0   1   1   0   1   BEL   2008   BEL2008
  10  1   0   1   0   0   1   0   BEL   2010   BEL2010
  11  0   1   1   1   0   1   0   NLD   2010   NLD2010
  12  1   0   0   0   1   0   1   NLD   2014   NLD2014
  13  0   0   0   1   1   0   0   AUS   2010   AUS2010
  14  1   0   1   0   0   1   0   AUS   2006   AUS2006
  15  0   1   0   1   0   1   1   USA   2008   USA2008
  16  0   0   1   0   0   0   1   USA   2010   USA2010
  17  0   1   0   1   0   0   0   USA   2012   USA2012
  18  1   0   1   0   0   1   0   BLG   2008   BLG2008
  19  0   1   0   1   1   0   1   BEL   2008   BEL2008
  20  1   0   1   0   0   1   0   BEL   2010   BEL2010",
  header = TRUE)

  dfB <- fread("
  A   B   C   D   H   I   J   K   iso   year   matchcode
  1   0   1   1   1   0   1   0   NLD   2009   NLD2009
  2   1   0   0   0   1   0   1   NLD   2014   NLD2018
  3   0   0   0   1   1   0   0   AUS   2011   AUS2011
  4   1   0   1   0   0   1   0   AUS   2007   AUS2007
  5   0   1   0   1   0   1   1   USA   2007   USA2007
  6   0   0   1   0   0   0   1   USA   2010   USA2010
  7   0   1   0   1   0   0   0   USA   2013   USA2013
  8   1   0   1   0   0   1   0   BLG   2007   BLG2007
  9   0   1   0   1   1   0   1   BEL   2009   BEL2009
  10   1   0   1   0   0   1   0  BEL   2012   BEL2012",
  header = TRUE)

  # rename the matchcode, iso and year columns so both tables stay distinguishable
  setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
  setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))

  # store the column order to restore it at the end
  namesA <- as.character(names(dfA))
  namesB <- as.character(setdiff(names(dfB), names(dfA)))
  colorder <- c(namesA, namesB)

  # create columns to join on
  dfA[, `:=`(iso.join = isoA, year.join = yearA)]
  dfB[, `:=`(iso.join = isoB, year.join = yearB)]

  # perform left join, rolling to the nearest year
  result <- dfB[dfA, on = c("iso.join", "year.join"), roll = "nearest"]

  # drop columns that are not needed
  result[, grep("^i\\.", names(result)) := NULL]
  result[, grep("join$", names(result)) := NULL]

  # set column order
  setcolorder(result, colorder)

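I have two problems with this.

1) EDIT: This issue was the result of a typo.

2) NLD 2014 in dfA is matched with NLD 2018 in dfB. If I consider a difference of 4 years too large and want to limit it to two years, how should I do that?

How should I restrict the number of years allowed between dfA and dfB?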

1 Answer

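You have two options:

  1. Use roll = 2 or roll = -2, which requires the nearest match to be within 2 years in one direction.
  2. Add two more columns to dfA to make it an explicit non-equi join.

  # perform left join, carrying dfB values forward by at most 2 years
  result <- dfB[dfA, on = c("iso.join", "year.join"), roll = 2]

  # or carry dfB values backward by at most 2 years
  result <- dfB[dfA, on = c("iso.join", "year.join"), roll = -2]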
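
To see what "one direction" means in practice, here is a minimal sketch on toy data. The tables `x` and `i` and the column `valB` are made up for illustration and are not the question's data. A positive `roll` carries a value of `x` forward to later years in `i`; a negative `roll` carries it backward to earlier years.

```r
library(data.table)

# Hypothetical toy tables, not the question's data
x <- data.table(iso = "NLD",
                year = c(2010L, 2018L),
                valB = c("b2010", "b2018"))
i <- data.table(iso = "NLD",
                year = c(2009L, 2011L, 2014L, 2017L))

# roll = 2: use the last x year at or before the i year, if it is close enough
fwd <- x[i, on = .(iso, year), roll = 2]

# roll = -2: use the next x year at or after the i year, if it is close enough
bwd <- x[i, on = .(iso, year), roll = -2]

fwd$valB  # NA "b2010" NA NA
bwd$valB  # "b2010" NA NA "b2018"
```

Here 2011 only finds a match when rolling forward from 2010, while 2009 and 2017 only find one when rolling backward. 2014 is four years away from both x years, so it stays NA either way.
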

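The non-equi join requires some extra work on your part because it does not take the roll = 'nearest' argument, so you would need to either use mult = 'first' or apply a filter in a subsequent step.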

  dfA[, `:=`(min_year.join = yearA - 2,
             max_year.join = yearA + 2)]

  result <- dfB[dfA,
                on = .(iso.join,
                       year.join <= max_year.join,
                       year.join >= min_year.join)
                #, mult = 'first'
                ]

  #drop columns that are not needed
  result[, grep("^i\\.", names(result)) := NULL ]
  result[, grep("join", names(result)) := NULL ] #removed $

  #set column order
  setcolorder(result, colorder)
  result
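
The follow-up filter could look roughly like the sketch below. It uses toy data; the tables `a` and `b` and the columns `idA`, `lo`, `hi` and `valB` are hypothetical names, not part of the question. The idea is to keep every candidate within the two-year window and then keep the closest one per row of the left table. Note that mult = 'first' returns the first match in the row order of the right table, which is not necessarily the nearest year.

```r
library(data.table)

# Hypothetical toy tables, not the question's data
a <- data.table(idA = 1:3, iso = "NLD", yearA = c(2010L, 2014L, 2025L))
b <- data.table(iso = "NLD",
                yearB = c(2009L, 2012L, 2015L),
                valB = c("b2009", "b2012", "b2015"))

# window bounds on the left table, and a copy of the year to join on
a[, `:=`(lo = yearA - 2L, hi = yearA + 2L)]
b[, year.join := yearB]

# non-equi left join: every b row whose year falls inside [lo, hi]
cand <- b[a,
          on = .(iso, year.join >= lo, year.join <= hi),
          .(idA, iso, yearA, yearB, valB)]

# sort by distance within each idA, then keep the first (closest) row;
# rows of a without any candidate keep NA in yearB and valB
cand <- cand[order(idA, abs(yearB - yearA))]
nearest <- unique(cand, by = "idA")

nearest$yearB  # 2009 2015 NA
```

Ties, where two candidates are equally far away, are resolved by whichever comes first after sorting, so add a tie-breaker to `order()` if that matters for your data.
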
answered 2019-11-24T13:02:32.393