I am trying to join two tables on a product code column, where the key in one table may be a substring of the key in the other.

Event
id  date  ProductId  quantity
a   xyz   1234567    30
a   abc   5826811    20
b   def   3619100    10
b   ghi   9268420    50

ProductDimension
code     name  type
234-567  p1    c1
826-81   p2    c2
61-9100  p3    c3  


The result should be:
eventAU
id  date  ProductId  quantity  name  type
a   xyz   1234567    30        p1    c1
a   abc   5826811    20        p2    c2
b   def   3619100    10        p3    c3

Taking a hint from this question, I am trying a fuzzy join as follows:

ProductDimension$regex <- gsub("-", "", ProductDimension$code)

eventTbl <- tbl_df(Events)
prodcutTbl <- tbl_df(ProductDimension)

eventsAU <- regex_left_join(eventTbl , prodcutTbl , by = c(ProductId = "regex"))

But I get the following error:

Error: All columns in a tibble must be 1d or 2d objects: * Column `col` is NULL

1 Answer

One option using dplyr and fuzzyjoin could be:

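library(dplyr)
library(fuzzyjoin)

# df1 is the Event table, df2 is ProductDimension
stringdist_inner_join(df1, 
                      df2 %>%
                        mutate(code = sub("-", "", code)),  # drop the hyphen
                      method = "lv",                        # Levenshtein distance
                      by = c("ProductId" = "code"))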

  id    date  ProductId quantity code   name  type 
  <chr> <chr>     <int>    <int> <chr>  <chr> <chr>
1 a     xyz     1234567       30 234567 p1    c1   
2 a     abc     5826811       20 82681  p2    c2   
3 b     def     3619100       10 619100 p3    c3   
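To see why a distance-based join picks up these rows, it can help to compute the edit distances directly. The following is a minimal sketch in base R that uses `utils::adist()` as a stand-in for the "lv" method (both are assumed here to compute plain Levenshtein distance with unit costs). The vectors are copied from the sample tables in the question, and `lev()` is a hypothetical helper written only for this illustration.

```r
# Sample keys copied from the question's tables (matched pairs only)
product_id <- c("1234567", "5826811", "3619100")
code       <- c("234-567", "826-81", "61-9100")

# Hypothetical helper: pairwise Levenshtein distance via base R's adist()
lev <- function(a, b) {
  mapply(function(x, y) as.vector(utils::adist(x, y)), a, b, USE.NAMES = FALSE)
}

# Distances after removing the hyphen from the code
dist_no_hyphen <- lev(product_id, sub("-", "", code, fixed = TRUE))
print(dist_no_hyphen)    # 1 2 1

# Distances when the hyphen is left in place
dist_with_hyphen <- lev(product_id, code)
print(dist_with_hyphen)  # 2 3 2
```

With the hyphen removed, every matching pair is within a distance of 2, which is the default `max_dist` of the `stringdist_*_join()` functions. With the hyphen left in, the largest distance is 3, so a larger threshold is needed.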

Alternatively, if you specify a maximum distance, you can skip the sub() step (and dplyr):

stringdist_inner_join(df1, 
                      df2,
                      method = "lv",
                      max_dist = 3,
                      by = c("ProductId" = "code"))
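A distance threshold is approximate and could, on other data, also pair up unrelated IDs that happen to be a few edits apart. If the code (without its hyphen) is always an exact substring of ProductId, as in the sample data, a stricter check is possible. The following is a sketch of that different technique in base R, not the fuzzyjoin approach above; the data frames are rebuilt from the question's tables, and `find_product()` and `match_idx` are hypothetical names used only here. It assumes each ProductId contains at most one code and keeps the first hit.

```r
# Sample data rebuilt from the question
Event <- data.frame(
  id = c("a", "a", "b", "b"),
  date = c("xyz", "abc", "def", "ghi"),
  ProductId = c(1234567L, 5826811L, 3619100L, 9268420L),
  quantity = c(30L, 20L, 10L, 50L),
  stringsAsFactors = FALSE
)
ProductDimension <- data.frame(
  code = c("234-567", "826-81", "61-9100"),
  name = c("p1", "p2", "p3"),
  type = c("c1", "c2", "c3"),
  stringsAsFactors = FALSE
)

# Join key: the product code with hyphens removed
key <- gsub("-", "", ProductDimension$code, fixed = TRUE)

# Hypothetical helper: row index of the first product whose key occurs
# inside the given ProductId, or NA if no key is contained in it
find_product <- function(pid) {
  hits <- which(vapply(key, function(k) grepl(k, pid, fixed = TRUE),
                       logical(1), USE.NAMES = FALSE))
  if (length(hits) > 0) hits[1] else NA_integer_
}

match_idx <- vapply(as.character(Event$ProductId), find_product,
                    integer(1), USE.NAMES = FALSE)

# Attach name and type, then keep only matched rows (inner-join behaviour)
eventAU <- cbind(Event, ProductDimension[match_idx, c("name", "type")])
eventAU <- eventAU[!is.na(match_idx), ]
rownames(eventAU) <- NULL
print(eventAU)
```

On the sample data this keeps the three events whose ProductId contains a known code and drops 9268420, which contains none.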
answered 2020-07-31T08:44:55.803