r - What is the best way to fuzzy join (multiple) data frames on multiple columns?

Question

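I need to join multiple data frames. Because the experiment was run online and participants were often sloppy when entering their ID, I added redundancy: they also had to enter letters from their parents' first names and digits from their postal code. I checked the IDs manually (sort of) and there are a lot of errors. So now I need to merge by multiple columns, not just by participant ID.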

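I think the fuzzyjoin package makes the most sense, but I'm not sure how to merge multiple data frames by multiple columns. Should I just fuzzy join one data frame at a time? I have six in total.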

Thanks a lot!

Here are some excerpts to give you an idea:

structure(list(participant = c("107", "110", "111", "116", "140", 
"141"), Vorname_Mutter_2_Buchstaben = c("th", "ro", "mo", "es", 
"br", "gl"), Vorname_Vater_2_Buchstaben = c("al", "ha", "wa", 
"th", "he", "re"), PLZ_letzte_2_Ziffern = c(28L, 4L, 23L, 10L, 
15L, 90L), date = structure(c(1587307867.619, 1586435099.121, 
1586424077.282, 1587733915.271, 1586794445.732, 1586896454.853
), tzone = "UTC", class = c("POSIXct", "POSIXt")), mean_RT = c(0.658042654028436, 
0.612637426900585, 0.721700276752767, 0.532778303249097, 0.448516151241535, 
0.59286090389016)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

structure(list(participant = c("001", "240", "242", "243", "244", 
"245"), Vorname_Mutter_2_Buchstaben = c("ma", "el", "ur", "ka", 
"ja", "la"), Vorname_Vater_2_Buchstaben = c("he", "ma", "re", 
"jo", "fe", "ab"), PLZ_letzte_2_Ziffern = c(27L, 3L, 3L, 0L, 
47L, 66L), date = structure(c(1588072799.367, 1586624239.667, 
1586260007.882, 1586712365.514, 1586275669.545, 1586696526.84
), tzone = "UTC", class = c("POSIXct", "POSIXt")), RT_moving_variance = c(6258.46945397108, 
5172.19983111429, 5032.90280000055, 5906.46678346693, 18694.9916770777, 
7065.17254133398)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

structure(list(participant = c("1", "105", "107", "110", "111", 
"116"), Vorname_Mutter_2_Buchstaben = c("ma", "an", "th", "ro", 
"mo", "es"), Vorname_Vater_2_Buchstaben = c("he", "ce", "al", 
"ha", "wa", "th"), PLZ_letzte_2_Ziffern = c("27", "0", "28", 
"4", "23", "10"), date = structure(c(1588071580.734, 1587402995.471, 
1587306792.774, 1586434189.309, 1586422686.217, 1587732745.487
), tzone = "UTC", class = c("POSIXct", "POSIXt")), on_task_mean = c(1, 
1, 2, 2, 1, 1)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

structure(list(participant = c("270", "494", "261", "171", "177", 
"323"), Vorname_Mutter_2_Buchstaben = c("se", "br", "ma", "do", 
"ir", "li"), Vorname_Vater_2_Buchstaben = c("na", "th", "ar", 
"sv", "re", "ur"), PLZ_letzte_2_Ziffern = c("02", "38", "67", 
"03", "10", "07"), date = structure(c(1586187946.415, 1586212359.648, 
1586251863.165, 1586255167.624, 1586255616.763, 1586258326.743
), tzone = "UTC", class = c("POSIXct", "POSIXt")), Alter = c(26, 
27, 21, 28, 25, 22)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

score 0 · Accepted Answer

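Assuming each row holds the data for a single subject: the column PLZ_letzte_2_Ziffern is integer in the first two data frames and character in the third and fourth, so you need to convert the latter two (or all of them) to numeric.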
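
A quick way to confirm the mismatch before joining is to print the class of that column in each data frame. The following is only a minimal base R sketch; `a` and `b` are hypothetical toy stand-ins for two of the data frames, not the real data.

```r
# Toy stand-ins for two of the data frames (hypothetical names a and b).
a <- data.frame(participant = c("107", "110"),
                PLZ_letzte_2_Ziffern = c(28L, 4L))
b <- data.frame(participant = c("270", "494"),
                PLZ_letzte_2_Ziffern = c("02", "38"))

frames <- list(a = a, b = b)

# Report the class of the postal-code column in each data frame.
plz_class <- sapply(frames, function(d) class(d$PLZ_letzte_2_Ziffern))
print(plz_class)

# Convert the column to numeric everywhere; the string "02" becomes 2.
frames <- lapply(frames, function(d) {
  d$PLZ_letzte_2_Ziffern <- as.numeric(d$PLZ_letzte_2_Ziffern)
  d
})
```

Note that the conversion drops leading zeros, so "02" and 2 end up as the same value, which is what the join needs here.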

Then you can either do an iterative full join or simply use bind_rows. Here are two possible tidyverse solutions. I have named the data frames s1 through s4.

library(tidyverse)

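# Coerce the postal-code column to numeric in every data frame so the types agree
list_of_subjects <- list(s1, s2, s3, s4) %>% 
  map(~{.x %>% 
      mutate(PLZ_letzte_2_Ziffern = as.numeric(PLZ_letzte_2_Ziffern))})

# Full-join the data frames one after another; with no `by`, all shared columns are used as keys
df <- list_of_subjects %>% reduce(full_join)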
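
One thing to watch with the iterative join: when `by` is omitted, full_join uses every shared column as a key, and that includes date. In the excerpts the same participant has different timestamps in different data frames, so those rows would not line up. The sketch below shows the difference on toy data. Dropping date and the choice of `keys` are assumptions made for illustration, not part of the answer above.

```r
library(dplyr)
library(purrr)

# Toy stand-ins (hypothetical): same participant, different timestamps.
s1 <- data.frame(participant = "107",
                 Vorname_Mutter_2_Buchstaben = "th",
                 Vorname_Vater_2_Buchstaben = "al",
                 PLZ_letzte_2_Ziffern = 28,
                 date = as.POSIXct("2020-04-19 14:51:07", tz = "UTC"),
                 mean_RT = 0.658)
s3 <- data.frame(participant = "107",
                 Vorname_Mutter_2_Buchstaben = "th",
                 Vorname_Vater_2_Buchstaben = "al",
                 PLZ_letzte_2_Ziffern = 28,
                 date = as.POSIXct("2020-04-19 14:33:12", tz = "UTC"),
                 on_task_mean = 2)

# Identifying columns to join on (an assumption for this sketch).
keys <- c("participant", "Vorname_Mutter_2_Buchstaben",
          "Vorname_Vater_2_Buchstaben", "PLZ_letzte_2_Ziffern")

# Default behaviour: date is part of the key, so the two rows stay separate.
by_all <- full_join(s1, s3)

# Drop date and join on the identifying columns only: one combined row.
by_keys <- list(s1, s3) %>%
  map(function(d) select(d, -date)) %>%
  reduce(function(x, y) full_join(x, y, by = keys))

nrow(by_all)
nrow(by_keys)
```

If the timestamps are needed later, they could be renamed per data frame before joining instead of being dropped.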

Alternatively, you can use map_dfr, which row-binds the output data frames.

df <- map_dfr(list(s1, s2, s3, s4), ~{
  .x %>% 
    mutate(PLZ_letzte_2_Ziffern = as.numeric(PLZ_letzte_2_Ziffern))})

Using the new colwise functionality in dplyr 1.0.0:

df <- map_dfr(list(s1, s2, s3, s4), ~ {.x %>% 
    mutate(across(starts_with("PLZ"), as.numeric))})

Hard-coding the column PLZ_letzte_2_Ziffern isn't great, but it works for this particular example.
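
Neither solution above does any fuzzy matching. For the multi-column fuzzy join the question asks about, one possible sketch uses fuzzy_full_join from the fuzzyjoin package with one match function per column. The toy data frames `x` and `y` and the helper `same_id` are hypothetical. The matching rule, which treats IDs as equal when they agree as integers so that "001" matches "1", is an assumption chosen to fit the excerpts, not a general recommendation.

```r
library(fuzzyjoin)

# Toy stand-ins (hypothetical): the ID "001" was entered as "1" in y.
x <- data.frame(participant = c("001", "107"),
                Vorname_Mutter_2_Buchstaben = c("ma", "th"),
                mean_RT = c(0.61, 0.66))
y <- data.frame(participant = c("1", "107"),
                Vorname_Mutter_2_Buchstaben = c("ma", "th"),
                on_task_mean = c(1, 2))

# Hypothetical helper: two IDs match if they are equal as integers.
same_id <- function(a, b) as.integer(a) == as.integer(b)

# One match function per `by` column: lenient on the ID, exact on the letters.
joined <- fuzzy_full_join(
  x, y,
  by = c("participant", "Vorname_Mutter_2_Buchstaben"),
  match_fun = list(same_id, `==`)
)

print(joined)
```

Because the key columns share names in both inputs, the result carries them with .x and .y suffixes. With more than two data frames, the same call could be applied one data frame at a time, cleaning up the suffixed columns after each step.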
