
I'm looking for a way to speed up the following approach. Any pointers are very welcome. Where are the bottlenecks?

Say I have the following data.frame:

df <- data.frame(names=c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"), 
                      v1=c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"), 
                      v2=c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"))

I want to compare each pair of rows in df on their Jaro-Winkler distance.

With help from others (see this post), I've been able to construct this code:

library(stringdist)

# columns to compare
testCols <- c("names", "v1", "v2")

# compare all pairs of rows
RowCompare <- function(x) {
  comp <- NULL
  pairs <- t(combn(nrow(x), 2))  # every pair of row indices
  for (i in 1:nrow(pairs)) {
    row_a <- pairs[i, 1]
    row_b <- pairs[i, 2]
    a_tests <- x[row_a, testCols]
    b_tests <- x[row_b, testCols]
    comp <- rbind(comp, c(row_a, row_b, TestsCompare(a_tests, b_tests)))
  }
  colnames(comp) <- c("row_a", "row_b", "names_j", "v1_j", "v2_j")
  return(comp)
}

# compare one pair of rows on each test column
TestsCompare <- function(x, y) {
  names_j <- stringdist(x$names, y$names, method = "jw")
  v1_j <- stringdist(x$v1, y$v1, method = "jw")
  v2_j <- stringdist(x$v2, y$v2, method = "jw")
  c(names_j, v1_j, v2_j)
}

This generates the correct output:

output = as.data.frame(RowCompare(df))

> output
   row_a row_b   names_j      v1_j      v2_j
1      1     2 0.4444444 0.1111111 0.0000000
2      1     3 0.3571429 0.0000000 0.1111111
3      1     4 0.4444444 0.1111111 0.1111111
4      1     5 0.4444444 0.1111111 0.1111111  
5      2     3 0.4603175 0.1111111 0.1111111
6      2     4 0.3333333 0.0000000 0.1111111
7      2     5 0.3333333 0.0000000 0.1111111
8      3     4 0.5634921 0.1111111 0.0000000
9      3     5 0.5634921 0.1111111 0.0000000
10     4     5 0.0000000 0.0000000 0.0000000

However, my real data.frame has 8 million observations, and I make 17 comparisons. Running this code takes days...

I am looking for ways to speed up this process:

  • Should I use matrices instead of data.frames?
  • How to parallelize this process?
  • Vectorize?
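On the vectorization point: stringdist() is itself vectorized over its string arguments, so one option (a sketch of the idea, not part of the accepted answer) is to build the pair-index matrix once and then make a single stringdist() call per column instead of one call per pair:

```r
library(stringdist)

df <- data.frame(names = c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"),
                 v1 = c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"),
                 v2 = c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"),
                 stringsAsFactors = FALSE)

# build all row pairs once, then one vectorized stringdist() call per column
pairs <- t(combn(nrow(df), 2))
output <- data.frame(
  row_a   = pairs[, 1],
  row_b   = pairs[, 2],
  names_j = stringdist(df$names[pairs[, 1]], df$names[pairs[, 2]], method = "jw"),
  v1_j    = stringdist(df$v1[pairs[, 1]], df$v1[pairs[, 2]], method = "jw"),
  v2_j    = stringdist(df$v2[pairs[, 1]], df$v2[pairs[, 2]], method = "jw")
)
```

This removes the per-pair rbind() and data.frame subsetting from the inner loop, though it still materializes all choose(n, 2) pairs, so it does not by itself solve the 8-million-row problem.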

1 Answer


If you iterate over the variables you want to check, you can create a distance matrix for each with stringdist::stringdistmatrix. Using a form of lapply or purrr::map will return a list of distance matrices (one per column), which you can in turn iterate over to call broom::tidy, which converts them into nicely formatted data.frames. If you use purrr::map_df and its .id parameter, the results will be coerced into one big data.frame, and the name of each list element will be added as a new column so you can keep them straight. The resulting data.frame will be in long form, so if you want it to match your results above, reshape it with tidyr::spread.

If, as you mention in the comments, you want to use different methods for different variables, iterate in parallel with map2 or Map.

In all,

library(tidyverse)

map2(df, c('soundex', 'jw', 'jw'), ~stringdist::stringdistmatrix(.x, method = .y)) %>% 
    map_df(broom::tidy, .id = 'var') %>% 
    spread(var, distance)

##    item1 item2 names        v1        v2
## 1      2     1     1 0.1111111 0.0000000
## 2      3     1     1 0.0000000 0.1111111
## 3      3     2     1 0.1111111 0.1111111
## 4      4     1     1 0.1111111 0.1111111
## 5      4     2     1 0.0000000 0.1111111
## 6      4     3     1 0.1111111 0.0000000
## 7      5     1     1 0.1111111 0.1111111
## 8      5     2     1 0.0000000 0.1111111
## 9      5     3     1 0.1111111 0.0000000
## 10     5     4     0 0.0000000 0.0000000

Note that while choose(5, 2) returns 10 observations, choose(8000000, 2) returns 3.2e+13 (32 trillion), so for practical purposes, even though this will work much faster than your existing code (and stringdistmatrix does some parallelization where possible), the data will get very big unless you only work with subsets.
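One way to "only work with subsets" is blocking: split the rows on a cheap key and build distance matrices only within each block, so the pair count collapses from choose(n, 2) to the sum of choose(n_block, 2) over blocks. A hedged sketch, where the blocking key (first character of the name) is a made-up example and not part of the answer above:

```r
library(stringdist)

df <- data.frame(names = c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"),
                 stringsAsFactors = FALSE)

# hypothetical blocking key: first character of the name; rows are only
# compared within a block, never across blocks
blocks <- split(df$names, substr(df$names, 1, 1))

# one (small) distance matrix per block instead of one giant n x n matrix
dists <- lapply(blocks, stringdistmatrix, method = "jw")
```

Here the five names fall into three blocks (A, J, S), so only 2 within-block pairs are computed instead of all 10; the trade-off is that any true matches that land in different blocks are never compared.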

Answered 2017-02-19T02:45:51.583