2

I'm combining similar names using stringdist(), and have it working using lapply, but it's taking 11 hours to run through 500k rows and I'd like to see if a data.table solution would work faster. Here's an example and my attempted solution so far built from readings here, here, here, here, and here, but I'm not quite pulling it off:

library(stringdist)
library(data.table)
data("mtcars")
mtcars$cartype <- rownames(mtcars)
mtcars$id <- seq_len(nrow(mtcars))

I'm currently using lapply() to cycle through the strings in the cartype column and bring together those rows whose cartype names are closer than a specified value (.08).

output <- lapply(1:length(mtcars$cartype), function(x) mtcars[which(stringdist(mtcars$cartype[x], mtcars$cartype, method ="jw", p=0.08)<.08), ])

> output[1:3]
[[1]]
              mpg cyl disp  hp drat    wt  qsec vs am gear carb       cartype id
Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4     Mazda RX4  1
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4 Mazda RX4 Wag  2

[[2]]
              mpg cyl disp  hp drat    wt  qsec vs am gear carb       cartype id
Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4     Mazda RX4  1
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4 Mazda RX4 Wag  2

[[3]]
            mpg cyl disp hp drat   wt  qsec vs am gear carb    cartype id
Datsun 710 22.8   4  108 93 3.85 2.32 18.61  1  1    4    1 Datsun 710  3

Data Table Attempt:

mtcarsdt <- as.data.table(mtcars)    
myfun <- function(x) mtcars[which(stringdist(mtcars$cartype[x], mtcars$cartype, method ="jw", p=0.08)<.08), ]

An intermediate step: This code pulls similar names based on the row's value that I manually plug into myfun(), but it repeats that value for all the rows.

res <- mtcarsdt[,.(vlist = list(myfun(1))),by=id]
res$vlist[[1]] #correctly combines the 2 mazda names
res$vlist[[6]] #but it's repeated down the line

I'm now trying to cycle through all the rows using set(). I'm close, but although the code appears to be correctly matching the text from the 12th column (cartype) it's returning the values from the first column, mpg:

for (i in 1:32) set(mtcarsdt,i ,12L, myfun(i))
> mtcarsdt
     mpg cyl  disp  hp drat    wt  qsec vs am gear carb                   cartype id
 1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4                 c(21, 21)  1
 2: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4                 c(21, 21)  2
 3: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1                      22.8  3

Now, this is pretty hacky, but I found that if I create a copy of the cartype column and place it in the first column it pretty much works, but there must be a cleaner way to do this. Also, it would be nice to keep the output in a list form like the lapply() output above as I have other post-processing steps set up for that format.

mtcars$cartypeorig <- mtcars$cartype
mtcars <- mtcars[,c(14,1:13)]
mtcarsdt <- as.data.table(mtcars)
for (i in 1:32) set(mtcarsdt,i ,13L, myfun(i))

 > mtcarsdt[1:14,cartype]
 [1] "c(\"Mazda RX4\", \"Mazda RX4 Wag\")"                        
 [2] "c(\"Mazda RX4\", \"Mazda RX4 Wag\")"                        
 [3] "Datsun 710"                                                 
 [4] "Hornet 4 Drive"                                             
 [5] "Hornet Sportabout"                                          
 [6] "Valiant"                                                    
 [7] "Duster 360"                                                 
 [8] "c(\"Merc 240D\", \"Merc 230\", \"Merc 280\")"               
 [9] "c(\"Merc 240D\", \"Merc 230\", \"Merc 280\", \"Merc 280C\")"
[10] "c(\"Merc 240D\", \"Merc 230\", \"Merc 280\", \"Merc 280C\")"
[11] "c(\"Merc 230\", \"Merc 280\", \"Merc 280C\")"               
[12] "c(\"Merc 450SE\", \"Merc 450SL\", \"Merc 450SLC\")"         
[13] "c(\"Merc 450SE\", \"Merc 450SL\", \"Merc 450SLC\")"         
[14] "c(\"Merc 450SE\", \"Merc 450SL\", \"Merc 450SLC\")"         
4

1 回答 1

0

您是否尝试过使用 的矩阵版本stringdist

res = stringdistmatrix(mtcars$cartype, mtcars$cartype, method = 'jw', p = 0.08)

out = as.data.table(which(res < 0.08, arr.ind = T))[, .(list(mtcars[row,])), by = col]$V1

identical(out, output)
#[1] TRUE

现在,您可能不能只为 500k X 500k 矩阵运行上述代码,但您可以将其拆分为更小的部分(选择适合您的数据/内存大小的大小):

size = 4 # dividing into pieces of size 4x4
         # I picked a divisible number, a little more work will be needed
         # if you have a residue (nrow(mtcars) = 32)
setDT(mtcars)

grid = CJ(seq_len(nrow(mtcars)/4), seq_len(nrow(mtcars)/4))

indices = grid[, {
            res = stringdistmatrix(mtcars[seq((V1-1)*size+1, (V1-1)*size + size), cartype],
                                   mtcars[seq((V2-1)*size+1, (V2-1)*size + size), cartype],
                                   method = 'jw', p = 0.08)
            out = as.data.table(which(res < 0.08, arr.ind = T))
            if (nrow(out) > 0)
              out[, .(row = (V1-1)*size+row, col = (V2-1)*size +col)]
          }, by = .(V1, V2)]

identical(indices[, .(list(mtcars[row])), by = col]$V1, lapply(output, setDT))
#[1] TRUE
于 2016-05-02T17:01:09.950 回答