r - Fast search in a data.table, or fast subsetting

Question

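I have a DF of 800k+ rows with repeated (random) values. For each row, I need to take a value and find the indices of the other rows that have the same value. For example, "asd": where else do I see it? The index of the current row is not needed.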

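My current solution: subset the DF, creating a temporary frame/table with the current row removed. The problem: every 1000 iterations take about a minute, so 800k+ rows would take about 13 hours to run. Any ideas? Thanks!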

Running on the original DF (without subsetting) takes < 1 second, but as you can imagine it gives me the index of the current row.

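Edit: my real-life DF has more than one column. The example below is simplified. I need to get the row numbers of the other rows whose V1 value equals V1[1], then repeat for V1[2] and so on, for each row.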

library(data.table)  # needed for as.data.table()
library(fastmatch)
library(stringi)
set.seed(12345)
V1 = stringi::stri_rand_strings(800000, 3)
df0 = as.data.table(V1)
mapped = matrix("",nrow=800000)

print(Sys.time())
for (i in 1:1000) {
  tmp_df = df0[-i,] # this takes a very long time!
  mapped[i] = fmatch(df0$V1[i],tmp_df$V1)
}
print(Sys.time())

View(mapped)
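
A note on the loop above, as a small base-R sketch: `match` stands in for `fmatch` here, and the toy vector `v` is made up for illustration. It shows two things about matching against the shortened table: only the first match is returned, and its position refers to the shortened vector, so it has to be shifted back to get a row number in the original data.

```r
# Toy data (hypothetical values, not from the question)
v <- c("asd", "qwe", "asd", "zxc", "asd")
i <- 1

# Matching against the full vector finds the current row itself
self <- match(v[i], v)

# Matching against the vector without row i gives a position
# in the shortened vector, and only the first match
pos <- match(v[i], v[-i])

# Positions at or after i are shifted by one relative to the original
orig <- pos + (pos >= i)

# All rows holding the same value, including the current one
all_rows <- which(v == v[i])
```

Here `orig` is the original row number of the first other match, while `all_rows` still contains row `i` itself.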

score 2 · Accepted Answer

Data:

library("data.table")
set.seed(12345)
V1 = stringi::stri_rand_strings(80, 3)
df0 <- data.table( sample(V1, 100, replace = TRUE ))

Code:

df0[, id := list(list(.I)), by = V1]  # integer id

Output:

head(df0, 10)
#     V1          id
# 1: iuR      1,2,21
# 2: iuR      1,2,21
# 3: KXc           3
# 4: LwA           4
# 5: pYn           5
# 6: qoN        6,66
# 7: 5Xt           7
# 8: wBH        8,77
# 9: V9r     9,39,54
# 10: 9ks 10,28,42,48
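
For comparison, the same grouping idea can be sketched in base R without data.table. This is only an illustration under assumptions: the toy vector `v` is made up, and it assumes no `NA` or empty strings, since the lookup goes through list names.

```r
# Toy vector standing in for df0$V1 (hypothetical values)
v <- c("asd", "qwe", "asd", "zxc", "asd", "qwe")

# Group row indices by value: one integer vector per distinct string
idx_by_value <- split(seq_along(v), v)

# Look up each row's group by name, giving one list element per row,
# analogous to the `id` list column above
id <- unname(idx_by_value[v])
```

With this toy input, `id[[1]]` holds rows 1, 3 and 5, and `id[[4]]` holds only row 4.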

Edit - removing the current index:

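df0[, id2 := 1:.N ]  # temporary row number, so each row is its own group
df0[, id := list(list(unlist(id)[ unlist(id) != .I  ] )), by = id2 ]  # drop the row's own index
df0[, id2 := NULL ]  # remove the helper column
df0[ lengths(id) > 0, ]  # rows that have at least one other match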
head( df0, 10 )
#     V1       id
# 1: iuR     2,21
# 2: iuR     1,21
# 3: KXc         
# 4: LwA         
# 5: pYn         
# 6: qoN       66
# 7: 5Xt         
# 8: wBH       77
# 9: V9r    39,54
# 10: 9ks 28,42,48
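
The "drop the current index" step can also be sketched in base R. Again this is a hedged illustration, not the answer's method: `v` is a made-up toy vector, `others` is a hypothetical name, and `NA` or empty strings are assumed absent.

```r
# Toy vector standing in for df0$V1 (hypothetical values)
v <- c("asd", "qwe", "asd", "zxc", "asd", "qwe")

# Row indices grouped by value, then expanded to one entry per row
idx_by_value <- split(seq_along(v), v)
per_row <- unname(idx_by_value[v])

# For row i, keep every index in its group except i itself
others <- Map(function(idx, i) idx[idx != i], per_row, seq_along(v))

# Rows whose value occurs nowhere else end up with integer(0)
has_match <- lengths(others) > 0
```

With this toy input, `others[[1]]` holds rows 3 and 5, and `others[[4]]` is empty because "zxc" occurs only once.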
