r - as.h2o() in R takes a long time to upload data to the h2o environment

Question

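I am using h2o for some modelling and have tuned a model. I now want to use it to make a large number of predictions: about 6 billion predictions/rows, with each prediction row needing 80 columns of data.

I have already broken the input dataset down into about 500 chunks of 12 million rows each, with each chunk containing the relevant 80 columns.

However, uploading a data.table of 12 million rows by 80 columns to h2o takes quite a long time, and doing it 500 times would take prohibitively long. I think this is because it parses the object before uploading it.

The prediction part is relatively quick in comparison.

Any suggestions for speeding this part up? Would changing the number of cores help?

Below is a reproducible example of the issue: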

  # Load libraries
  library(h2o)
  library(data.table)

  # start up h2o using all cores...
  localH2O = h2o.init(nthreads=-1,max_mem_size="16g")

  # create a test input dataset
  temp <- CJ(v1=seq(20),
             v2=seq(7),
             v3=seq(24),
             v4=seq(60),
             v5=seq(60))
  temp <- do.call(cbind,lapply(seq(16),function(y){temp}))
  colnames(temp) <- paste0('v',seq(80))

  # this is the part that takes a long time!!
  system.time(tmp.obj <- as.h2o(localH2O,temp,key='test_input'))

  #|======================================================================| 100%
  #   user  system elapsed 
  #357.355   6.751 391.048

score 13 · Accepted Answer

Since you are running H2O locally, you want to save that data as a file and then use:

h2o.importFile(localH2O, file_path, key='test_input')

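This lets each thread read its own part of the file in parallel. If you run H2O on a separate server, you need to copy the data to a location the server can read from (most people do not set up their servers to read from the file system on their laptops).

as.h2o() uploads the file to H2O serially. With h2o.importFile(), the H2O server locates the file and reads it in parallel.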
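
A minimal sketch of that pattern, assuming the H2O version 3 R API (where h2o.importFile() takes path and destination_frame instead of a connection object and key) and a local H2O instance that can read R's temporary directory. The helper name load_via_csv is hypothetical, and fwrite() from data.table is just one fast way to write the file, not something this answer prescribes.

```r
library(h2o)
library(data.table)

# Hypothetical helper: write an in-memory table to a CSV file, then ask the
# H2O server to import that file itself instead of pushing it with as.h2o().
# Assumes h2o.init() has already been called and H2O runs on this machine.
load_via_csv <- function(x, key, csv_path = tempfile(fileext = ".csv")) {
  # Write the table to disk from R.
  fwrite(x, file = csv_path)

  # The server locates the file and parses it in parallel.
  h2o.importFile(path = csv_path, destination_frame = key)
}
```

With the test data from the question, this would be called as tmp.obj <- load_via_csv(temp, "test_input") after h2o.init().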

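It looks like you are using version 2 of H2O. The same commands will work in H2O v3, but some of the parameter names have changed slightly. The new parameter names are here: http://cran.r-project.org/web/packages/h2o/h2o.pdf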

score 5 · Accepted Answer

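Having also struggled with this problem, I ran some tests and found that for objects in R memory (i.e. you don't have the luxury of already having them available in .csv or .txt form), by far the quickest way to load them (~21x) is to use the fwrite function from data.table to write a csv to disk and read it in with h2o.importFile.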

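The four methods I tried:

1. Use as.h2o() directly
2. Write to disk with write.csv(), then load with h2o.importFile()
3. Split the data in half, run as.h2o() on each half, then combine with h2o.rbind()
4. Write to disk with fwrite() from data.table, then load with h2o.importFile()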

I ran the tests on data.frames of varying size, and the results seem pretty clear.

The code is below, if anyone is interested in reproducing.

library(h2o)
library(data.table)
h2o.init()

testdf <-as.data.frame(matrix(nrow=4000000,ncol=100))
testdf[1:1000000,] <-1000       # R won't let me assign the whole thing at once
testdf[1000001:2000000,] <-1000
testdf[2000001:3000000,] <-1000
testdf[3000001:4000000,] <-1000

resultsdf <-as.data.frame(matrix(nrow=20,ncol=5))
names(resultsdf) <-c("subset","method 1 time","method 2 time","method 3 time","method 4 time")
for(i in 1:20){
    subdf <- testdf[1:(200000*i),]
    resultsdf[i,1] <-200000*i       # number of rows in this subset
    
    # 1: use as.h2o()
    
    start <-Sys.time()
    as.h2o(subdf)
    stop <-Sys.time()
    resultsdf[i,2] <-as.numeric(stop)-as.numeric(start)
    
    # 2: use write.csv then h2o.importFile() 

    start <-Sys.time()
    write.csv(subdf,"hundredsandthousands.csv",row.names=FALSE)
    h2o.importFile("hundredsandthousands.csv")
    stop <-Sys.time()
    resultsdf[i,3] <-as.numeric(stop)-as.numeric(start)

    # 3: Split dataset in half, load both halves, then merge

    start <-Sys.time()
    length_subdf <-dim(subdf)[1]
    h2o1 <-as.h2o(subdf[1:(length_subdf/2),])
    h2o2 <-as.h2o(subdf[(1+length_subdf/2):length_subdf,])
    h2o.rbind(h2o1,h2o2)
    stop <-Sys.time()
    resultsdf[i,4] <- as.numeric(stop)-as.numeric(start)
    
    # 4: use fwrite then h2o.importFile()

    start <-Sys.time()
    fwrite(subdf,file="hundredsandthousands.csv",row.names=FALSE)
    h2o.importFile("hundredsandthousands.csv")
    stop <-Sys.time()
    resultsdf[i,5] <-as.numeric(stop)-as.numeric(start)

    plot(resultsdf[,1],resultsdf[,2],xlim=c(0,4000000),ylim=c(0,900),xlab="rows",ylab="time/s",main="Scaling of different methods of h2o frame loading")
    for (j in 1:3){   # j, not i, so the outer loop counter is not overwritten
        points(resultsdf[,1],resultsdf[,(j+2)],col=j+1)
        }
    legendtext <-c("as.h2o","write.csv then h2o.importFile","Split in half, as.h2o and rbind","fwrite then h2o.importFile")
    legend("topleft",legend=legendtext,col=c(1,2,3,4),pch=1)

    print(resultsdf)
    flush.console()
    }
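
Applied to the chunked prediction job in the question, the fastest method (fwrite, then h2o.importFile) might look roughly like the sketch below. It assumes the H2O version 3 R API, a local H2O instance already started with h2o.init(), and a trained model object. The helper name score_chunk and the frame key chunk_input are hypothetical.

```r
library(h2o)
library(data.table)

# Hypothetical helper: score one in-memory chunk by way of a temporary CSV.
# Assumes h2o.init() has been called and `model` is a trained H2O model.
score_chunk <- function(chunk, model, tmp_dir = tempdir()) {
  csv_path <- file.path(tmp_dir, "chunk_input.csv")
  on.exit(unlink(csv_path), add = TRUE)   # delete the CSV when done

  # Write the chunk to disk with data.table's fast writer.
  fwrite(chunk, file = csv_path)

  # Let the H2O server read and parse the file itself.
  chunk_hex <- h2o.importFile(path = csv_path,
                              destination_frame = "chunk_input")

  # Predict on the cluster, then pull only the predictions back into R.
  pred_hex <- h2o.predict(model, chunk_hex)
  preds <- as.data.table(as.data.frame(pred_hex))

  # Free cluster memory before the next chunk is loaded.
  h2o.rm(chunk_hex)
  h2o.rm(pred_hex)

  preds
}
```

Given a list of chunks (here a hypothetical chunks object), the whole job would then be results <- lapply(chunks, score_chunk, model = model). Pulling the predictions back into R is itself a transfer, so for very large outputs it may be worth writing them out on the server side instead.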
