r - 加快插值练习

Question

我正在对大约 120 万个观测值运行大约 45,000 个局部线性回归（基本上），所以我很感激一些帮助试图加快速度，因为我不耐烦。

我基本上是在尝试为一堆公司构建逐年工资合同——功能工资（给定公司、年份、职位的经验）。

这是我正在使用的数据（基本结构）集：

> wages
         firm year position exp salary
      1: 0007 1996        4   1  20029
      2: 0007 1996        4   1  23502
      3: 0007 1996        4   1  22105
      4: 0007 1996        4   2  23124
      5: 0007 1996        4   2  22700
     ---                              
1175141:  994 2012        5   2  47098
1175142:  994 2012        5   2  45488
1175143:  994 2012        5   2  47098
1175144:  994 2012        5   3  45488
1175145:  994 2012        5   3  47098

我想为所有公司构建经验水平 0 到 40 的工资函数，例如：

> salary_scales
        firm year position exp   salary
     1: 0007 1996        4   0       NA
     2: 0007 1996        4   1 21878.67
     3: 0007 1996        4   2 23401.33
     4: 0007 1996        4   3 23705.00
     5: 0007 1996        4   4 24260.00
    ---                                
611019: 9911 2015        4  36       NA
611020: 9911 2015        4  37       NA
611021: 9911 2015        4  38       NA
611022: 9911 2015        4  39       NA
611023: 9911 2015        4  40       NA

为此，我一直在使用 COBS（约束 B 样条曲线）包（在 @BondedDust 的建议下），它允许我建立工资合同的单调性。

一些问题仍然存在；特别是，当我需要推断时（只要给定的公司没有任何非常年轻或非常老的员工），拟合就有失去单调性或降至 0 以下的趋势。

为了解决这个问题，我一直在数据边界之外使用简单的线性外推——将拟合曲线延伸到外面min_exp，max_exp使其通过两个最低（或最高）拟合点——并不完美，但它似乎正在做很不错。

考虑到这一点，这就是我到目前为止的做法（请记住，我是个data.table狂热分子）：

#get the range of experience for each firm
wages[,min_exp:=min(exp),by=.(year,firm,position)]
wages[,max_exp:=max(exp),by=.(year,firm,position)]
#Can't interpolate if there are only 2 or 3 unique experience cells represented
wages[,node_count:=length(unique(exp)),by=.(year,firm,position)]
#Nor if there are too few teachers
wages[,ind_count:=.N,by=.(year,firm,position)]
#Also troublesome when there is little variation in salaries like so:
wages[,sal_scale_flag:=mean(abs(salary-mean(salary)))<50,by=.(year,firm,position)]
wages[,sal_count_flag:=length(unique(salary))<5,by=.(year,firm,position)]

cobs_extrap<-function(exp,salary,min_exp,max_exp,
                      constraint="increase",print.mesg=F,nknots=8,
                      keep.data=F,maxiter=150){
  #these are passed as vectors
  min_exp<-min_exp[1]
  max_exp<-min(max_exp[1],40)
  #get in-sample fit
  in_sample<-predict(cobs(x=exp,y=salary,
                          constraint=constraint,
                          print.mesg=print.mesg,nknots=nknots,
                          keep.data=keep.data,maxiter=maxiter),
                     z=min_exp:max_exp)[,"fit"]

  #append by linear extension below min_exp
  c(if (min_exp==1) NULL else in_sample[1]-
      (min_exp:1)*(in_sample[2]-in_sample[1]),in_sample,
    #append by linear extension above max_exp
    if (max_exp==40) NULL else in_sample[length(in_sample)]+(1:(40-max_exp))*
      (in_sample[length(in_sample)]-in_sample[length(in_sample)-1]))
}

salary_scales<-
  wages[node_count>=7&ind_count>=10
               &sal_scale_flag==0&sal_count_flag==0,
               .(exp=0:40,
                 salary=cobs_extrap(exp,salary,min_exp,max_exp)),
               by=.(year,firm,position)]

请注意可能会减慢我的代码速度的特别之处？还是我被迫忍耐？

在这里玩的是一些较小的坚定头寸组合：

    firm year position exp salary count
 1: 0063 2010        5   2  37433    10
 2: 0063 2010        5   2  38749    10
 3: 0063 2010        5   4  38749    10
 4: 0063 2010        5   8  42700    10
 5: 0063 2010        5  11  47967    10
 6: 0063 2010        5  15  50637    10
 7: 0063 2010        5  19  51529    10
 8: 0063 2010        5  23  50637    10
 9: 0063 2010        5  33  52426    10
10: 0063 2010        5  37  52426    10
11: 9908 2006        4   1  26750    10
12: 9908 2006        4   6  36043    10
13: 9908 2006        4   7  20513    10
14: 9908 2006        4   8  45023    10
15: 9908 2006        4  13  33588    10
16: 9908 2006        4  15  46011    10
17: 9908 2006        4  15  37179    10
18: 9908 2006        4  22  43704    10
19: 9908 2006        4  28  56078    10
20: 9908 2006        4  29  44866    10

score 5 · Accepted Answer

您的代码中有很多可以改进的地方，但让我们关注这里的主要瓶颈。手头的问题可以被认为是一个令人尴尬的并行问题。这意味着您的数据可以分成多个较小的部分，每个部分都可以在单独的线程上单独计算，而无需任何额外开销。

要查看当前问题的并行化可能性，您应该首先注意您正在为每个单独的公司和/或分别执行完全相同的计算。例如，您可以将计算拆分为每年的较小子任务，然后将这些子任务分配给不同的 CPU/GPU 内核。以这种方式可以获得显着的性能增益。最后，当子任务的处理完成后，您唯一需要做的就是合并结果。

但是，R 及其所有内部库作为单线程运行。您必须明确拆分数据，然后将子任务分配给不同的核心。为了实现这一点，存在许多支持多线程的 R 包。我们将doparallel在此处的示例中使用该包。

您没有提供足够大的显式数据集来有效测试性能，因此我们将首先创建一些随机数据：

set.seed(42)
wages<-data.table(firm=substr(10001:10010,2,5)[sample(10,size=1e6,replace=T)],
                  year=round(unif(1e6,1996,2015)),
                  position=round(runif(1e6,4,5)),
                  exp=round(runif(1e6,1,40)),
                  salary=round(exp(rnorm(1e6,mean=10.682,sd=.286))))
> wages
         firm year position exp salary
      1: 0001 1996        4  14  66136
      2: 0001 1996        4   3  42123
      3: 0001 1996        4   9  46528
      4: 0001 1996        4  11  35195
      5: 0001 1996        4   2  43926
     ---                              
 999996: 0010 2015        5  11  43140
 999997: 0010 2015        5  23  64025
 999998: 0010 2015        5  31  35266
 999999: 0010 2015        5  11  36267
1000000: 0010 2015        5   7  44315

现在，让我们运行代码的第一部分：

#get the range of experience for each firm
wages[,min_exp:=min(exp),by=.(year,firm,position)]
wages[,max_exp:=max(exp),by=.(year,firm,position)]
#Can't interpolate if there are only 2 or 3 unique experience cells represented
wages[,node_count:=length(unique(exp)),by=.(year,firm,position)]
#Nor if there are too few teachers
wages[,ind_count:=.N,by=.(year,firm,position)]
#Also troublesome when there is little variation in salaries like so:
wages[,sal_scale_flag:=mean(abs(salary-mean(salary)))<50,by=.(year,firm,position)]
wages[,sal_count_flag:=length(unique(salary))<5,by=.(year,firm,position)]
> wages
         firm year position exp salary min_exp max_exp node_count ind_count sal_scale_flag sal_count_flag
      1: 0001 1996        4  14  66136       1      40         40      1373          FALSE          FALSE
      2: 0001 1996        4   3  42123       1      40         40      1373          FALSE          FALSE
      3: 0001 1996        4   9  46528       1      40         40      1373          FALSE          FALSE
      4: 0001 1996        4  11  35195       1      40         40      1373          FALSE          FALSE
      5: 0001 1996        4   2  43926       1      40         40      1373          FALSE          FALSE
     ---                                                                                                 
 999996: 0010 2015        5  11  43140       1      40         40      1326          FALSE          FALSE
 999997: 0010 2015        5  23  64025       1      40         40      1326          FALSE          FALSE
 999998: 0010 2015        5  31  35266       1      40         40      1326          FALSE          FALSE
 999999: 0010 2015        5  11  36267       1      40         40      1326          FALSE          FALSE
1000000: 0010 2015        5   7  44315       1      40         40      1326          FALSE          FALSE

我们现在将wages像您之前所做的那样以单线程方式处理。注意，我们先保存原始数据，以便稍后对其进行多线程操作并比较结果：

start <- Sys.time()
salary_scales_1 <-
  wages[node_count>=7&ind_count>=10
        &sal_scale_flag==0&sal_count_flag==0,
        .(exp=0:40,salary=cobs_extrap(exp,salary,min_exp,max_exp)),
        by=.(firm,year,position)]
print(paste("No Parallelisation time: ",Sys.time()-start))
> print(paste("No Parallelisation time: ",Sys.time()-start))
[1] "No Parallelisation time:  1.13971961339315"
> salary_scales_1
       firm year position exp   salary
    1: 0001 1996        4   0 43670.14
    2: 0001 1996        4   1 43674.00
    3: 0001 1996        4   2 43677.76
    4: 0001 1996        4   3 43681.43
    5: 0001 1996        4   4 43684.99
   ---                                
16396: 0010 2015        5  36 44464.02
16397: 0010 2015        5  37 44468.60
16398: 0010 2015        5  38 44471.35
16399: 0010 2015        5  39 44472.27
16400: 0010 2015        5  40 43077.70

处理所有内容大约需要 1 分 8 秒。请注意，在我们的虚拟示例中我们只有 10 家不同的公司，这就是为什么与您的本地结果相比处理时间并不那么重要的原因。

现在，让我们尝试以并行方式执行此任务。如前所述，对于我们的示例，我们希望每年拆分数据并将较小的子部分分配给单独的核心。我们将doParallel为此目的使用该软件包：

我们需要做的第一件事是创建一个具有特定数量核心的集群。在我们的示例中，我们将尝试使用所有可用的内核。接下来，我们必须注册集群并将一些变量导出到子节点的全局环境中。在这种情况下，子节点只需要访问wages. 此外，还需要在节点上评估一些依赖库才能使其工作。在这种情况下，节点需要同时访问data.frame和cobs库。代码如下所示：

library(doParallel)
start <- Sys.time()
cl <- makeCluster(detectCores()); 
registerDoParallel(cl); 
clusterExport(cl,c("wages"),envir=environment());
clusterEvalQ(cl,library("data.table"));
clusterEvalQ(cl,library("cobs"));
salary_scales_2 <- foreach(i = 1996:2015) %dopar%
  {
    subSet <- wages[.(i)] # binary subsetting
    subSet[node_count>=7&ind_count>=10
           &sal_scale_flag==0&sal_count_flag==0,
           .(exp=0:40,
             salary=cobs_extrap(exp,salary,min_exp,max_exp)),
           by=.(firm,year,position)]
  }
stopCluster(cl)
print(paste("With parallelisation time: ",Sys.time()-start))
> print(paste("With parallelisation time: ",Sys.time()-start))
[1] "With parallelisation time:  23.4177722930908"

我们现在有一个数据表列表，salary_scales_2其中包含每个单独年份的子结果。注意处理时间的加快：这次只用了 23 秒，而不是原来的 1.1 分钟（提高了 65%）。我们现在唯一需要做的就是合并结果。我们可以使用do.call("rbind", salary_scales_2)为了将表的行合并在一起（这几乎不需要时间——一次运行 0.002 秒）。最后，我们还进行了一个小检查，以验证多线程结果确实与单线程运行的结果相同：

salary_scales_2<-do.call("rbind",salary_scales_2)
identical(salary_scales_1,salary_scales_2)
> identical(salary_scales_1,salary_scales_2)
[1] TRUE

回复评论 这确实是一个非常有趣的例子，但我认为你可能在这里错过了更重要的问题。确实执行了与data.table内存和结构相关的优化，以便您以更有效的方式查询和访问您的数据。但是，在此示例中，没有主要的内存或搜索相关瓶颈，尤其是当您与cobs函数中的实际总数据处理时间进行比较时。例如，您更改的线路subSet <- wages[year==uniqueYears[i],]在您计时时每次调用仅需大约 0.04 秒。

如果您在运行中使用分析器，那么您会注意到它不是data.table请求并行化的或任何操作或分组，而是cobs占用几乎所有处理时间的函数（而且这个函数甚至不使用 adata.table作为输入）。我们在示例中尝试做的是将cobs函数的总工作负载重新分配给不同的内核，以实现我们的加速。我们的目的不是拆分data.table业务，因为它们根本不昂贵。但是，我们确实必须拆分我们的 data.table，因为我们需要为单独的cobs函数运行拆分数据。事实上，我们甚至利用了data.table在拆分和合并表时在所有方面都很有效。这根本不需要额外的时间。

r - 加快插值练习

1 回答 1

Related

Reference