cuda - Memory accesses with the CUDA profiler nvprof

Question

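I am using nvprof to get the number of global memory accesses for the following CUDA code. The kernel performs 36 loads (reads of the d_In array) and 36+36 stores (writes to the d_Out and d_rows arrays). So the total number of global memory loads is 36 and the number of global memory stores is 72. However, when I profile the code with nvprof, it reports the following (ultimately I want to compute the compute to global memory access (CGMA) ratio):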

      1                gld_transactions        Global Load Transactions           6           6           6
      1                gst_transactions       Global Store Transactions          11          11          11
      1            l2_read_transactions            L2 Read Transactions         133         133         133
      1           l2_write_transactions           L2 Write Transactions          24          24          24


#include <stdio.h>
#include "cuda_profiler_api.h"
__constant__ int crows;

__global__ void kernel(double *d_In, double *d_Out, int *d_rows){
        int tx=threadIdx.x;
        int bx=blockIdx.x;
        int n=bx*blockDim.x+tx;
        if(n < 36){
                d_Out[n]=d_In[n]+1;
                d_rows[n]=crows;
        }
        return;
}

int main(int argc,char **argv){

     double I[36]={1,5,9,2,6,10,3,7,11,4,8,12,13,17,21,14,18,22,15,19,23,16,20,24,25,29,33,26,30,34,27,31,35,28,32,36};

     double *d_In;
     double *d_Out;
     int *d_rows;

     double Iout[36];
     int rows=5;
     int h_rows[36];

     cudaMemcpyToSymbol(crows,&rows,sizeof(int));
     cudaMalloc(&d_In,sizeof(double)*36);
     cudaMalloc(&d_Out,sizeof(double)*36);
     cudaMalloc(&d_rows,sizeof(int)*36);

     cudaMemcpy(d_In,I,sizeof(double)*36,cudaMemcpyHostToDevice);

     dim3 dimGrid(4,1,1);
     dim3 dimBlock(10,1,1);

     cudaProfilerStart();
     kernel<<<dimGrid,dimBlock>>>(d_In,d_Out,d_rows);
     cudaProfilerStop();

     cudaMemcpy(Iout,d_Out,sizeof(double)*36,cudaMemcpyDeviceToHost);
      cudaMemcpy(h_rows,d_rows,sizeof(int)*36,cudaMemcpyDeviceToHost);


    int i;
     for(i=0;i<36;i++)
       printf("%f %d\n",Iout[i],h_rows[i]);


}

Can anyone help me? Thanks.

score 2 · Accepted Answer

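It is usually better to ask a more specific question than "Can anyone help me?". As posted, your code has no floating-point operations (+, *, etc.), so there is no CGMA to compute (it is zero).

Regarding memory transactions, your code launches 4 thread blocks: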

 dim3 dimGrid(4,1,1);

Each thread block can run on a separate multiprocessor, and each block has 10 threads. The following line of code:

            d_Out[n]=d_In[n]+1;

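will generate at least one global load transaction (for d_In) and one global store transaction (for d_Out) per thread block to service its threads. In the fourth block, the active threads have global indices (n) of 30-35. When this block executes the line above, it generates two global load and two global store transactions, because its threads need two cache lines to service their requests. So this line of code could generate 5 global load transactions and 5 global store transactions.

For similar reasons, the next line of code: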

            d_rows[n]=crows;

could generate 5 additional global store transactions. So, looking at your profiler output:

  1                gld_transactions        Global Load Transactions           6           6           6
  1                gst_transactions       Global Store Transactions          11          11          11

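I believe I have accounted for 5 of the 6 global load transactions and 10 of the 11 global store transactions. Hopefully that is enough to show you where these numbers come from.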
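The cache-line accounting above can be sketched as a small host-side model. This is only an illustration under stated assumptions, not a description of what the hardware or nvprof actually does: it assumes a 128-byte transaction granularity, assumes each array's base address is aligned to that granularity, and treats each 10-thread block as a single warp issuing one contiguous access. The real transaction size depends on the GPU architecture and cache configuration. The names `kSegmentBytes`, `segments_touched`, and `modeled_transactions` are hypothetical helpers introduced here.

```cpp
#include <cstddef>

// ASSUMPTION: transactions are counted in 128-byte segments and the array
// base is aligned to a segment boundary. Real hardware may differ.
constexpr std::size_t kSegmentBytes = 128;

// Number of distinct segments covered when one block's active threads access
// the contiguous elements [first, last] (inclusive) of an array.
std::size_t segments_touched(std::size_t first, std::size_t last,
                             std::size_t elem_bytes) {
    std::size_t first_seg = (first * elem_bytes) / kSegmentBytes;
    std::size_t last_seg = ((last + 1) * elem_bytes - 1) / kSegmentBytes;
    return last_seg - first_seg + 1;
}

// Sum of segments over all blocks. Thread n = bx * block_dim + tx is active
// only when n < count, mirroring the `if (n < 36)` guard in the kernel.
// ASSUMPTION: block_dim <= 32, so each block is a single warp.
std::size_t modeled_transactions(std::size_t grid_dim, std::size_t block_dim,
                                 std::size_t count, std::size_t elem_bytes) {
    std::size_t total = 0;
    for (std::size_t bx = 0; bx < grid_dim; ++bx) {
        std::size_t first = bx * block_dim;
        if (first >= count) continue;            // block has no active threads
        std::size_t last = first + block_dim - 1;
        if (last >= count) last = count - 1;     // clip to the active range
        total += segments_touched(first, last, elem_bytes);
    }
    return total;
}
```

Under these assumptions, `modeled_transactions(4, 10, 36, 8)` gives 6 for d_In (and likewise 6 for d_Out), and `modeled_transactions(4, 10, 36, 4)` gives 5 for d_rows, which adds up to 6 loads and 11 stores. In this model the second block (n = 10-19, bytes 80-159 of a double array) also straddles a 128-byte boundary, which may be where the one remaining load and the one remaining store come from. The model says nothing about the L2 counters.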
