我有一个简单的 CUDA 内核,可以通过基本归约来进行矢量累加。我将其扩展为能够通过将其拆分为多个块来处理更大的数据。但是,我关于分配适当数量的共享内存以供内核使用的假设因非法内存访问而失败。当我增加这个限制时它就消失了,但我想知道为什么。这是我正在谈论的代码:
核心内核:
__global__ static
void vec_add(int *buffer,
int numElem, // The actual number of elements
int numIntermediates) // The next power of two of numElem
{
extern __shared__ unsigned int interim[];
int index = blockDim.x * blockIdx.x + threadIdx.x;
// Copy global intermediate values into shared memory.
interim[threadIdx.x] =
(index < numElem) ? buffer[index] : 0;
__syncthreads();
// numIntermediates2 *must* be a power of two!
for (unsigned int s = numIntermediates / 2; s > 0; s >>= 1) {
if (threadIdx.x < s) {
interim[threadIdx.x] += interim[threadIdx.x + s];
}
__syncthreads();
}
if (threadIdx.x == 0) {
buffer[blockIdx.x] = interim[0];
}
}
这是调用者:
void accumulate (int* buffer, int numElem)
{
unsigned int numReductionThreads =
nextPowerOfTwo(numElem); // A routine to return the next higher power of 2.
const unsigned int maxThreadsPerBlock = 1024; // deviceProp.maxThreadsPerBlock
unsigned int numThreadsPerBlock, numReductionBlocks, reductionBlockSharedDataSize;
while (numReductionThreads > 1) {
numThreadsPerBlock = numReductionThreads < maxThreadsPerBlock ?
numReductionThreads : maxThreadsPerBlock;
numReductionBlocks = (numReductionThreads + numThreadsPerBlock - 1) / numThreadsPerBlock;
reductionBlockSharedDataSize = numThreadsPerBlock * sizeof(unsigned int);
vec_add <<< numReductionBlocks, numThreadsPerBlock, reductionBlockSharedDataSize >>>
(buffer, numElem, numReductionThreads);
numReductionThreads = nextPowerOfTwo(numReductionBlocks);
}
}
我在 GPU 上使用包含 1152 个元素的样本集尝试了此代码,配置如下:类型:Quadro 600 MaxThreadsPerBlock:1024 MaxSharedMemory:48KB
输出:
Loop 1: numElem = 1152, numReductionThreads = 2048, numReductionBlocks = 2, numThreadsPerBlock = 1024, reductionBlockSharedDataSize = 4096
Loop 2: numElem = 1152, numReductionThreads = 2, numReductionBlocks = 1, numThreadsPerBlock = 2, reductionBlockSharedDataSize = 8
CUDA Error 77: an illegal memory access was encountered
怀疑我的“临时”共享内存导致非法内存访问,我在以下行中任意将共享内存增加了两倍:
reductionBlockSharedDataSize = 2 * numThreadsPerBlock * sizeof(unsigned int);
我的内核开始正常工作了!
我不明白的是 - 为什么我必须提供这个额外的共享内存来让我的问题消失(暂时)。
作为检查这个幻数的进一步实验,我用一个更大的数据集运行了我的代码,数据集有 6912 个点。这一次,即使是 2X 或 4X 也没有用。
Loop 1: numElem = 6912, numReductionThreads = 8192, numReductionBlocks = 8, numThreadsPerBlock = 1024, reductionBlockSharedDataSize = 16384
Loop 2: numElem = 6912, numReductionThreads = 8, numReductionBlocks = 1, numThreadsPerBlock = 8, reductionBlockSharedDataSize = 128
CUDA Error 77: an illegal memory access was encountered
但是当我将共享内存大小增加 8 倍时,问题又消失了。
当然,我不能为越来越大的数据集随意选择这个比例因子,因为我很快就会用完 48KB 的共享内存限制。所以我想知道解决我的问题的合法方法。