ios - MPSMatrixVectorMultiplication is too slow

Question

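I'm working on a GPU algorithm that does a lot of math, mostly with matrices and vectors. Although I already have good numbers for processing time, I still feel there is room for improvement.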

So I discovered the Metal Performance Shaders framework. Its description got me excited, because it promises fine-tuned, optimised kernel shaders for the math my GPU algorithm performs.

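I decided to start with MPSMatrixVectorMultiplication, because I have a large 11000x500 matrix that is multiplied by a vector of length 11000 to produce an output vector of length 500.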

So here is how I use it. Declarations of the MPS wrappers for the MTLBuffers and of the operation itself:

MPSMatrix *model;
MPSVector *vector;

id<MTLBuffer> resultBuffer;
MPSVector *resultVector;
MPSMatrixVectorMultiplication *matrixVectorMultiplication;

Initialisation of those MPS wrappers:

matrixVectorMultiplication = [[MPSMatrixVectorMultiplication alloc] initWithDevice:_ctx.device transpose:true rows:500 columns:11000 alpha:1 beta:0];

//......//

MPSVectorDescriptor *desc = [MPSVectorDescriptor vectorDescriptorWithLength:11000 dataType:MPSDataTypeFloat32];
vector = [[MPSVector alloc] initWithBuffer:vecBuffer descriptor:desc];

MPSVectorDescriptor *desc_out = [MPSVectorDescriptor vectorDescriptorWithLength:500 dataType:MPSDataTypeFloat32];
resultVector = [[MPSVector alloc] initWithBuffer:resultBuffer descriptor:desc_out];

//......//

MPSMatrixDescriptor *desc = [MPSMatrixDescriptor matrixDescriptorWithRows:11000 columns:500 rowBytes:500 * sizeof(float) dataType:MPSDataTypeFloat32];  //I need to transpose the matrix     
model = [[MPSMatrix alloc] initWithBuffer:testBuffer descriptor:desc];
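
To sanity-check what the GPU path returns, a plain C reference of the same operation (y = A^T x) can help; C also compiles as Objective-C. This is only a minimal sketch, not the kernel from the question: the name `matvec_transposed_ref` and its parameters are hypothetical, and it assumes a row-major float matrix whose row stride in floats equals rowBytes / sizeof(float).

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for y = A^T * x (hypothetical helper, for verification only).
 * a          : row-major matrix with `rows` rows and `cols` columns
 * row_stride : floats between consecutive rows (rowBytes / sizeof(float))
 * x          : input vector with `rows` elements
 * y          : output vector with `cols` elements */
void matvec_transposed_ref(const float *a, size_t rows, size_t cols,
                           size_t row_stride,
                           const float *x, float *y)
{
    assert(row_stride >= cols);

    for (size_t j = 0; j < cols; ++j) {
        y[j] = 0.0f;
    }

    /* Walk the matrix row by row so memory is read contiguously. */
    for (size_t i = 0; i < rows; ++i) {
        const float *row = a + i * row_stride;
        for (size_t j = 0; j < cols; ++j) {
            y[j] += row[j] * x[i];
        }
    }
}
```

With the sizes from the question this would be called with rows = 11000, cols = 500 and row_stride = 500, and its output compared element by element (within a small tolerance) against the contents of resultBuffer.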

And do the multiplication:

id<MTLCommandBuffer> cmdBuffer = [_ctx.commandQueue commandBuffer];
id<MTLComputeCommandEncoder> encoder = [cmdBuffer computeCommandEncoder];

// work with my own encoder, execute some commands

[encoder endEncoding];

[matrixVectorMultiplication encodeToCommandBuffer:cmdBuffer inputMatrix:model inputVector:vector resultVector:resultVector];

[cmdBuffer commit];
[cmdBuffer waitUntilCompleted]; // I have to wait because my algorithm is sequential at this point

Now, the kernel function I wrote performs exactly the same multiplication in about 0.8-1.1 ms. I was very sad to find that MPSMatrixVectorMultiplication takes 18-19 ms!

That is far too slow, and I can't believe such a result. Clearly I'm missing some small detail that is costing a lot of performance.

Has anyone used MPS in performance-sensitive code? I'd be glad to hear any tips I could apply in my GPU routines.

Thanks in advance!

score 0 · Accepted Answer

You need to do a few extra things:

Pad the matrix with zeros so that its size is divisible by 8
Use managed memory buffers on macOS (not available on iOS)
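
The padding step could look roughly like the following plain C (which also compiles as Objective-C). It is a sketch under assumptions: the helper names `round_up` and `pad_rows_with_zeros` are hypothetical, the source matrix is assumed to be tightly packed and row-major, and only the row width is padded here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Round n up to the next multiple of `multiple` (e.g. 500 -> 504 for 8). */
size_t round_up(size_t n, size_t multiple)
{
    assert(multiple > 0);
    return ((n + multiple - 1) / multiple) * multiple;
}

/* Copy a tightly packed row-major matrix (rows x cols) into dst, whose rows
 * are dst_cols floats wide; the extra columns are zero-filled.
 * dst must have room for rows * dst_cols floats. */
void pad_rows_with_zeros(const float *src, size_t rows, size_t cols,
                         float *dst, size_t dst_cols)
{
    assert(dst_cols >= cols);

    /* All-zero bytes are 0.0f for IEEE 754 floats. */
    memset(dst, 0, rows * dst_cols * sizeof(float));
    for (size_t i = 0; i < rows; ++i) {
        memcpy(dst + i * dst_cols, src + i * cols, cols * sizeof(float));
    }
}
```

For the matrix in the question, 11000 is already a multiple of 8, while 500 rounds up to 504, so each padded row would take 504 * sizeof(float) = 2016 bytes instead of 2000. The rowBytes value passed to the MPSMatrixDescriptor would need to match whatever stride the padded buffer actually uses.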

For more details, see these links:

Because memory is shared on iOS, you may not get the same speedup, but using the right matrix size should be faster!
