numpy - Is there a GPU accelerated numpy.max(X, axis=0) implementation in Theano?

Question

Is there a GPU-accelerated version of numpy.max(X, axis=None) in Theano? I looked through the documentation and found theano.tensor.max(X, axis=None), but it is 4-5 times slower than the numpy implementation.

I can assure you it is not slow because of a bad choice of matrix size. On the same matrix, theano.tensor.exp is 40 times faster than its numpy counterpart.

Any suggestions?

score 5 · Accepted Answer

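The previous answer is incomplete. Its suggestion should not help, because the workaround it proposes is what the final compiled code already uses: an optimization performs that transformation automatically.

The title of the question is not the same as its content. They differ in the axis argument. I will answer both.

If axis is 0 or None, we support this operation on the GPU for matrices. With axis=None, we have a basic implementation that is not well optimized, because that case is harder to parallelize. With axis=0, the implementation is also basic, but it is faster because that case is easier to parallelize.

Also, how did you do your timing? If you build one function containing only that operation and test it with the device=gpu flag for your comparison, the measurement will include the transfer time between CPU and GPU. This is a memory-bound operation, so if you include the transfer in your timing, I personally would not expect any speedup in that case. To see only the GPU operation, use the Theano profiler: run with the Theano flag profile=True.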
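
To make the axis distinction concrete, here is a small sketch in plain NumPy (not Theano's GPU kernel). The array X is made up for illustration. It only shows what the two reductions compute and why axis=0 splits into independent per-column problems, while axis=None has to combine every element into one value.

```python
import numpy as np

# Hypothetical 2x3 matrix, used only for illustration.
X = np.array([[1.0, 9.0, 3.0],
              [4.0, 2.0, 8.0]])

# axis=0: one reduction per column; the columns do not depend on each other.
col_max = np.max(X, axis=0)

# axis=None (the default): a single reduction over every element.
global_max = np.max(X)

# The global maximum can also be reached in two stages:
# reduce each column first, then reduce the per-column results.
two_stage = np.max(col_max)

print(col_max, global_max, two_stage)
```

The two-stage form is only one possible way to organize a full reduction; whether Theano's GPU code does something similar is not stated in this answer.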

score 3 · Accepted Answer

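The max and exp operations are fundamentally different. exp (like other operations such as addition, sin, etc.) is an elementwise operation that is trivially parallelizable, whereas max requires a parallel scan-style algorithm that essentially builds a tree of pairwise comparisons over the array. Speeding up max is not impossible, but it is not as easy as speeding up exp.

In any case, the Theano implementation of max basically consists of the following lines (in theano/tensor/basic.py):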
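
The difference can be sketched in pure Python. This is an illustrative, sequential model of the two access patterns, not Theano's implementation; the names elementwise_exp and tree_max are hypothetical. Each level of the tree could run its comparisons in parallel, but the levels themselves must run one after another.

```python
import math


def elementwise_exp(values):
    # Each output depends on exactly one input, so every item is independent.
    return [math.exp(v) for v in values]


def tree_max(values):
    """Return (maximum, number of tree levels) using pairwise comparisons."""
    if not values:
        raise ValueError("tree_max() needs at least one value")
    level = list(values)
    steps = 0
    while len(level) > 1:
        # Comparisons inside one level are independent of each other,
        # but each level has to wait for the previous level to finish.
        nxt = [max(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries over unchanged
        level = nxt
        steps += 1
    return level[0], steps


data = [3.0, -1.0, 7.5, 2.0, 7.0, 0.5, 4.0, 6.0]
largest, steps = tree_max(data)
print(largest, steps)  # prints: 7.5 3
```

For n inputs the tree needs about log2(n) dependent levels, which is the synchronization cost that an elementwise operation like exp does not have.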

try:
    out = max_and_argmax(x, axis)[0]
except Exception:
    out = CAReduce(scal.maximum, axis)(x)

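where max_and_argmax is a bunch of custom code that, as far as I can tell, implements a combined max+argmax operation using numpy, and CAReduce is a generic GPU-accelerated scan operation used as a fallback (which, according to the comments, does not support grad etc.). You could try using the fallback directly and see whether it is faster, perhaps something like this: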

from theano.tensor.elemwise import CAReduce
from theano.scalar import maximum

def mymax(X, axis=None):
    # Call the generic reduction directly, bypassing max_and_argmax.
    return CAReduce(maximum, axis)(X)