c - Enabling HVX SIMD in Hexagon DSP by using instruction intrinsics

Question

I am using Hexagon-SDK 3.0 to compile my sample application for the HVX DSP architecture. There are many Hexagon-LLVM related tools available, located in the following folder:

~/Qualcomm/HEXAGON_Tools/7.2.12/Tools/bin

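I wrote a small example that calculates the product of two arrays, to make sure I can use HVX hardware acceleration. However, when I generate my assembly, either with -S or with -S -emit-llvm, I do not find any HVX instructions such as vmem, vX, etc. For now my C application runs in hexagon-sim, until I manage to find a way to run it on the board as well.

As far as I understand, I need to write the HVX parts of my code with C intrinsics, but I was not able to adapt the existing examples to my own needs. It would be great if somebody could demonstrate how this is done. Also, many of the intrinsic instructions are not defined in the Hexagon V62 Programmer's Reference Manual.

Here is my small application in pure C: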

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#if defined(__hexagon__)
#include "hexagon_standalone.h"
#include "subsys.h"
#endif
#include "io.h"
#include "hvx.cfg.h"


#define KERNEL_SIZE     9
#define Q               8
#define PRECISION       (1<<Q)

double vectors_dot_prod2(const double *x, const double *y, int n)
{
    double res = 0.0;
    int i = 0;
    for (; i <= n-4; i+=4)
    {
        res += (x[i] * y[i] +
                x[i+1] * y[i+1] +
                x[i+2] * y[i+2] +
                x[i+3] * y[i+3]);
    }
    for (; i < n; i++)
    {
        res += x[i] * y[i];
    }
    return res;
}


int main (int argc, char* argv[])
{
    int n = 4;  /* element count; must be initialized before the allocations below */
    long long start_time, total_cycles;
/* -----------------------------------------------------*/
/*  Allocate memory for input/output                    */
/* -----------------------------------------------------*/
    //double *res  = memalign(VLEN, 4 *sizeof(double));
    const double *x  = memalign(VLEN, n *sizeof(double));
    const double *y  = memalign(VLEN, n *sizeof(double));

    if (x == NULL || y == NULL) {
        printf("Error: Could not allocate Memory for image\n");
        return 1;
    }
    #if defined(__hexagon__)
        subsys_enable();
        SIM_ACQUIRE_HVX;
    #if LOG2VLEN == 7
        SIM_SET_HVX_DOUBLE_MODE;
    #endif
    #endif

    /* -----------------------------------------------------*/                                                
    /*  Call function                                       */
    /* -----------------------------------------------------*/
    RESET_PMU();
    start_time = READ_PCYCLES();
    
    vectors_dot_prod2(x,y,n);

    total_cycles = READ_PCYCLES() - start_time;
    DUMP_PMU();



    printf("Array product of x[i] * y[i] = %f\n",vectors_dot_prod2(x,y,4));

    #if defined(__hexagon__)
        printf("AppReported (HVX%db-mode):  Array product of x[i] * y[i] =%f\n", VLEN, vectors_dot_prod2(x,y,4));
    #endif

return 0;
}

I compile it using hexagon-clang:

hexagon-clang -v  -O2 -mv60 -mhvx-double -DLOG2VLEN=7 -I../../common/include -I../include -DQDSP6SS_PUB_BASE=0xFE200000 -o arrayProd.o  -c  arrayProd.c

Then I link it with subsys.o (found in the SDK and already compiled) and -lhexagon to generate my executable:

hexagon-clang -O2 -mv60 -o arrayProd.exe  arrayProd.o subsys.o -lhexagon

Finally, I run it using the sim:

hexagon-sim -mv60 arrayProd.exe

score 3 · Accepted Answer

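A bit late, but it may still be useful.

Hexagon Vector eXtensions are not emitted automatically, and the current instruction set (as of the 8.0 SDK) supports integer manipulation only, so the compiler will not emit anything for C code containing the "double" type. It is similar to SSE programming: you have to manually pack the xmm registers and use SSE intrinsics to do what you need.

You need to define what your application really requires. For example, if you are writing something 3D-related and really need to calculate double (or float) dot products, you might convert your floats to 16.16 fixed point and then use instructions (i.e. C intrinsics) like Q6_Vw_vmpyio_VwVh and Q6_Vw_vmpye_VwVuh to emulate fixed-point multiplication.

To "enable" HVX you should use the HVX-related types defined in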
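To make the 16.16 conversion concrete, here is a minimal scalar sketch in portable C. It is not HVX code and does not use the intrinsics named above; the type and function names (q16_16, q16_from_double, q16_to_double, q16_mul) are made up for illustration, and it assumes inputs with magnitude below 32768 so they fit the format.

```c
#include <stdint.h>

/* Q16.16: 16 integer bits and 16 fractional bits in a signed 32-bit word.
   All names here are hypothetical helpers, not part of the Hexagon SDK. */
typedef int32_t q16_16;

#define Q16_ONE 65536 /* 1.0 expressed in Q16.16 */

/* Convert a double to Q16.16, rounding to nearest.
   Assumption: |x| < 32768, no overflow check is done. */
static q16_16 q16_from_double(double x)
{
    return (q16_16)(x * Q16_ONE + (x >= 0 ? 0.5 : -0.5));
}

/* Convert back to double, e.g. for printing a result. */
static double q16_to_double(q16_16 v)
{
    return (double)v / Q16_ONE;
}

/* Multiply two Q16.16 values: form the 64-bit product, then drop the
   extra 16 fractional bits (division truncates toward zero). */
static q16_16 q16_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * (int64_t)b) / Q16_ONE);
}
```

For example, q16_mul(q16_from_double(1.5), q16_from_double(2.0)) gives the Q16.16 encoding of 3.0. A vectorized version would do the same arithmetic on every 32-bit lane of a vector at once.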

#include <hexagon_types.h>
#include <hexagon_protos.h>

Instructions like 'vmem' and 'vmemu' are emitted automatically

// I assume 64-byte mode, no `-mhvx-double`. For 128-byte mode use 32 int array
int values[16] = { 1, 2, 3, ..... };

/* The following line compiles to 
     {
          r4 = __address_of_values
          v1 = vmem(r4 + #0)
     }
   You can get the exact code by using '-S' switch, as you already do
*/
HVX_Vector v = *(HVX_Vector*)values;

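Your (fixed-point) version of dot_product may read 16 integers at a time, multiply all 16 integers in a couple of instructions (see the HVX62 programming manual; there is a tip on implementing 32-bit integer multiplication from 16-bit multiplication), then shuffle/deal/ror the data around and sum up the rearranged vectors to get the dot product. This way you can calculate 4 dot products almost at once, and if you preload 4 HVX registers (that is, 16 4D vectors) you can calculate 16 dot products in parallel.

If what you are doing is really just byte/integer image processing, you might use the specific 16-bit and 8-bit hardware dot products in the Hexagon instruction set instead of emulating doubles and floats.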
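The data flow described above can be modelled in plain C to check the arithmetic before writing intrinsics. The sketch below is only a scalar model under some assumptions: 64-byte mode (16 lanes of 32 bits), Q16.16 inputs small enough that the sums do not overflow, and hypothetical helper names (floor_div_65536, q16_mul_split, q16_dot). It does not claim to reproduce the exact semantics of any HVX instruction; it only shows a 32-bit multiply assembled from 16-bit halves, followed by a rotate-and-add reduction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16 /* assumption: 64-byte vectors, i.e. 16 x 32-bit lanes */

/* Floor division by 65536 (what an arithmetic shift right by 16 computes). */
static int64_t floor_div_65536(int64_t v)
{
    return (v >= 0) ? v / 65536 : -((-v + 65535) / 65536);
}

/* One lane of a Q16.16 multiply, built from two partial products:
   b is split into a signed high half and an unsigned low half. */
static int32_t q16_mul_split(int32_t a, int32_t b)
{
    int64_t b_lo = (int64_t)((uint32_t)b & 0xFFFFu); /* low 16 bits, unsigned */
    int64_t b_hi = ((int64_t)b - b_lo) / 65536;      /* high 16 bits, signed  */
    return (int32_t)((int64_t)a * b_hi + floor_div_65536((int64_t)a * b_lo));
}

/* Dot product of two Q16.16 arrays; n must be a multiple of LANES. */
static int32_t q16_dot(const int32_t *x, const int32_t *y, size_t n)
{
    int32_t acc[LANES] = {0}; /* models one vector accumulator register */
    assert(n % LANES == 0);

    /* Lane-wise multiply-accumulate, one "vector" of LANES words per step. */
    for (size_t i = 0; i < n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            acc[l] += q16_mul_split(x[i + l], y[i + l]);

    /* log2(LANES) rotate-and-add steps; afterwards every lane holds the total. */
    for (size_t shift = LANES / 2; shift > 0; shift /= 2) {
        int32_t rot[LANES];
        for (size_t l = 0; l < LANES; l++)
            rot[l] = acc[(l + shift) % LANES];
        for (size_t l = 0; l < LANES; l++)
            acc[l] += rot[l];
    }
    return acc[0];
}
```

The reduction loop is where a real implementation would use the shuffle/deal/ror instructions; the model makes it easy to compare a vectorized result against a known-good scalar one in the simulator.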
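As an illustration of what such a byte dot product computes, here is a scalar model in portable C. My understanding is that the HVX reduce-multiply instructions (the vrmpy family) produce, per 32-bit output lane, the sum of four byte products, but please check the programmer's reference manual for the exact semantics. The function names dot4_ub_b and dot4_groups are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: one 32-bit output lane is the sum of four
   (unsigned byte) x (signed byte) products, e.g. pixels times filter taps. */
static int32_t dot4_ub_b(const uint8_t px[4], const int8_t w[4])
{
    int32_t sum = 0;
    for (int k = 0; k < 4; k++)
        sum += (int32_t)px[k] * (int32_t)w[k];
    return sum;
}

/* Apply the same 4 weights to consecutive groups of 4 pixels.
   px holds 4 * n_groups bytes, out receives n_groups words. */
static void dot4_groups(const uint8_t *px, const int8_t w[4],
                        int32_t *out, size_t n_groups)
{
    for (size_t g = 0; g < n_groups; g++)
        out[g] = dot4_ub_b(px + 4 * g, w);
}
```

With pixels {10, 20, 30, 40} and weights {1, -1, 2, 0}, one output word is 10 - 20 + 60 + 0 = 50. No conversion to fixed point is needed here because the data is already integer.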
