c++ - 为什么处理多个数据流比处理一个数据流慢？

Question

我正在测试读取多个数据流如何影响 CPU 缓存性能。我正在使用以下代码对此进行基准测试。基准测试读取顺序存储在内存中的整数并顺序写回部分和。从中读取的顺序块的数量是变化的。以循环方式读取来自块的整数。

#include <iostream>
#include <vector>
#include <chrono>
using std::vector;
void test_with_split(int num_arrays) {
    int num_values = 100000000;
    // Fix up the number of values. The effect of this should be insignificant.
    num_values -= (num_values % num_arrays);
    int num_values_per_array = num_values / num_arrays;
    // Initialize data to process
    auto results = vector<int>(num_values);
    auto arrays = vector<vector<int>>(num_arrays);
    for (int i = 0; i < num_arrays; ++i) {
        arrays.emplace_back(num_values_per_array);
    }
    for (int i = 0; i < num_values; ++i) {
        arrays[i%num_arrays].emplace_back(i);
        results.emplace_back(0);
    }
    // Try to clear the cache
    const int size = 20*1024*1024; // Allocate 20M. Set much larger then L2
    char *c = (char *)malloc(size);
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < size; j++)
            c[j] = i*j;
    free(c);
    auto start = std::chrono::high_resolution_clock::now();
    // Do the processing
    int sum = 0;
    for (int i = 0; i < num_values; ++i) {
        sum += arrays[i%num_arrays][i/num_arrays];
        results[i] = sum;
    }
    std::cout << "Time with " << num_arrays << " arrays: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start).count() << " ms\n";
}
int main() {
    int num_arrays = 1;
    while (true) {
        test_with_split(num_arrays++);
    }
}

以下是在 Intel Core 2 Quad CPU Q9550 @ 2.83GHz 上拆分 1-80 路的时间：

拆分为不同数量的流所花费的时间

8 个流之后的速度提升对我来说是有意义的，因为处理器具有 8 路关联 L1 缓存。24 路关联 L2 缓存反过来解释了 24 个流的颠簸。如果我得到的效果与为什么一个循环比两个循环慢这么多？，其中多个大分配总是在同一个关联集中结束。为了比较，我在最后一个大块中完成分配的时间包括在内。

但是，我并不完全理解从一个流到两个流的颠簸。我自己的猜测是它与预取到 L1 缓存有关。阅读Intel 64 and IA-32 Architectures Optimization Reference Manual似乎 L2 流式预取器支持跟踪多达 32 个数据流，但没有为 L1 流式预取器提供此类信息。L1 预取器是否无法跟踪多个流，或者这里还有其他东西在起作用？

背景

我正在对此进行调查，因为我想了解将游戏引擎中的实体组织为数组结构样式中的组件如何影响性能。目前看来，转换所需的数据在两个组件中而不是在 8-10 个组件中对于现代 CPU 来说并不重要。但是，上面的测试表明，有时避免将某些数据拆分为多个组件可能是有意义的，如果这将允许“瓶颈”转换仅使用一个组件，即使这意味着某些其他转换必须读取它是不感兴趣。

在一个块中分配

如果改为分配多个数据块，则仅以跨步方式分配和访问一个数据块时的时间如下。这不会将凹凸从一个流变为两个，但为了完整起见，我将其包括在内。

只分配一个大块的时间

这是修改后的代码：

void test_with_split(int num_arrays) {
    int num_values = 100000000;
    num_values -= (num_values % num_arrays);
    int num_values_per_array = num_values / num_arrays;

    // Initialize data to process
    auto results = vector<int>(num_values);
    auto array = vector<int>(num_values);
    for (int i = 0; i < num_values; ++i) {
        array.emplace_back(i);
        results.emplace_back(0);
    }

    // Try to clear the cache
    const int size = 20*1024*1024; // Allocate 20M. Set much larger then L2
    char *c = (char *)malloc(size);
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < size; j++)
            c[j] = i*j;
    free(c);

    auto start = std::chrono::high_resolution_clock::now();
    // Do the processing
    int sum = 0;
    for (int i = 0; i < num_values; ++i) {
        sum += array[(i%num_arrays)*num_values_per_array+i/num_arrays];
        results[i] = sum;
    }
    std::cout << "Time with " << num_arrays << " arrays: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start).count() << " ms\n";
}

编辑 1

我确保 1 对 2 拆分的差异不是由于编译器展开循环并以不同方式优化第一次迭代。使用__attribute__ ((noinline))我确保工作函数没有内联到主函数中。我通过查看生成的程序集验证了它没有发生。这些改变之后的时间是一样的。

score 1 · Accepted Answer

要回答您问题的主要部分：L1 预取器是否能够跟踪多个流？

不，这实际上是因为 L1 缓存根本没有预取器。L1 缓存不够大，无法冒险推测性地获取可能未使用的数据。这将导致过多的驱逐并对任何未以适合该特定 L1 缓存预测方案的特定模式读取数据的软件产生不利影响。相反，L1 缓存已显式读取或写入的数据。 L1 缓存仅在您写入数据和重新读取最近访问过的数据时才有用。

L1 缓存实现并不是您的配置文件从 1X 到 2X 阵列深度的原因。在像您设置的那样进行流式读取时，L1 缓存对性能的影响很小或根本没有影响。您的大部分读取都直接来自 L2 缓存。在您使用嵌套向量的第一个示例中，可能从 L1 中提取了一些读取（见下文）。

我的猜测是你从 1X 到 2X 的提升与算法以及编译器如何优化它有很大关系。如果编译器知道num_arrays是一个等于 1 的常数，那么它将自动为您消除大量的每次迭代开销。

现在对于第二部分，至于为什么第二个版本更快？：

第二个版本更快的原因与其说是数据在物理内存中的排列方式，不如说是嵌套std::vector<std::vector<int>>类型的底层逻辑更改意味着什么。

在嵌套（第一种）情况下，编译后的代码执行以下步骤：

访问顶级std::vector类。此类包含指向数据数组开头的指针。
该指针值必须从内存中加载。
将当前循环偏移量添加[i%num_arrays]到该指针。
访问嵌套std::vector类数据。（可能 L1 缓存命中）
加载指向向量开始的数据数组的指针。（可能 L1 缓存命中）
添加循环偏移[i/num_arrays]
读取数据。最后！

（请注意，在 24 个左右的流之后，第 4 步和第 5 步获得 L1 缓存命中的机会会急剧下降，因为在通过循环的下一次迭代之前可能会被驱逐）

相比之下，第二个版本：

访问顶级std::vector类。
加载指向向量开始的数据数组的指针。
添加偏移量[(i%num_arrays)*num_values_per_array+i/num_arrays]
读取数据！

删除了一整套引擎盖下的步骤。偏移量的计算稍长一些，因为它需要额外的乘以num_values_per_array。但其他步骤不仅仅是弥补它。

c++ - 为什么处理多个数据流比处理一个数据流慢？

背景

在一个块中分配

编辑 1

1 回答 1

Related

Reference