
Description

When allocating and freeing randomly sized memory blocks with 4 or more threads using OpenMP's parallel for construct, the program seems to start leaking considerable amounts of memory in the second half of the test program's runtime. It thus grows its consumed memory from 1050 MB to 1500 MB or more without actually making use of the extra memory.

As valgrind shows no issues, I must assume that what appears to be a memory leak is actually an accentuated effect of memory fragmentation.

Interestingly, the effect does not show yet if 2 threads make 10000 allocations each, but it shows strongly if 4 threads make 5000 allocations each. Also, if the maximum size of the allocated chunks is reduced to 256 KB (from 1 MB), the effect gets weaker.

Can heavy concurrency stress fragmentation that much? Or is this more likely to be a bug in the heap?

Test Program Description

The demo program is built to obtain a total of 256 MB of randomly sized memory chunks from the heap, doing 5000 allocations. If the memory limit is hit, the chunks allocated first are freed until the memory consumption falls below the limit. Once 5000 allocations have been performed, all memory is released and the loop ends. All of this work is done for every thread spawned by OpenMP.

This memory allocation scheme allows us to expect a memory consumption of about 260 MB per thread (including some bookkeeping data).

Demo Program

As this is really something you might want to test, you can download the sample program with a simple makefile from dropbox.

When running the program as-is, you should have at least 1400 MB of RAM free. Feel free to adjust the constants in the code to suit your needs.

For completeness, the actual code follows:

#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <vector>
#include <deque>

#include <omp.h>
#include <math.h>

typedef unsigned long long uint64_t;

void runParallelAllocTest()
{
    // constants
    const int  NUM_ALLOCATIONS = 5000; // alloc's per thread
    const int  NUM_THREADS = 4;       // how many threads?
    const int  NUM_ITERS = NUM_THREADS;// how many overall repetions

    const bool USE_NEW      = true;   // use new or malloc? , seems to make no difference (as it should)
    const bool DEBUG_ALLOCS = false;  // debug output

    // pre store allocation sizes
    const int  NUM_PRE_ALLOCS = 20000;
    const uint64_t MEM_LIMIT = (1024 * 1024) * 256;   // x MB per process
    const size_t MAX_CHUNK_SIZE = 1024 * 1024 * 1;

    srand(1);
    std::vector<size_t> allocations;
    allocations.resize(NUM_PRE_ALLOCS);
    for (int i = 0; i < NUM_PRE_ALLOCS; i++) {
        allocations[i] = rand() % MAX_CHUNK_SIZE;   // use up to x MB chunks
    }


    #pragma omp parallel num_threads(NUM_THREADS)
    #pragma omp for
    for (int i = 0; i < NUM_ITERS; ++i) {
        uint64_t totalAllocBytes = 0;
        uint64_t currAllocBytes = 0;

        std::deque< std::pair<char*, uint64_t> > pointers;
        const int myId = omp_get_thread_num();

        for (int j = 0; j < NUM_ALLOCATIONS; ++j) {
            // new allocation
            const size_t allocSize = allocations[(myId * 100 + j) % NUM_PRE_ALLOCS ];

            char* pnt = NULL;
            if (USE_NEW) {
                pnt = new char[allocSize];
            } else {
                pnt = (char*) malloc(allocSize);
            }
            pointers.push_back(std::make_pair(pnt, allocSize));

            totalAllocBytes += allocSize;
            currAllocBytes  += allocSize;

            // fill with values to add "delay"
            for (int fill = 0; fill < (int) allocSize; ++fill) {
                pnt[fill] = (char)(j % 255);
            }


            if (DEBUG_ALLOCS) {
                std::cout << "Id " << myId << " New alloc " << pointers.size() << ", bytes:" << allocSize << " at " << (uint64_t) pnt << "\n";
            }

            // free all or just a bit
            if (((j % 5) == 0) || (j == (NUM_ALLOCATIONS - 1))) {
                int frees = 0;

                // keep this much allocated
                // last check, free all
                uint64_t memLimit = MEM_LIMIT;
                if (j == NUM_ALLOCATIONS - 1) {
                    std::cout << "Id " << myId << " about to release all memory: " << (currAllocBytes / (double)(1024 * 1024)) << " MB" << std::endl;
                    memLimit = 0;
                }
                //MEM_LIMIT = 0; // DEBUG

                while (pointers.size() > 0 && (currAllocBytes > memLimit)) {
                    // free one of the first entries to allow previously obtained resources to 'live' longer
                    currAllocBytes -= pointers.front().second;
                    char* pnt       = pointers.front().first;

                    // free memory
                    if (USE_NEW) {
                        delete[] pnt;
                    } else {
                        free(pnt);
                    }

                    // update array
                    pointers.pop_front();

                    if (DEBUG_ALLOCS) {
                        std::cout << "Id " << myId << " Free'd " << pointers.size() << " at " << (uint64_t) pnt << "\n";
                    }
                    frees++;
                }
                if (DEBUG_ALLOCS) {
                    std::cout << "Frees " << frees << ", " << currAllocBytes << "/" << MEM_LIMIT << ", " << totalAllocBytes << "\n";
                }
            }
        } // for each allocation

        if (currAllocBytes != 0) {
            std::cerr << "Not all free'd!\n";
        }

        std::cout << "Id " << myId << " done, total alloc'ed " << ((double) totalAllocBytes / (double)(1024 * 1024)) << "MB \n";
    } // for each iteration

    exit(1);
}

int main(int argc, char** argv)
{
    runParallelAllocTest();

    return 0;
}

Test System

From what I have seen so far, the hardware matters a lot. The test may need adjustments if run on a faster machine.

Intel(R) Core(TM)2 Duo CPU     T7300  @ 2.00GHz
Ubuntu 10.04 LTS 64 bit
gcc 4.3, 4.4, 4.6
3988.62 Bogomips

Test

Once you have executed the makefile, you should get a binary named ompmemtest. To query the memory usage over time, I used the following commands:

./ompmemtest &
top -b | grep ompmemtest
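
As an alternative to grepping top from the outside, the test program could also report its own resident set size. Here is a minimal Linux-only sketch; the helper name residentSetKb is invented for illustration and is not part of the original program:

#include <cstdlib>
#include <fstream>
#include <string>

// Linux-specific: read the VmRSS line from /proc/self/status and return
// the resident set size in kB, or -1 if it cannot be determined.
static long residentSetKb()
{
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.compare(0, 6, "VmRSS:") == 0)
            return std::atol(line.c_str() + 6);   // line looks like "VmRSS:   1234 kB"
    }
    return -1;
}

Calling this every few hundred allocations inside the test loop would log the growth from inside the process without an external top.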

The top output below shows quite an impressive fragmentation or leaking behaviour. The expected memory consumption with 4 threads is 1090 MB, which turns into 1500 MB over time:

PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11626 byron     20   0  204m  99m 1000 R   27  2.5   0:00.81 ompmemtest                                                                              
11626 byron     20   0  992m 832m 1004 R  195 21.0   0:06.69 ompmemtest                                                                              
11626 byron     20   0 1118m 1.0g 1004 R  189 26.1   0:12.40 ompmemtest                                                                              
11626 byron     20   0 1218m 1.0g 1004 R  190 27.1   0:18.13 ompmemtest                                                                              
11626 byron     20   0 1282m 1.1g 1004 R  195 29.6   0:24.06 ompmemtest                                                                              
11626 byron     20   0 1471m 1.3g 1004 R  195 33.5   0:29.96 ompmemtest                                                                              
11626 byron     20   0 1469m 1.3g 1004 R  194 33.5   0:35.85 ompmemtest                                                                              
11626 byron     20   0 1469m 1.3g 1004 R  195 33.6   0:41.75 ompmemtest                                                                              
11626 byron     20   0 1636m 1.5g 1004 R  194 37.8   0:47.62 ompmemtest                                                                              
11626 byron     20   0 1660m 1.5g 1004 R  195 38.0   0:53.54 ompmemtest                                                                              
11626 byron     20   0 1669m 1.5g 1004 R  195 38.2   0:59.45 ompmemtest                                                                              
11626 byron     20   0 1664m 1.5g 1004 R  194 38.1   1:05.32 ompmemtest                                                                              
11626 byron     20   0 1724m 1.5g 1004 R  195 40.0   1:11.21 ompmemtest                                                                              
11626 byron     20   0 1724m 1.6g 1140 S  193 40.1   1:17.07 ompmemtest

Please note: I could reproduce this issue when compiling with gcc 4.3, 4.4 and 4.6 (trunk).

3 Answers

Okay, taking the bait.

This is on a system with:

Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz
4x5666.59 bogomips

Linux meerkat 2.6.35-28-generic-pae #50-Ubuntu SMP Fri Mar 18 20:43:15 UTC 2011 i686 GNU/Linux

gcc version 4.4.5

             total       used       free     shared    buffers     cached
Mem:       8127172    4220560    3906612          0     374328    2748796
-/+ buffers/cache:    1097436    7029736
Swap:            0          0          0

Naive run

I just ran it:

time ./ompmemtest 
Id 0 about to release all memory: 258.144 MB
Id 0 done, total alloc'ed -1572.7MB 
Id 3 about to release all memory: 257.854 MB
Id 3 done, total alloc'ed -1569.6MB 
Id 1 about to release all memory: 257.339 MB
Id 2 about to release all memory: 257.043 MB
Id 1 done, total alloc'ed -1570.42MB 
Id 2 done, total alloc'ed -1569.96MB 

real    0m13.429s
user    0m44.619s
sys 0m6.000s

Nothing spectacular. Here is the simultaneous output of vmstat -S M 1:

Raw vmstat data

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0   3892    364   2669    0    0    24     0  701 1487  2  1 97  0
 4  0      0   3421    364   2669    0    0     0     0 1317 1953 53  7 40  0
 4  0      0   2858    364   2669    0    0     0     0 2715 5030 79 16  5  0
 4  0      0   2861    364   2669    0    0     0     0 6164 12637 76 15  9  0
 4  0      0   2853    364   2669    0    0     0     0 4845 8617 77 13 10  0
 4  0      0   2848    364   2669    0    0     0     0 3782 7084 79 13  8  0
 5  0      0   2842    364   2669    0    0     0     0 3723 6120 81 12  7  0
 4  0      0   2835    364   2669    0    0     0     0 3477 4943 84  9  7  0
 4  0      0   2834    364   2669    0    0     0     0 3273 4950 81 10  9  0
 5  0      0   2828    364   2669    0    0     0     0 3226 4812 84 11  6  0
 4  0      0   2823    364   2669    0    0     0     0 3250 4889 83 10  7  0
 4  0      0   2826    364   2669    0    0     0     0 3023 4353 85 10  6  0
 4  0      0   2817    364   2669    0    0     0     0 3176 4284 83 10  7  0
 4  0      0   2823    364   2669    0    0     0     0 3008 4063 84 10  6  0
 0  0      0   3893    364   2669    0    0     0     0 4023 4228 64 10 26  0

Does that information mean anything to you?

Google Thread-Caching Malloc

Now for the real fun, add a little spice:

time LD_PRELOAD="/usr/lib/libtcmalloc.so" ./ompmemtest 
Id 1 about to release all memory: 257.339 MB
Id 1 done, total alloc'ed -1570.42MB 
Id 3 about to release all memory: 257.854 MB
Id 3 done, total alloc'ed -1569.6MB 
Id 2 about to release all memory: 257.043 MB
Id 2 done, total alloc'ed -1569.96MB 
Id 0 about to release all memory: 258.144 MB
Id 0 done, total alloc'ed -1572.7MB 

real    0m11.663s
user    0m44.255s
sys 0m1.028s

Looks faster, doesn't it?

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 4  0      0   3562    364   2684    0    0     0     0 1041 1676 28  7 64  0
 4  2      0   2806    364   2684    0    0     0   172 1641 1843 84 14  1  0
 4  0      0   2758    364   2685    0    0     0     0 1520 1009 98  2  1  0
 4  0      0   2747    364   2685    0    0     0     0 1504  859 98  2  0  0
 5  0      0   2745    364   2685    0    0     0     0 1575 1073 98  2  0  0
 5  0      0   2739    364   2685    0    0     0     0 1415  743 99  1  0  0
 4  0      0   2738    364   2685    0    0     0     0 1526  981 99  2  0  0
 4  0      0   2731    364   2685    0    0     0   684 1536  927 98  2  0  0
 4  0      0   2730    364   2685    0    0     0     0 1584 1010 99  1  0  0
 5  0      0   2730    364   2685    0    0     0     0 1461  917 99  2  0  0
 4  0      0   2729    364   2685    0    0     0     0 1561 1036 99  1  0  0
 4  0      0   2729    364   2685    0    0     0     0 1406  756 100  1  0  0
 0  0      0   3819    364   2685    0    0     0     4 1159 1476 26  3 71  0

In case you want to compare the vmstat outputs.

Valgrind --tool massif

Here is the head of the ms_print output after valgrind --tool=massif ./ompmemtest (default malloc):

--------------------------------------------------------------------------------
Command:            ./ompmemtest
Massif arguments:   (none)
ms_print arguments: massif.out.beforetcmalloc
--------------------------------------------------------------------------------


    GB
1.009^                                                                     :  
     |       ##::::@@:::::::@@::::::@@::::@@::@::::@::::@:::::::::@::::::@::: 
     |       # :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: 
     |       # :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: 
     |      :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: 
     |      :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: 
     |      :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |   ::::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |   : ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |   : ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |  :: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |  :: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
   0 +----------------------------------------------------------------------->Gi
     0                                                                   264.0

Number of snapshots: 63
 Detailed snapshots: [6 (peak), 10, 17, 23, 27, 30, 35, 39, 48, 56]

Google HEAPPROFILE

Unfortunately, vanilla valgrind is not compatible with tcmalloc, so I switched horses midrace to heap profiling with google-perftools:

gcc openMpMemtest_Linux.cpp -fopenmp -lgomp -lstdc++ -ltcmalloc -o ompmemtest

time HEAPPROFILE=/tmp/heapprofile ./ompmemtest
Starting tracking the heap
Dumping heap profile to /tmp/heapprofile.0001.heap (100 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0002.heap (200 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0003.heap (300 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0004.heap (400 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0005.heap (501 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0006.heap (601 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0007.heap (701 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0008.heap (801 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0009.heap (902 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0010.heap (1002 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0011.heap (2029 MB allocated cumulatively, 1031 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0012.heap (3053 MB allocated cumulatively, 1030 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0013.heap (4078 MB allocated cumulatively, 1031 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0014.heap (5102 MB allocated cumulatively, 1031 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0015.heap (6126 MB allocated cumulatively, 1033 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0016.heap (7151 MB allocated cumulatively, 1029 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0017.heap (8175 MB allocated cumulatively, 1029 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0018.heap (9199 MB allocated cumulatively, 1028 MB currently in use)
Id 0 about to release all memory: 258.144 MB
Id 0 done, total alloc'ed -1572.7MB 
Id 2 about to release all memory: 257.043 MB
Id 2 done, total alloc'ed -1569.96MB 
Id 3 about to release all memory: 257.854 MB
Id 3 done, total alloc'ed -1569.6MB 
Id 1 about to release all memory: 257.339 MB
Id 1 done, total alloc'ed -1570.42MB 
Dumping heap profile to /tmp/heapprofile.0019.heap (Exiting)

real    0m11.981s
user    0m44.455s
sys 0m1.124s

Contact me for the full logs/details.

Update

In response to the comments: I updated the program

--- omptest/openMpMemtest_Linux.cpp 2011-05-03 23:18:44.000000000 +0200
+++ q/openMpMemtest_Linux.cpp   2011-05-04 13:42:47.371726000 +0200
@@ -13,8 +13,8 @@
 void runParallelAllocTest()
 {
    // constants
-   const int  NUM_ALLOCATIONS = 5000; // alloc's per thread
-   const int  NUM_THREADS = 4;       // how many threads?
+   const int  NUM_ALLOCATIONS = 55000; // alloc's per thread
+   const int  NUM_THREADS = 8;        // how many threads?
    const int  NUM_ITERS = NUM_THREADS;// how many overall repetions

    const bool USE_NEW      = true;   // use new or malloc? , seems to make no difference (as it should)

It ran for well over 5 minutes. Near the end, an htop screenshot shows that the resident set is indeed slightly higher, going towards 2.3 GB:

  1  [||||||||||||||||||||||||||||||||||||||||||||||||||96.7%]     Tasks: 125 total, 2 running
  2  [||||||||||||||||||||||||||||||||||||||||||||||||||96.7%]     Load average: 8.09 5.24 2.37 
  3  [||||||||||||||||||||||||||||||||||||||||||||||||||97.4%]     Uptime: 01:54:22
  4  [||||||||||||||||||||||||||||||||||||||||||||||||||96.1%]
  Mem[|||||||||||||||||||||||||||||||             3055/7936MB]
  Swp[                                                  0/0MB]

  PID USER     NLWP PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
 4330 sehe        8  20   0 2635M 2286M   908 R 368. 28.8 15:35.01 ./ompmemtest

Comparing with the results of a tcmalloc run: 4m12s, and similar top stats with minor differences; the big difference is in the VIRT set (but that isn't particularly useful unless you have a very limited address space per process?). The RES set is quite similar, if you ask me. The more important thing to note is the increase in parallelism: all cores are now maxed out. This is obviously due to the reduced need for locking on heap operations when using tcmalloc:

If the free list is empty: (1) We fetch a bunch of objects from the central free list for this size-class (the central free list is shared by all threads). (2) Place them in the thread-local free list. (3) Return one of the newly fetched objects to the application.
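
To illustrate the scheme that quote describes (this is not tcmalloc's actual code; CentralFreeList and ThreadCache are invented names for a minimal sketch), a per-thread cache refills itself from a shared, locked central list in batches, so the common allocation path takes no lock at all:

#include <cstddef>
#include <mutex>
#include <vector>

// Toy model of one tcmalloc-style size class: each thread keeps a small
// local free list and only touches the shared (locked) central list when
// the local list runs dry. All names here are invented for illustration.
struct CentralFreeList {
    std::mutex lock;               // shared by all threads
    std::vector<void*> blocks;     // free blocks of this size class

    // Hand out up to n blocks in a single locked batch operation.
    std::size_t fetch(void** out, std::size_t n, std::size_t blockSize) {
        std::lock_guard<std::mutex> guard(lock);
        for (std::size_t i = 0; i < n; ++i) {
            if (blocks.empty()) {
                out[i] = ::operator new(blockSize);  // grow from the system
            } else {
                out[i] = blocks.back();
                blocks.pop_back();
            }
        }
        return n;
    }
};

struct ThreadCache {
    std::vector<void*> freeList;   // thread-local, accessed without locking

    void* allocate(CentralFreeList& central, std::size_t blockSize) {
        if (freeList.empty()) {
            void* batch[32];
            // (1) fetch a bunch of objects from the shared central free list
            std::size_t got = central.fetch(batch, 32, blockSize);
            // (2) place them in the thread-local free list
            freeList.assign(batch, batch + got);
        }
        // (3) return one of the objects to the application
        void* p = freeList.back();
        freeList.pop_back();
        return p;
    }

    void deallocate(void* p) { freeList.push_back(p); }  // stays thread-local
};

That lock-free common path is what matches the fully saturated cores in the htop output that follows.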

  1  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Tasks: 172 total, 2 running
  2  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Load average: 7.39 2.92 1.11 
  3  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Uptime: 11:12:25
  4  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
  Mem[||||||||||||||||||||||||||||||||||||||||||||              3278/7936MB]
  Swp[                                                                0/0MB]

  PID USER     NLWP PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
14391 sehe        8  20   0 2251M 2179M  1148 R 379. 27.5  8:08.92 ./ompmemtest
Answered 2011-05-03T22:06:23.027

When linking the test program against google's tcmalloc library, the executable not only runs about 10% faster, it also shows greatly reduced or insignificant memory fragmentation:

PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
13441 byron     20   0  379m 334m 1220 R  187  8.4   0:02.63 ompmemtestgoogle                                                                        
13441 byron     20   0 1085m 1.0g 1220 R  194 26.2   0:08.52 ompmemtestgoogle                                                                        
13441 byron     20   0 1111m 1.0g 1220 R  195 26.9   0:14.42 ompmemtestgoogle                                                                        
13441 byron     20   0 1131m 1.1g 1220 R  195 27.4   0:20.30 ompmemtestgoogle                                                                        
13441 byron     20   0 1137m 1.1g 1220 R  195 27.6   0:26.19 ompmemtestgoogle                                                                        
13441 byron     20   0 1137m 1.1g 1220 R  195 27.6   0:32.05 ompmemtestgoogle                                                                        
13441 byron     20   0 1149m 1.1g 1220 R  191 27.9   0:37.81 ompmemtestgoogle                                                                        
13441 byron     20   0 1149m 1.1g 1220 R  194 27.9   0:43.66 ompmemtestgoogle                                                                        
13441 byron     20   0 1161m 1.1g 1220 R  188 28.2   0:49.32 ompmemtestgoogle                                                                        
13441 byron     20   0 1161m 1.1g 1220 R  194 28.2   0:55.15 ompmemtestgoogle                                                                        
13441 byron     20   0 1161m 1.1g 1220 R  191 28.2   1:00.90 ompmemtestgoogle                                                                        
13441 byron     20   0 1161m 1.1g 1220 R  191 28.2   1:06.64 ompmemtestgoogle                                                                        
13441 byron     20   0 1161m 1.1g 1356 R  192 28.2   1:12.42 ompmemtestgoogle

From the data I have so far, the answer appears to be:

Multithreaded access to the heap can emphasize fragmentation if the heap library in use does not handle concurrent access well and if the processor cannot execute the threads truly concurrently.

The tcmalloc library shows no significant memory fragmentation when running the same program that previously lost about 400 MB to fragmentation.

But why does that happen in the first place?

The best idea I have to offer here is some kind of locking artifact within the heap.

The test program allocates randomly sized memory blocks, freeing blocks allocated early in the run to stay within its memory limit. When one thread is in the middle of releasing memory that resides in a heap block "on the left", it may be halted because another thread is scheduled to run, leaving a (soft) lock on that heap block. The newly scheduled thread wants to allocate memory, but may not even read that heap block on the "left" side to check for free memory, because it is currently being changed. Hence it may end up unnecessarily using a new heap block "on the right".

This process could look like a heap-block shift, where the first blocks (on the left) remain only sparsely used and fragmented, forcing new blocks into use on the right.

Let me restate that this fragmentation problem only occurs for me when I use 4 or more threads on a dual-core system that can handle only about two threads truly concurrently. With only two threads, the (soft) locks on the heap are held briefly enough not to block the other thread that wants to allocate memory.
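
One way to put numbers behind this guess (I have not done this, so treat it as a suggestion rather than a result) would be to dump glibc's own per-arena statistics while the test runs: malloc_stats() from <malloc.h> prints, for each arena, the bytes obtained from the system next to the bytes actually in use, and a gap that keeps growing points at fragmentation/retention rather than a real leak. A minimal glibc-only sketch, with an invented helper name:

#include <malloc.h>   // glibc-specific: malloc_stats()
#include <cstdio>

// Dump glibc's per-arena heap statistics to stderr. Comparing "system bytes"
// with "in use bytes" per arena shows how much of the mapped heap is actually
// live application data. Not portable beyond glibc.
static void dumpHeapStats(const char* tag)
{
    std::fprintf(stderr, "---- heap stats: %s ----\n", tag);
    malloc_stats();
}

Calling this, for example, right after the "about to release all memory" message would show whether the freed bytes actually shrink the arenas or stay mapped inside them.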

Also, as a disclaimer, I did not check the actual code of the glibc heap implementation, nor am I anything more than a novice in the field of memory allocators; everything I wrote is just how it appears to me, which makes it pure speculation.

Another interesting read might be the tcmalloc documentation, which points out common problems with heaps and multithreaded access, some of which may have played a role in the test program as well.

It is worth noting that tcmalloc will never return memory to the system (see the caveats paragraph in the tcmalloc documentation).

Answered 2011-05-04T11:25:20.930

Yes, the default malloc (depending on the linux version) does some crazy things that fail massively in some multithreaded applications. Specifically, it keeps almost per-thread heaps (arenas) to avoid locking. This is much faster than a single heap for all threads, but massively memory-inefficient (sometimes). You can tune this with code like the following, which turns off the multiple arenas (this kills performance, so don't do it if you have lots of small allocations!):

rv = mallopt(-7, 1);  // M_ARENA_TEST
rv = mallopt(-8, 1);  // M_ARENA_MAX
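
For a self-contained version of that tweak, here is a sketch assuming your glibc's <malloc.h> exposes the M_ARENA_TEST / M_ARENA_MAX names (they map to the raw -7 / -8 values used above); the calls must happen before any threads start allocating:

#include <malloc.h>   // glibc-specific: mallopt(), M_ARENA_TEST, M_ARENA_MAX
#include <cstdio>

int main()
{
    // Force glibc malloc down to a single arena so all threads share one heap.
    // Trades allocation speed (more lock contention) for less memory waste.
    if (mallopt(M_ARENA_TEST, 1) == 0 || mallopt(M_ARENA_MAX, 1) == 0)
        std::fprintf(stderr, "mallopt failed\n");

    // ... run the multithreaded workload (e.g. runParallelAllocTest()) here ...
    return 0;
}

On glibc versions that support it, setting the environment variable MALLOC_ARENA_MAX=1 achieves the same effect without recompiling.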

Or, as others suggested, use one of the various replacements for malloc.

Basically, it is impossible for a general-purpose malloc to always be efficient, since it doesn't know how it will be used.

Chris P.

Answered 2018-02-07T22:08:29.487