python - Python 2.6 vs 2.7 multithreading performance issue (futex)

Question

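I have a simple Monte-Carlo Pi computation program. I tried running it on 2 different boxes (same hardware, slightly different kernel versions). On one of them I see a significant performance drop (about a factor of two). Without threads, performance is mostly the same. Profiling the two runs shows that the slower one spends more time per futex call (8 µs versus 5 µs in the strace summaries below).

Is this related to any kernel parameters?
Do CPU flags affect futex performance? /proc/cpuinfo shows slightly different CPU flags.
Is this related to the Python version?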

Linux(3.10.0-123.20.1 (Red Hat 4.4.7-16)) Python 2.6.6

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
99.69   53.229549           5  10792796   5385605 futex

Profile Output
============== 
256 function calls in 26.189 CPU seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   39   26.186    0.671   26.186    0.671 :0(acquire)

Linux(3.10.0-514.26.2 (Red Hat 4.8.5-11)) Python 2.7.5

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.69   94.281979           8  11620358   5646413 futex

Profile Output
==============
259 function calls in 53.448 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 38   53.445    1.406   53.445    1.406 :0(acquire)

Test program

import random
import math
import time
import threading
import sys
import profile

def find_pi(tid, n):
    t0 = time.time()
    in_circle = 0
    for i in range(n):
        x = random.random()
        y = random.random()

        dist = math.sqrt(pow(x, 2) + pow(y, 2))
        if dist < 1:
            in_circle += 1

    pi = 4.0 * (float(in_circle)/float(n))
    print 'Pi=%s - thread(%s) time=%.3f sec' % (pi, tid, time.time() - t0)
    return pi

def main():
    if len(sys.argv) > 1:
        n = int(sys.argv[1])
    else:
        n = 6000000

    t0 = time.time()
    threads = []
    num_threads = 5
    print 'n =', n
    for tid in range(num_threads):
        t = threading.Thread(target=find_pi, args=(tid, n))
        threads.append(t)
        t.start()

    for t in threads:
        t.join()

#main()
profile.run('main()')
#profile.run('find_pi(1, 6000000)')
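
The comparison described above (with threads versus without) can be reproduced with a small harness like the one below. It is only a sketch for Python 3, not the original Python 2 script: count_in_circle, run_sequential and run_threaded are names made up for this illustration, and the iteration count is kept small so that it finishes quickly.

```python
import random
import threading
import time


def count_in_circle(n):
    """Count random points in the unit square that fall inside the unit circle."""
    hits = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        if x * x + y * y < 1.0:
            hits += 1
    return hits


def run_sequential(num_tasks, n):
    """Run the workload num_tasks times in the calling thread; return elapsed seconds."""
    t0 = time.time()
    for _ in range(num_tasks):
        count_in_circle(n)
    return time.time() - t0


def run_threaded(num_tasks, n):
    """Run the workload in num_tasks threads at once; return elapsed seconds."""
    t0 = time.time()
    threads = [threading.Thread(target=count_in_circle, args=(n,))
               for _ in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - t0


if __name__ == "__main__":
    n = 200000  # assumed small value for a quick check; the question uses 6000000
    print("sequential: %.3f s" % run_sequential(5, n))
    print("threaded:   %.3f s" % run_threaded(5, n))
```

Running this on both machines separates the cost of the computation itself from the cost added by thread switching.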

score 0 · Accepted Answer

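I'm not familiar with kernel parameters or CPU flags, so I can't tell you whether they affect the result.

So this doesn't answer all of your questions, but out of curiosity I tested your code on CentOS 7.4.1708 (Linux 3.10.0-693.2.2.el7.x86_64 x86_64) with different Python versions (2.6.6, 2.7.5, 3.6.3).

Python version 2.6.6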

Profile Output
==============
256 function calls in 19.838 CPU seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    39   19.019    0.488   19.019    0.488 :0(acquire)
    18    0.000    0.000    0.000    0.000 :0(allocate_lock)
    13    0.000    0.000    0.000    0.000 :0(append)
...

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.98    6.319220          55    114693      2293 futex
  1.03    0.068830           1     55485           madvise
  0.10    0.006869          95        72           munmap
...

Python version 2.7.5

Profile Output
==============
247 function calls in 23.293 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    34   22.717    0.668   22.717    0.668 :0(acquire)
    18    0.047    0.003    0.047    0.003 :0(allocate_lock)
    13    0.000    0.000    0.000    0.000 :0(append)
...

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.54    7.360687         196     37613       667 futex
  0.04    0.002798           4       629       492 open
  0.01    0.000918           4       235       203 stat
...

Python version 3.6.3

Profile Output
==============
213 function calls in 17.818 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5    0.000    0.000    0.000    0.000 :0(__enter__)
     5    0.000    0.000    0.000    0.000 :0(__exit__)
    25   15.923    0.637   15.923    0.637 :0(acquire)
...

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 83.71    0.032639         244       134        38 futex 
  1.90    0.000742           1       849           clock_gettime
  1.74    0.000680           4       160           mmap
...

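After several runs I got almost identical results each time, so I picked one run at random. Python 2.6.6 is slightly faster than 2.7.5, and 3.6.3 is slightly faster than 2.6.6.

The strace results for 2.6.6 and 2.7.5 are almost the same, but the result for 3.6.3 is very different.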
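
One likely factor behind that difference, offered as background rather than something measured here: Python 3.2 replaced the tick-based GIL handoff of Python 2 (governed by sys.getcheckinterval(), counted in bytecode instructions) with a time-based one (governed by sys.getswitchinterval(), in seconds). The sketch below only reports which of the two settings the running interpreter exposes; gil_switch_setting is a name made up for this illustration.

```python
import sys


def gil_switch_setting():
    """Return (description, value) for the interpreter's thread-switch setting."""
    if hasattr(sys, "getswitchinterval"):
        # Python 3.2 and later: time-based switching, value in seconds.
        return ("switch interval (seconds)", sys.getswitchinterval())
    # Python 2: switching is considered every N bytecode instructions.
    return ("check interval (bytecode ticks)", sys.getcheckinterval())


name, value = gil_switch_setting()
print("%s: %s" % (name, value))
```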

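So, regarding your questions:

Is this related to any kernel parameters?

Do CPU flags affect futex performance? /proc/cpuinfo shows slightly different CPU flags.

I don't know.

Is this related to the Python version?

Yes.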

score 0 · Accepted Answer

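It looks like this is most likely due to changes in the kernel code between those two versions. There was a bug in the kernel's futex code that caused some processes to deadlock, and the fix for that bug may have caused this performance drop. The changelog for 3.10.0-514 (for CentOS) mentions changes to [kernel] futex.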

score -1 · Accepted Answer

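I don't think you can get a strict answer.

Futex is a kernel facility. See the futex man page.

tl;dr: for example, threads are scheduled by the kernel, and there is something called priority inversion, where a high-priority thread is blocked by a low-priority one. So the observed drop may be due to kernel flags. Another point is getting the time: the program has to enter the kernel to read the real-time clock.

On the other hand, you only start a few threads (five in the test program), so this should not be the problem. Your threads don't interfere with each other, so it shouldn't be a locking problem. I do see the acquire calls, but the time spent in them suggests they come from waiting for the threads in join() at the end.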
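
To illustrate the point about join(): the profile module only instruments the thread it is started in, so functions that run in other threads are not listed by name, and what the main thread records is mostly its waiting for them. A minimal sketch, written for Python 3 (the question uses Python 2), with a hypothetical worker() standing in for find_pi:

```python
import profile
import pstats
import threading
import time


def worker():
    """Stand-in for find_pi: keeps the thread busy for a short while."""
    time.sleep(0.05)


def main():
    t = threading.Thread(target=worker)
    t.start()
    t.join()  # the main thread blocks here until worker() returns


profiler = profile.Profile()
profiler.runcall(main)
stats = pstats.Stats(profiler)

# Keys of stats.stats are (filename, line number, function name) tuples.
recorded = set(name for (_filename, _lineno, name) in stats.stats)
print("main profiled:  ", "main" in recorded)
print("worker profiled:", "worker" in recorded)
```

Because the worker never shows up in the listing, the per-function numbers in the question describe the main thread only.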

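Could you run the test, say, 50 times and provide statistics? That would take about an hour, but a single one-minute test can be affected by almost anything.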
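
A repeated measurement like that could be scripted roughly as follows. This is a sketch for Python 3 (it uses the statistics module); time_runs, summarize and workload are names made up for this illustration, and workload() is a placeholder to be replaced with the main() from the question.

```python
import statistics
import time


def time_runs(func, repeats):
    """Call func() `repeats` times; return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        t0 = time.time()
        func()
        durations.append(time.time() - t0)
    return durations


def summarize(durations):
    """Return basic statistics for a list of durations."""
    return {
        "runs": len(durations),
        "mean": statistics.mean(durations),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
        "min": min(durations),
        "max": max(durations),
    }


def workload():
    """Hypothetical placeholder; substitute the main() from the question."""
    sum(i * i for i in range(100000))


if __name__ == "__main__":
    summary = summarize(time_runs(workload, 50))
    for key in ("runs", "mean", "stdev", "min", "max"):
        print("%-5s %s" % (key, summary[key]))
```

Comparing the mean and standard deviation from both machines shows whether the gap is larger than the run-to-run noise.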

By the way, you are missing these imports:

import random
import math
