
I'm trying to find the right syntax for using a delayed Dask for loop. I've found several tutorials and other questions, but none of them fit my case, which is very basic.

First of all, is this the correct way to run a for loop in parallel?

%%time

from dask import delayed

list_names=['a','b','c','d']
keep_return=[]

@delayed
def loop_dummy(target):
    # spin in pure Python to simulate CPU-bound work
    for i in range(1000000000):
        pass
    print('passed value is:'+target)
    return 1


for i in list_names:
    c=loop_dummy(i)
    keep_return.append(c)


total = delayed(sum)(keep_return)
total.compute()

This produced

passed value is:a
passed value is:b
passed value is:c
passed value is:d
Wall time: 1min 53s

If I run it serially,

%%time

list_names=['a','b','c','d']
keep_return=[]


def loop_dummy(target):
    for i in range(1000000000):
        pass
    print('passed value is:'+target)
    return 1


for i in list_names:
    c=loop_dummy(i)
    keep_return.append(c)

it is actually faster.

passed value is:a
passed value is:b
passed value is:c
passed value is:d
Wall time: 1min 49s

I have seen a few examples stating that Dask has a small amount of overhead, but this seems to run long enough to justify that, doesn't it?

My actual for loop involves heavier computation, where I build models for various targets.


1 Answer


This computation

for i in range(...):
    pass

is bound by the global interpreter lock (GIL). You will want to use the multiprocessing or dask.distributed Dask backends rather than the default threading backend. I recommend the following:

total.compute(scheduler='multiprocessing')
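
If you prefer the dask.distributed backend mentioned above, a minimal sketch (assuming dask.distributed is installed) is to start a local Client, which registers itself as the default scheduler for subsequent compute() calls:

from dask.distributed import Client

client = Client()   # starts a local cluster of worker processes
total.compute()     # now runs on those workers instead of the default threads
client.close()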

However, if your actual computation is mostly Numpy/Pandas/Scikit-Learn/Other numeric package code, then the default threading backend is probably the right choice.

More information about choosing between schedulers is available here: http://dask.pydata.org/en/latest/scheduling.html
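
As a further sketch, the scheduler can also be chosen globally with dask.config.set rather than per compute() call (the choice of 'processes' here is just for illustration):

import dask

dask.config.set(scheduler='processes')   # default for every subsequent compute()
total.compute()

with dask.config.set(scheduler='threads'):   # or scope the choice to a single block
    total.compute()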

answered 2018-06-30T00:17:57.543