TL;DR
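If multiple queries arrive while another query is already downloading the dataset they need, will Dask try to download the dataset several times? Or will it recognize that the download is "in flight" and automatically wait for it to finish?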

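Background
If I have a freshly booted worker (with no datasets loaded into memory yet) and my function requests a dataset, the dataset is downloaded onto the worker on demand. A simple scenario: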

(1) Worker boots
(2) Receives query which needs a dataset
(3) Downloads dataset (takes X seconds)
(4) Executes query

However, what if I have the following situation:

(1) Worker boots
(2) Receives query which needs a dataset
(3) Downloads dataset (takes X seconds)
(4) Receives query which needs the same dataset which is currently downloading - will it download it again or detect in-flight?
(5) Receives another query which needs the same dataset which is currently downloading - will it download it again or detect in-flight?
(6) Execute queries

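Will Dask try to download the dataset multiple times, or will it recognize that the download is "in flight" and automatically wait for it to finish?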

I have read through the source code, but dataset publishing/listing is still a black box to me.

1 Answer

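Each call to client.get_dataset is independent, so multiple concurrent requests will result in redundant work. That said, you shouldn't store anything other than metadata in a dataset (such as dask collections that point to remote futures), so if used correctly this download should only take a few milliseconds.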
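
If a load really is expensive and you want concurrent callers within one process to share a single in-flight download, you would have to add that guard yourself. The following is a minimal sketch using only the Python standard library; it is not part of Dask, and the names InFlightCache and slow_download are hypothetical stand-ins for your own cache and loading function.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor


class InFlightCache:
    """Hypothetical helper: run each keyed load once and share the result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._futures = {}  # key -> Future that holds (or will hold) the value

    def get(self, key, loader):
        # Decide under the lock whether this caller starts the load
        # or waits on a load that another caller already started.
        with self._lock:
            fut = self._futures.get(key)
            owner = fut is None
            if owner:
                fut = Future()
                self._futures[key] = fut
        if owner:
            try:
                fut.set_result(loader(key))
            except Exception as exc:
                with self._lock:
                    self._futures.pop(key, None)  # let a later call retry
                fut.set_exception(exc)
        # Waiters block here until the owner finishes; the owner returns at once.
        return fut.result()


download_log = []  # records every real download, for illustration only


def slow_download(name):
    # Stand-in for an expensive download (assumption: takes noticeable time).
    download_log.append(name)
    time.sleep(0.2)
    return f"contents of {name}"


cache = InFlightCache()
with ThreadPoolExecutor(max_workers=3) as pool:
    # Three "queries" ask for the same dataset at roughly the same time.
    results = list(
        pool.map(lambda _: cache.get("my-dataset", slow_download), range(3))
    )
```

In this sketch the first caller performs the download while the other two wait on the same Future, so slow_download runs only once. The guard only covers threads inside a single process; it does nothing to coordinate separate worker processes or machines.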

Answered 2017-10-03T11:58:56.597