python - Can I .set_index() lazily (or concurrently) on Dask DataFrames?

Question

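tl;dr:

Is it possible to run .set_index() on several Dask DataFrames concurrently, in parallel? Alternatively, is it possible to apply .set_index() lazily to several Dask DataFrames, so that the indexes end up being set concurrently, in parallel?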

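Here is the scenario:

I have several time series.
Each time series is stored as several .csv files. Each file contains the data for one specific day. The files are also spread across different folders (each folder holds one month of data).
Each time series has a different sampling rate.
All time series have the same columns. All of them have a column that contains the DateTime, among others.
The data is too large to be processed in memory. That is why I am using Dask.
I want to merge all the time series into a single DataFrame, aligned by DateTime. To do that, I first need to resample() each time series to a common sampling rate, and then .join() all of them.
.resample() can only be applied to an index. So before resampling, I need to .set_index() on the DateTime column of each time series.
When I call .set_index() on one time series, computation starts immediately, which blocks my code while it waits. If I check my machine's resource usage at that point, I can see that many cores are in use, but usage never goes above ~15%. This makes me think that, ideally, I could apply .set_index() to more than one time series at the same time.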

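Having reached this point, I tried some inelegant ways to parallelize the .set_index() calls across multiple time series (e.g. creating a multiprocessing.Pool), with no success. Before giving more details, is there a clean way to handle the situation above? Was this scenario considered when Dask was implemented?

Alternatively, can .set_index() be lazy? If it could be applied lazily, I would build the full computation graph with the steps described above and, at the end, everything would be computed concurrently, in parallel (I think).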

score 1 · Accepted Answer

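Dask.dataframe needs to know the min and max values of all of the partitions of the dataframe in order to sensibly do datetime operations in parallel. By default it will read the data once in order to find good partitions. If the data is not sorted, it will then do a shuffle (potentially very expensive) to sort it.

In your case it sounds like your data is already sorted, and that you might be able to provide these values explicitly. You should look at the last example of the dd.DataFrame.set_index docstring: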

    A common case is when we have a datetime column that we know to be
    sorted and is cleanly divided by day.  We can set this index for free
    by specifying both that the column is pre-sorted and the particular
    divisions along which is is separated

    >>> import pandas as pd
    >>> divisions = pd.date_range('2000', '2010', freq='1D')
    >>> df2 = df.set_index('timestamp', sorted=True, divisions=divisions)  # doctest: +SKIP
