python-3.x - Modin df iterrows is very slow. Is there any way to speed it up?

Question

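I have a modin dataframe with about 120k rows, and I want to coalesce some of its columns. Modin df iterrows takes a long time, so I tried numpy.where instead. numpy.where finishes in 5-10 minutes on the equivalent pandas df, but the same operation on the modin df takes ~30 minutes. Is there any alternative that speeds up this task for modin dataframes?

[cols_to_be_coalesced] --> this list contains the columns to be coalesced. It has 10-15 columns.

Code: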

# Assumes each column to coalesce has a fallback column named "<column>_X".
for col in cols_to_be_coalesced:
    df[col] = np.where(df[col] != '', df[col], df[col + '_X'])

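If df is a pandas dataframe, this executes in ~10 minutes, but if it is a modin dataframe it takes ~30 minutes. So, is there any equivalent of numpy.where for modin dataframes that speeds up this operation?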

score 0 · Accepted Answer

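I think your np.where is slow because np.where converts the Modin dataframe to a numpy array, and converting a Modin dataframe to numpy is slow. Is this version, which uses pandas.Series.where (not a Modin where implementation, since that hasn't been added yet), faster for you?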

# Assumes each column to coalesce has a fallback column named "<column>_X".
for col in cols_to_be_coalesced:
    df[col] = df[col].where(df[col] != '', df[col + '_X'])
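As a self-contained illustration of the loop above, here is a minimal sketch on a toy frame. It uses plain pandas so it runs anywhere; the assumption is that swapping the import for `import modin.pandas as pd` gives the Modin equivalent. The helper name `coalesce_columns`, the `suffix` parameter, and the column names are hypothetical, not from the question.

```python
import pandas as pd  # assumption: swap for `import modin.pandas as pd` to try it on Modin

def coalesce_columns(df, cols, suffix="_X"):
    """Hypothetical helper: replace empty strings in each column
    with the value from its `<col><suffix>` fallback column."""
    for col in cols:
        # Series.where keeps values where the condition is True
        # and takes the fallback column's value everywhere else.
        df[col] = df[col].where(df[col] != "", df[col + suffix])
    return df

# Toy data: "a" and "b" have gaps (empty strings) filled from "a_X" and "b_X".
df = pd.DataFrame({
    "a": ["x", "", "z"],
    "a_X": ["1", "2", "3"],
    "b": ["", "q", ""],
    "b_X": ["7", "8", "9"],
})
result = coalesce_columns(df, ["a", "b"])
print(result[["a", "b"]])
```

The helper mutates and returns the same frame, matching the in-place assignment style of the loop in the answer.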

I found that this method takes 1.58 seconds, whereas the original method takes 70 seconds in this example:

import modin.pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 100, size=(2**20, 2**8))).add_prefix("col")
# setting column with np.where takes 70 seconds
df['col1'] = np.where(df['col1'] % 2 == 0, df['col1'], df['col2'])
# setting column with pandas.Series.where takes 1.58 seconds
df['col1'] = df['col1'].where(df['col1'] % 2 == 0, df['col2'])
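Before swapping one form for the other, it may be worth confirming that both produce the same values. The sketch below does that on a small frame with plain pandas (an assumption made for portability; it says nothing about Modin timings). The names `small`, `via_numpy`, and `via_where` are illustrative only.

```python
import numpy as np
import pandas as pd  # plain pandas here; the check concerns values, not speed

# Small reproducible frame with columns col0..col3.
rng = np.random.default_rng(0)
small = pd.DataFrame(rng.integers(0, 100, size=(1000, 4))).add_prefix("col")

# Keep col1 where it is even, otherwise take col2, computed two ways.
via_numpy = np.where(small["col1"] % 2 == 0, small["col1"], small["col2"])
via_where = small["col1"].where(small["col1"] % 2 == 0, small["col2"])

# np.where returns a numpy array; Series.where returns a Series.
same = np.array_equal(via_numpy, via_where.to_numpy())
print(same)
```

The practical difference is the return type: np.where hands back a numpy array, while Series.where stays inside the dataframe library, which is the conversion the answer identifies as the slow step.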
