
I am trying to write a Dask dataframe to Parquet on HDFS using the pyarrow engine of the to_parquet API.
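For context, the call pattern is essentially the following minimal sketch (the scheduler address, toy dataframe, and output path are placeholders, not taken from my actual code):

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

# connect to the existing dask-distributed cluster (placeholder address)
client = Client("tcp://scheduler:8786")

# hypothetical toy dataframe; the real data comes from elsewhere
pdf = pd.DataFrame({"x": range(10), "y": range(10)})
dask_df = dd.from_pandas(pdf, npartitions=2)

# write to HDFS with the pyarrow engine -- this is the call that raises below
parquet_path = "hdfs:///tmp/output.parquet"
dask_df.to_parquet(parquet_path, engine="pyarrow")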

The write fails with the following exception:

dask_df.to_parquet(parquet_path, engine=engine)
  File "/ebs/d1/agent/miniconda3/envs/dask-distributed/lib/python3.6/site-packages/dask/dataframe/core.py", line 985, in to_parquet
    return to_parquet(self, path, *args, **kwargs)
  File "/ebs/d1/agent/miniconda3/envs/dask-distributed/lib/python3.6/site-packages/dask/dataframe/io/parquet.py", line 618, in to_parquet
    out.compute()
  File "/ebs/d1/agent/miniconda3/envs/dask-distributed/lib/python3.6/site-packages/dask/base.py", line 135, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/ebs/d1/agent/miniconda3/envs/dask-distributed/lib/python3.6/site-packages/dask/base.py", line 333, in compute
    results = get(dsk, keys, **kwargs)
  File "/ebs/d1/agent/miniconda3/envs/dask-distributed/lib/python3.6/site-packages/distributed/client.py", line 1999, in get
    results = self.gather(packed, asynchronous=asynchronous)
  File "/ebs/d1/agent/miniconda3/envs/dask-distributed/lib/python3.6/site-packages/distributed/client.py", line 1437, in gather
    asynchronous=asynchronous)
  File "/ebs/d1/agent/miniconda3/envs/dask-distributed/lib/python3.6/site-packages/distributed/client.py", line 592, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/ebs/d1/agent/miniconda3/envs/dask-distributed/lib/python3.6/site-packages/distributed/utils.py", line 254, in sync
    six.reraise(*error[0])
  File "/ebs/d1/agent/miniconda3/envs/dask-distributed/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/ebs/d1/agent/miniconda3/envs/dask-distributed/lib/python3.6/site-packages/distributed/utils.py", line 238, in f
    result[0] = yield make_coro()
  File "/ebs/d1/agent/miniconda3/envs/dask-distributed/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/ebs/d1/agent/miniconda3/envs/dask-distributed/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/ebs/d1/agent/miniconda3/envs/dask-distributed/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/ebs/d1/agent/miniconda3/envs/dask-distributed/lib/python3.6/site-packages/distributed/client.py", line 1315, in _gather
    traceback)
  File "/ebs/d1/agent/miniconda3/envs/dask-distributed/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/ebs/d1/agent/miniconda3/envs/dask-distributed/lib/python3.6/site-packages/dask/dataframe/io/parquet.py", line 410, in _write_partition_pyarrow
    import pyarrow as pa
  File "/ebs/d1/agent/miniconda3/envs/dask-distributed/lib/python3.6/site-packages/pyarrow/__init__.py", line 113, in <module>
    import pyarrow.hdfs as hdfs
AttributeError: module 'pyarrow' has no attribute 'hdfs'

pyarrow version: 0.8.0, distributed version: 1.20.2

However, when I import the package in a Python console, it works without any error:

import pyarrow.hdfs as hdfs


1 Answer


That error is coming from one of your workers. Perhaps your client machine and your workers have different environments?
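One way to check that (a sketch of my own, with a placeholder scheduler address) is to compare the environments seen by the client and the workers:

from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# compare library versions across the scheduler, the workers and this client;
# check=True raises an error when they do not match
print(client.get_versions(check=True))

# inspect pyarrow directly on every worker
def pyarrow_info():
    import pyarrow
    return pyarrow.__version__, pyarrow.__file__

print(client.run(pyarrow_info))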

answered 2018-03-22T12:21:46.707