
I have a 2GB CSV file that I read into a pyarrow table with the following:

from pyarrow import csv

tbl = csv.read_csv(path)

When I call tbl.nbytes I get 3.4GB. I was surprised at how much larger the CSV is in Arrow memory than on disk. Maybe I have a fundamental misunderstanding of what pyarrow is doing under the hood, but I thought that, if anything, it would be smaller due to its columnar nature (I also probably could have squeezed out more gains using ConvertOptions, but I wanted a baseline). I definitely wasn't expecting an increase of almost 75%. Also, when I convert the Arrow table to a pandas DataFrame, the DataFrame took up roughly the same amount of memory as the CSV, which was expected.

Can anyone help explain the difference in memory for Arrow tables compared to a CSV / pandas DataFrame?

Thanks.

UPDATE

Full code and output below.

In [2]: csv.read_csv(r"C:\Users\matth\OneDrive\Data\Kaggle\sf-bay-area-bike-shar
   ...: e\status.csv")
Out[2]:
pyarrow.Table
station_id: int64
bikes_available: int64
docks_available: int64
time: string

In [3]: tbl = csv.read_csv(r"C:\Users\generic\OneDrive\Data\Kaggle\sf-bay-area-bik
   ...: e-share\status.csv")

In [4]: tbl.schema
Out[4]:
station_id: int64
bikes_available: int64
docks_available: int64
time: string

In [5]: tbl.nbytes
Out[5]: 3419272022

In [6]: tbl.to_pandas().info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71984434 entries, 0 to 71984433
Data columns (total 4 columns):
 #   Column           Dtype
---  ------           -----
 0   station_id       int64
 1   bikes_available  int64
 2   docks_available  int64
 3   time             object
dtypes: int64(3), object(1)
memory usage: 2.1+ GB

1 Answer


There are two problems here:

  1. The integer columns use int64, but int32 would be a better fit (unless the values are very large)
  2. The time column is interpreted as a string. It doesn't help that the input format doesn't follow any standard (%Y/%m/%d %H:%M:%S)

The first problem is easy to fix with ConvertOptions:
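To see where the 3.4GB comes from before any conversion, here is a rough back-of-the-envelope estimate (assuming 19-character %Y/%m/%d %H:%M:%S timestamps and the row count from the .info() output above; the validity-bitmap term assumes pyarrow allocated a null bitmap per column):

```python
# Per-row estimate for the schema as read: three int64 columns plus one string column.
rows = 71_984_434

int_cols = 3 * 8 * rows        # three int64 columns, 8 bytes per value
str_data = 19 * rows           # UTF-8 bytes of the 19-character time strings
str_offsets = 4 * (rows + 1)   # one int32 offset per value in the string array
validity = 4 * rows // 8       # one validity bit per value, across 4 columns

estimate = int_cols + str_data + str_offsets + validity
print(estimate)  # close to the reported nbytes of 3419272022
```

The string column alone accounts for roughly half the total: 19 bytes of character data plus a 4-byte offset per row, versus the ~19-byte rows of the CSV itself, which explains why the in-memory table is larger than the file.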

import pyarrow as pa
from pyarrow import csv

tbl = csv.read_csv(
    <path>,
    convert_options=csv.ConvertOptions(
        column_types={
            'station_id': pa.int32(),
            'bikes_available': pa.int32(),
            'docks_available': pa.int32(),
            'time': pa.string()
        }))

The second is a bit trickier because, as far as I know, the read_csv API doesn't let you supply a format for the time column, and there is no simple way to convert a string column to a datetime in pyarrow. So you have to go through pandas instead:

import pandas as pd
import pyarrow as pa

# Parse the strings with pandas, then rebuild the table with a real datetime column
series = tbl.column('time').to_pandas()
series_as_datetime = pd.to_datetime(series, format='%Y/%m/%d %H:%M:%S')
tbl2 = pa.table(
    {
        'station_id': tbl.column('station_id'),
        'bikes_available': tbl.column('bikes_available'),
        'docks_available': tbl.column('docks_available'),
        'time': pa.chunked_array([series_as_datetime])
    })
tbl2.nbytes
>>> 1475683759

1475683759 is the number you should expect, and you can't do much better than that. Each row is 20 bytes (4 + 4 + 4 + 8).
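The 20-bytes-per-row figure checks out against the reported nbytes; the small remainder is plausibly validity bitmaps (one bit per value per column) plus buffer padding:

```python
rows = 71_984_434
per_row = 20 * rows        # 4 + 4 + 4 + 8 bytes per row
validity = 4 * rows // 8   # one validity bit per value, across 4 columns
print(per_row + validity)  # within a few KB of the reported 1475683759
```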

answered 2020-04-03T10:39:46.597