python - What is the PyTables equivalent of "SELECT max(column) FROM table"?

Question

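I have a table with a large number of numeric values. I know I can extract the column and call max() on it, but there may be a way to do this with an in-kernel method. I just can't seem to find it.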

score 6 · Accepted Answer

In the tests I ran, you can use the iterrows method instead of where:

In [117]: timeit max(row['timestamp'] for row in table.iterrows(stop=1000000))
1 loops, best of 3: 1 s per loop

In [118]: timeit max(row['timestamp'] for row in table.where('(timestamp<=Tf)'))
1 loops, best of 3: 2.21 s per loop

In [120]: timeit max(frames.cols.timestamp[:1000000])
1 loops, best of 3: 974 ms per loop

In [121]: timeit np.max(frames.cols.timestamp[:1000000])
1 loops, best of 3: 876 ms per loop

Note that Tf above is the 1,000,000th entry of that column (which is a Float64).

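Since the question does not ask for a comparison check, the where test can be left out. Note that the approach proposed in the question (loading the data as a numpy array) is still faster, although the difference is under 3% and gets smaller for larger datasets (I did not test beyond 10^7 rows). I got the best results with the numpy max function (see above).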

I would also be happy to learn of a more efficient way!
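As a rough sketch of the column-slice approach timed above, the column can also be read in blocks so that only one block is in memory at a time. The helper name `column_max`, the table name `frames`, the throwaway file, and the chunk size are all illustrative assumptions, not part of the original answer; the calls use the PyTables 3.x spelling.

```python
import os
import tempfile

import numpy as np
import tables


def column_max(table, colname, chunk_rows=1_000_000):
    """Return the max of one column, reading at most chunk_rows rows at a time.

    Hypothetical helper for illustration; returns None for an empty table.
    """
    best = None
    for start in range(0, table.nrows, chunk_rows):
        # read(field=...) returns a NumPy array holding just that column.
        block = table.read(start=start, stop=start + chunk_rows, field=colname)
        block_max = block.max()
        if best is None or block_max > best:
            best = block_max
    return best


# Build a small throwaway table so the sketch runs on its own.
rng = np.random.default_rng(0)
data = np.zeros(10_000, dtype=[("timestamp", "f8")])
data["timestamp"] = rng.random(10_000)

path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with tables.open_file(path, mode="w") as h5:
    table = h5.create_table("/", "frames", obj=data)
    table.flush()

    chunked_max = column_max(table, "timestamp", chunk_rows=3_000)
    # Same result as loading the whole column at once, as in the timings above.
    full_max = np.max(table.cols.timestamp[:])
```

For a column that fits comfortably in memory, the single `np.max(table.cols.timestamp[:])` call is simpler; the chunked loop only matters when it does not.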

score 3 · Accepted Answer

The fastest way I have found is to index the table on the column you are interested in:

table.cols.timestamp.createCSIndex()

Once the column is indexed, getting the maximum is almost instantaneous:

max_timestamp = table.cols.timestamp[table.colindexes['timestamp'][-1]]

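This first takes the last row index (the one corresponding to the largest timestamp) from the table's Index object for the timestamp column (table.colindexes['timestamp'][-1]), and then uses it to index into the corresponding column reference (table.cols.timestamp) to fetch the value in that row.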
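A self-contained sketch of the same lookup follows. It assumes a PyTables 3.x install, where the method is spelled `create_csindex` rather than the older camelCase `createCSIndex` used above; the file, the table name `frames`, and the random data are made up for illustration.

```python
import os
import tempfile

import numpy as np
import tables

# Throwaway data so the sketch runs on its own.
rng = np.random.default_rng(1)
data = np.zeros(5_000, dtype=[("timestamp", "f8")])
data["timestamp"] = rng.random(5_000)

path = os.path.join(tempfile.mkdtemp(), "indexed_demo.h5")
with tables.open_file(path, mode="w") as h5:
    table = h5.create_table("/", "frames", obj=data)
    table.flush()

    # Build a completely sorted index (CSI) on the column of interest.
    table.cols.timestamp.create_csindex()

    # The last entry of the sorted index is the row number of the largest value.
    index = table.colindexes["timestamp"]
    last_row = int(index[-1])
    max_timestamp = table.cols.timestamp[last_row]
```

Building the index has a one-time cost, so this pays off mainly when the maximum is queried repeatedly.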

score 2 · Accepted Answer

From "High Performance Data Management with PyTables & Family" (pdf):

e = sum(row['col1'] for row in table.where(3<table.cols.col2<=20))

Modify it to use max():

e = max(row['col1'] for row in table.where(3<table.cols.col2<=20))
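In PyTables 3.x, `Table.where` takes the condition as a string, so a runnable version of the snippet above might look like the sketch below. The table contents and the name `demo` are invented for illustration; `read_where` is shown as an alternative that returns the matching values as a NumPy array in one call.

```python
import os
import tempfile

import numpy as np
import tables

# Throwaway data: col1 is the value to maximise, col2 is the filter column.
data = np.zeros(1_000, dtype=[("col1", "f8"), ("col2", "i4")])
data["col1"] = np.arange(1_000, dtype="f8")
data["col2"] = np.arange(1_000) % 50

path = os.path.join(tempfile.mkdtemp(), "where_demo.h5")
with tables.open_file(path, mode="w") as h5:
    table = h5.create_table("/", "demo", obj=data)
    table.flush()

    # Condition written as a string: 3 < col2 <= 20.
    condition = "(col2 > 3) & (col2 <= 20)"

    # Row-by-row generator, as in the snippet above.
    e = max(row["col1"] for row in table.where(condition))

    # Alternative: fetch all matching col1 values at once, then reduce.
    selected = table.read_where(condition, field="col1")
    e_vectorised = selected.max()
```

Note that `max()` over a generator raises `ValueError` when no row matches the condition, so a `default=` argument may be worth adding in real code.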
