python - Is there a more idiomatic way to select rows from a PyArrow table based on the contents of a column?

Reposted by: 行者123. Updated: 2023-12-03 08:33:39

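I have a large PyArrow table with a column named index that I would like to use to partition the table; each distinct value of index represents a different quantity in the table.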

Is there an idiomatic way to select rows from a PyArrow table based on the contents of a column?

Here is an example table:

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

# Example table for data schema
irow = np.arange(2**20)
dt = 17
df0 = pd.DataFrame({'timestamp': np.array((irow//2)*dt, dtype=np.int64),
                   'index':     np.array(irow%2, dtype=np.int16),
                   'value':     np.array(irow*0, dtype=np.int32)},
                   columns=['timestamp','index','value'])
ii = df0['index'] == 0
df0.loc[ii,'value'] = irow[ii]//2
ii = df0['index'] == 1
df0.loc[ii,'value'] = (np.sin(df0.loc[ii,'timestamp']*0.01)*10000).astype(np.int32)
table0 = pa.Table.from_pandas(df0)
print(df0)

# prints the following:
         timestamp  index   value
0                0      0       0
1                0      1       0
2               17      0       1
3               17      1    1691
4               34      0       2
...            ...    ...     ...
1048571    8912845      1    9945
1048572    8912862      0  524286
1048573    8912862      1    9978
1048574    8912879      0  524287
1048575    8912879      1    9723

[1048576 rows x 3 columns]

This selection is easy to do in Pandas:

print(df0[df0['index']==1])

# prints the following
         timestamp  index  value
1                0      1      0
3               17      1   1691
5               34      1   3334
7               51      1   4881
9               68      1   6287
...            ...    ...    ...
1048567    8912811      1   9028
1048569    8912828      1   9625
1048571    8912845      1   9945
1048573    8912862      1   9978
1048575    8912879      1   9723

[524288 rows x 3 columns]

With PyArrow, however, I have to go back and forth between PyArrow and numpy or pandas:

value_index = table0.column('index').to_numpy()
# get values of the index column, convert to numpy format
row_indices = np.nonzero(value_index==1)[0]
# find matches and get their indices
selected_table = table0.take(pa.array(row_indices))
# use take() with those indices
v = selected_table.column('value')
print(v.to_numpy())

# which prints
[   0 1691 3334 ... 9945 9978 9723]

Is there a more direct way?

Best answer

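You do not need to convert to numpy to perform a boolean filter. The equal function from the pyarrow.compute module builds the row mask, and the table's filter method applies it: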

import pyarrow.compute as pc

value_index = table0.column('index')
row_mask = pc.equal(value_index, pa.scalar(1, value_index.type))
selected_table = table0.filter(row_mask)

This post is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/64578761/
