rust - DataFusion(Apache Arrow): How to lazily read batches of result?

Reposted by: 行者123. Last updated: 2023-12-03 11:38:48

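I have a DataFusion query. Instead of waiting until all batches have been processed, I would like to run some code as soon as the first batch is ready.
Here is the code that waits for everything and then processes it: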

use datafusion::prelude::*;

// collect() only returns once every batch has been produced
let dataframe = ExecutionContext::new().read_parquet(filename)?;
let batches = dataframe.collect().await?;

for batch in batches {
    // Do something with the record batch
    println!("{:?}", batch.schema());
}

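Instead of a promise of an array of RecordBatch, I would like an array of promises, one per RecordBatch. Does DataFusion provide a way to retrieve only the first batch without waiting for the whole Parquet file to be processed?
I currently have a load time of more than 5 minutes at startup, which is impractical. Using Arrow and Parquet directly would give me immediate access to the first batch, at the cost of a less convenient API and fewer features.
Edit: a minimal example can be found in the DataFusion git repository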

Best answer

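Since the 2.0.0 release there have been recent changes in the master branch to better support async and streaming, so it is worth checking the latest code. However, the DataFrame collect method does load all results into memory before returning, so it is probably not the best approach here.
It may also be a good idea to ask this question on the Arrow mailing list.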
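
The difference the answer points at, collecting everything versus pulling batches one at a time, can be sketched with the standard library alone. This is only an illustrative model, not DataFusion code: `Batch`, `BatchSource`, and the `produced` counter are hypothetical names invented for this sketch, and a plain `Iterator` stands in for whatever record-batch stream your DataFusion version exposes (the exact call depends on the version, so check its documentation).

```rust
// Hypothetical stand-in for an Arrow RecordBatch: just a vector of row ids.
#[derive(Debug)]
struct Batch {
    rows: Vec<u64>,
}

// Hypothetical source that builds batches on demand and counts how many
// it has actually produced so far.
struct BatchSource {
    next_row: u64,
    total_rows: u64,
    batch_size: u64,
    produced: usize,
}

impl BatchSource {
    fn new(total_rows: u64, batch_size: u64) -> Self {
        assert!(batch_size > 0, "batch_size must be positive");
        BatchSource { next_row: 0, total_rows, batch_size, produced: 0 }
    }
}

impl Iterator for BatchSource {
    type Item = Batch;

    // Work happens here, one batch per call, only when the caller asks.
    fn next(&mut self) -> Option<Batch> {
        if self.next_row >= self.total_rows {
            return None;
        }
        let end = (self.next_row + self.batch_size).min(self.total_rows);
        let batch = Batch { rows: (self.next_row..end).collect() };
        self.next_row = end;
        self.produced += 1;
        Some(batch)
    }
}

fn main() {
    // Eager, like collect(): every batch exists before any is processed.
    let all: Vec<Batch> = BatchSource::new(10, 4).collect();
    println!("eager: {} batches materialized up front", all.len());

    // Lazy: only the first batch is built; the rest is never touched
    // unless next() is called again.
    let mut source = BatchSource::new(10, 4);
    if let Some(first) = source.next() {
        println!(
            "lazy: first batch has {} rows, batches produced so far: {}",
            first.rows.len(),
            source.produced
        );
    }
}
```

With the lazy form, startup work can begin on the first batch while the remaining batches are still unread, which is the behavior the question asks for.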

Regarding rust - DataFusion(Apache Arrow): How to lazily read batches of result?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/64333797/
