
apache-spark - What is the difference between Spark's Partition Pruning and Predicate Pushdown?

Reposted · Author: 行者123 · Updated: 2023-12-04 04:16:43

I have been studying Spark optimization techniques and have come across various ways to optimize queries. Two terms in particular caught my attention:

  1. Partition pruning
  2. Predicate pushdown

They are described as follows:

Partition pruning:

Partition pruning is a performance optimization that limits the number of files and partitions that Spark reads when querying. After partitioning the data, queries that match certain partition filter criteria improve performance by allowing Spark to only read a subset of the directories and files.
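To make the idea concrete, here is a small toy sketch in plain Python (not Spark itself) of how pruning over a Hive-style partitioned directory layout works: the filter is matched against directory names first, so files in non-matching partitions are never opened. All names and the layout are illustrative assumptions.

```python
import csv
import os
import tempfile

# Write data partitioned by year, Hive-style: <root>/year=2021/part.csv, ...
root = tempfile.mkdtemp()
for year in (2021, 2022, 2023):
    part_dir = os.path.join(root, f"year={year}")
    os.makedirs(part_dir)
    with open(os.path.join(part_dir, "part.csv"), "w", newline="") as f:
        csv.writer(f).writerows([[year, amount] for amount in (10, 20)])

def read_with_pruning(root, wanted_year):
    """Read rows for one year, skipping non-matching partition directories."""
    files_opened = 0
    rows = []
    for entry in sorted(os.listdir(root)):        # directory listing only
        key, _, value = entry.partition("=")
        if key == "year" and int(value) != wanted_year:
            continue                              # pruned: file never opened
        with open(os.path.join(root, entry, "part.csv"), newline="") as f:
            files_opened += 1
            rows.extend([int(c) for c in r] for r in csv.reader(f))
    return rows, files_opened

rows, opened = read_with_pruning(root, 2022)
# Only the year=2022 directory is read; the other two are skipped outright.
```

Spark does the analogous thing when a filter references a partition column: it resolves the filter against the discovered partition directories and reads only the matching subset.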

Predicate pushdown:

Spark will attempt to move filtering of data as close to the source as possible to avoid loading unnecessary data into memory. Parquet and ORC files maintain various stats about each column in different chunks of data (such as min and max values). Programs reading these files can use these indexes to determine if certain chunks, and even entire files, need to be read at all. This allows programs to potentially skip over huge portions of the data during processing.
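The stats-based skipping described above can be sketched with another toy Python example (again, an illustration in the spirit of Parquet/ORC chunk statistics, not the actual file-format code): each chunk carries min/max values for a column, and a predicate pushed down to the reader lets it skip chunks whose stats cannot possibly match.

```python
# Each chunk records min/max stats for its column values,
# mimicking Parquet row-group / ORC stripe statistics.
chunks = [
    {"min": 1,  "max": 10, "rows": [1, 5, 10]},
    {"min": 11, "max": 20, "rows": [11, 15, 20]},
    {"min": 21, "max": 30, "rows": [21, 25, 30]},
]

def scan_gt(chunks, threshold):
    """Return values > threshold, counting how many chunks were actually read."""
    chunks_read = 0
    out = []
    for chunk in chunks:
        if chunk["max"] <= threshold:
            continue                  # whole chunk skipped using stats alone
        chunks_read += 1              # chunk must be decoded and row-filtered
        out.extend(v for v in chunk["rows"] if v > threshold)
    return out, chunks_read

values, read = scan_gt(chunks, 18)
# The first chunk (max=10) is skipped without being read at all.
```

The key point is that the filtering decision happens inside the reader, before rows ever reach the query engine's memory.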

Reading the descriptions above, the two seem to do the same thing: apply a read (query) that only fetches data satisfying the given predicate. Are partition pruning and predicate pushdown distinct concepts, or am I looking at them the wrong way?

Best answer

The difference lies in who applies the optimization, where it is applied, and which data sources it can be applied to.

  • Partition pruning is applied by Spark itself, before it delegates to the data source that handles the file format. It applies only to file-based formats, since data sources have no notion of partition discovery yet.

  • Predicate pushdown delegates row filtering to the data source (Spark's term for the component that handles a particular format). Predicate pushdown works with both file-based and non-file-based sources, such as RDBMS and NoSQL databases.

Regarding "apache-spark - What is the difference between Spark's Partition Pruning and Predicate Pushdown?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60613615/
