
apache-spark - What is the difference between Spark's Partition Pruning and Predicate Pushdown?

Reposted · Author: 行者123 · Updated: 2023-12-04 04:16:43

I have been studying Spark optimization techniques and have come across various ways to optimize queries. Two terms in particular caught my attention:

  1. Partition pruning
  2. Predicate pushdown

They are described as follows:

Partition pruning:

Partition pruning is a performance optimization that limits the number of files and partitions that Spark reads when querying. After partitioning the data, queries that match certain partition filter criteria improve performance by allowing Spark to only read a subset of the directories and files.
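To make the idea concrete, here is a small toy sketch in plain Python (not Spark itself) of how pruning over a Hive-style partitioned directory layout works: the filter is matched against directory names first, so files in non-matching partitions are never opened. All names and the layout are illustrative assumptions.

```python
import csv
import os
import tempfile

# Write data partitioned by year, Hive-style: <root>/year=2021/part.csv, ...
root = tempfile.mkdtemp()
for year in (2021, 2022, 2023):
    part_dir = os.path.join(root, f"year={year}")
    os.makedirs(part_dir)
    with open(os.path.join(part_dir, "part.csv"), "w", newline="") as f:
        csv.writer(f).writerows([[year, amount] for amount in (10, 20)])

def read_with_pruning(root, wanted_year):
    """Read rows for one year, skipping non-matching partition directories."""
    files_opened = 0
    rows = []
    for entry in sorted(os.listdir(root)):        # directory listing only
        key, _, value = entry.partition("=")
        if key == "year" and int(value) != wanted_year:
            continue                              # pruned: file never opened
        with open(os.path.join(root, entry, "part.csv"), newline="") as f:
            files_opened += 1
            rows.extend([int(c) for c in r] for r in csv.reader(f))
    return rows, files_opened

rows, opened = read_with_pruning(root, 2022)
# Only the year=2022 directory is read; the other two are skipped outright.
```

Spark does the analogous thing when a filter references a partition column: it resolves the filter against the discovered partition directories and reads only the matching subset.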

Predicate pushdown:

Spark will attempt to move filtering of data as close to the source as possible to avoid loading unnecessary data into memory. Parquet and ORC files maintain various stats about each column in different chunks of data (such as min and max values). Programs reading these files can use these indexes to determine if certain chunks, and even entire files, need to be read at all. This allows programs to potentially skip over huge portions of the data during processing.
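The stats-based skipping described above can be sketched with another toy Python example (again, an illustration in the spirit of Parquet/ORC chunk statistics, not the actual file-format code): each chunk carries min/max values for a column, and a predicate pushed down to the reader lets it skip chunks whose stats cannot possibly match.

```python
# Each chunk records min/max stats for its column values,
# mimicking Parquet row-group / ORC stripe statistics.
chunks = [
    {"min": 1,  "max": 10, "rows": [1, 5, 10]},
    {"min": 11, "max": 20, "rows": [11, 15, 20]},
    {"min": 21, "max": 30, "rows": [21, 25, 30]},
]

def scan_gt(chunks, threshold):
    """Return values > threshold, counting how many chunks were actually read."""
    chunks_read = 0
    out = []
    for chunk in chunks:
        if chunk["max"] <= threshold:
            continue                  # whole chunk skipped using stats alone
        chunks_read += 1              # chunk must be decoded and row-filtered
        out.extend(v for v in chunk["rows"] if v > threshold)
    return out, chunks_read

values, read = scan_gt(chunks, 18)
# The first chunk (max=10) is skipped without being read at all.
```

The key point is that the filtering decision happens inside the reader, before rows ever reach the query engine's memory.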

Reading the descriptions above, the two seem to do the same thing: apply a read (query) that only fetches data satisfying the given predicate. Are partition pruning and predicate pushdown distinct concepts, or am I looking at them the wrong way?

Best answer

The difference lies in who applies the optimization, where it is applied, and which data sources it can be applied to.

  • Partition pruning is applied by Spark itself, before it delegates to the data source that handles the file format. It applies only to file-based formats, since data sources have no notion of partition discovery yet.

  • Predicate pushdown delegates row filtering to the data source (Spark's term for the component that handles a particular format). Predicate pushdown works with both file-based and non-file-based sources, such as RDBMS and NoSQL databases.

Regarding "apache-spark - What is the difference between Spark's Partition Pruning and Predicate Pushdown?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60613615/
