hadoop - Spark partition pruning doesn't work on 1.6.0

Reposted. Author: 可可西里. Updated: 2023-11-01 15:27:15

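I created partitioned Parquet files on HDFS and defined a Hive external table over them. When I query the table with a filter on the partition column, Spark scans the files of every partition instead of only the matching one. We are using Spark 1.6.0.

DataFrame: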

df = hivecontext.createDataFrame([
    ("class1", "Economics", "name1", None),
    ("class2", "Economics", "name2", 92),
    ("class2", "CS", "name2", 92),
    ("class1", "CS", "name1", 92)
], ["class", "subject", "name", "marks"])

Creating the Parquet partitions:

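# Compress the Parquet output with snappy
hivecontext.setConf("spark.sql.parquet.compression.codec", "snappy")
# Use the Hive SerDe for metastore Parquet tables instead of Spark's built-in Parquet support
hivecontext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
# Write one subject=<value> directory per distinct subject (df is the DataFrame created above)
df.write.parquet("/transient/testing/students", mode="overwrite", partitionBy='subject')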

Query:

df = hivecontext.sql('select * from vatmatching_stage.students where subject = "Economics"')
df.show()

+------+-----+-----+---------+
| class| name|marks|  subject|
+------+-----+-----+---------+
|class1|name1|    0|Economics|
|class2|name2|   92|Economics|
+------+-----+-----+---------+

df.explain(True)

== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
+- 'Filter ('subject = Economics)
   +- 'UnresolvedRelation `vatmatching_stage`.`students`, None

== Analyzed Logical Plan ==
class: string, name: string, marks: bigint, subject: string
Project [class#90,name#91,marks#92L,subject#89]
+- Filter (subject#89 = Economics)
   +- Subquery students
      +- Relation[class#90,name#91,marks#92L,subject#89] ParquetRelation: vatmatching_stage.students

== Optimized Logical Plan ==
Project [class#90,name#91,marks#92L,subject#89]
+- Filter (subject#89 = Economics)
   +- Relation[class#90,name#91,marks#92L,subject#89] ParquetRelation: vatmatching_stage.students

== Physical Plan ==
Scan ParquetRelation: vatmatching_stage.students[class#90,name#91,marks#92L,subject#89] InputPaths: hdfs://dev4/transient/testing/students/subject=Art, hdfs://dev4/transient/testing/students/subject=Civil, hdfs://dev4/transient/testing/students/subject=CS, hdfs://dev4/transient/testing/students/subject=Economics, hdfs://dev4/transient/testing/students/subject=Music

However, if I run the same query in the Hive browser, the output shows that Hive does perform partition pruning.

location hdfs://testing/students/subject=Economics
name vatmatching_stage.students
numFiles 1
numRows -1
partition_columns subject
partition_columns.types string
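
To make the difference between the two plans concrete, here is a minimal sketch in plain Python of what partition pruning amounts to for this directory layout. It is not Spark or Hive internals; the helper names `partition_value` and `prune_partitions` are hypothetical, and the paths are copied from the InputPaths in the physical plan above. With pruning, a filter on `subject` should leave only the `subject=Economics` directory to scan.

```python
def partition_value(path, column):
    """Return the value encoded as `column=value` in the last path segment, or None."""
    last_segment = path.rstrip("/").rsplit("/", 1)[-1]
    key, sep, value = last_segment.partition("=")
    return value if sep and key == column else None


def prune_partitions(paths, column, wanted):
    """Keep only the partition directories whose `column` value equals `wanted`."""
    return [p for p in paths if partition_value(p, column) == wanted]


# Partition directories listed under InputPaths in the physical plan
input_paths = [
    "hdfs://dev4/transient/testing/students/subject=Art",
    "hdfs://dev4/transient/testing/students/subject=Civil",
    "hdfs://dev4/transient/testing/students/subject=CS",
    "hdfs://dev4/transient/testing/students/subject=Economics",
    "hdfs://dev4/transient/testing/students/subject=Music",
]

# Equivalent of `where subject = "Economics"` applied to directories, not rows
pruned = prune_partitions(input_paths, "subject", "Economics")
print(pruned)
```

The unpruned plan lists all five directories, which is the symptom described in the question.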

Is this a limitation of Spark 1.6.0, or am I missing something here?

Best answer

I found the root cause of this issue. The HiveContext used to query the table did not have spark.sql.hive.convertMetastoreParquet set to "false". It was still set to "true", the default value.

Once I set it to "false" on that context, I could see that partition pruning was being used.
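
As a rough sketch of that fix, the setting can be applied on the same context that runs the query, right before the query is issued. The helper name `query_with_hive_serde` is hypothetical; it assumes only that the object passed in is the `HiveContext` used for querying and relies on its `setConf` and `sql` methods, both of which appear in the question.

```python
CONVERT_KEY = "spark.sql.hive.convertMetastoreParquet"


def query_with_hive_serde(hivecontext, query):
    """Run `query` after disabling Parquet conversion on this context.

    `hivecontext` is assumed to be the HiveContext that issues the query.
    The setting has to be made on that context; setting it on a different
    context (for example, the one that wrote the files) does not carry over.
    """
    hivecontext.setConf(CONVERT_KEY, "false")
    return hivecontext.sql(query)


# Usage, assuming `hivecontext` is an existing HiveContext:
#   df = query_with_hive_serde(
#       hivecontext,
#       'select * from vatmatching_stage.students where subject = "Economics"')
#   df.explain(True)  # check which partition paths the physical plan reads
```

Checking the output of `df.explain(True)` afterwards is the same verification used in the question.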

Regarding "hadoop - Spark partition pruning doesn't work on 1.6.0", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42622275/
