
apache-spark - Why is my partitioned Parquet data slower than non-partitioned data?

Reposted · Author: 行者123 · Updated: 2023-12-05 05:16:58

My understanding is that if I partition my data on a column, then queries filtering on that column should be faster. But when I tried it, the partitioned data actually seems slower. Why?

I have a DataFrame that I tried writing both partitioned by yearmonth and without partitioning.

So I have one dataset partitioned by creation_yearmonth:

questionsCleanedDf.repartition("creation_yearmonth") \
.write.partitionBy('creation_yearmonth') \
.parquet('wasb://.../parquet/questions.parquet')

And one without partitioning:

questionsCleanedDf \
.write \
.parquet('wasb://.../parquet/questions_nopartition.parquet')

Then I created DataFrames from these two Parquet datasets and ran the same query against each:

questionsDf = spark.read.parquet('wasb://.../parquet/questions.parquet')

questionsNoPartitionDf = spark.read.parquet('wasb://.../parquet/questions_nopartition.parquet')

The query:

spark.sql("""
SELECT * FROM questions
WHERE creation_yearmonth = 201606
""")

The unpartitioned dataset is consistently as fast or faster (~2–3 s), while the partitioned one is slightly slower (~3–4 s).

I ran explain on both.

For the partitioned dataset:

== Physical Plan ==
*FileScan parquet [id#6404,title#6405,tags#6406,owner_user_id#6407,accepted_answer_id#6408,view_count#6409,answer_count#6410,comment_count#6411,creation_date#6412,favorite_count#6413,creation_yearmonth#6414] Batched: false, Format: Parquet, Location: InMemoryFileIndex[wasb://data@cs4225.blob.core.windows.net/parquet/questions.parquet], PartitionCount: 1, PartitionFilters: [isnotnull(creation_yearmonth#6414), (creation_yearmonth#6414 = 201606)], PushedFilters: [], ReadSchema: struct<id:int,title:string,tags:array<string>,owner_user_id:int,accepted_answer_id:int,view_count...

It shows PartitionCount: 1 — shouldn't this be faster, since in this case the scan can go straight to the single matching partition?

For the unpartitioned one:

== Physical Plan ==
*Project [id#6440, title#6441, tags#6442, owner_user_id#6443, accepted_answer_id#6444, view_count#6445, answer_count#6446, comment_count#6447, creation_date#6448, favorite_count#6449, creation_yearmonth#6450]
+- *Filter (isnotnull(creation_yearmonth#6450) && (creation_yearmonth#6450 = 201606))
+- *FileScan parquet [id#6440,title#6441,tags#6442,owner_user_id#6443,accepted_answer_id#6444,view_count#6445,answer_count#6446,comment_count#6447,creation_date#6448,favorite_count#6449,creation_yearmonth#6450] Batched: false, Format: Parquet, Location: InMemoryFileIndex[wasb://data@cs4225.blob.core.windows.net/parquet/questions_nopartition.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(creation_yearmonth), EqualTo(creation_yearmonth,201606)], ReadSchema: struct<id:int,title:string,tags:array<string>,owner_user_id:int,accepted_answer_id:int,view_count...

Also unexpected: the dataset originally stored the date as a string, so I had to write the query like this:

spark.sql("""
SELECT * FROM questions
WHERE CAST(creation_date AS date) BETWEEN '2017-06-01' AND '2017-07-01'
""").show(20, False)

I expected this to be slower, but it turns out to perform best, at ~1–2 s. Why is that? I would have thought it needs to cast every row?

The explain output here:

== Physical Plan ==
*Project [id#6521, title#6522, tags#6523, owner_user_id#6524, accepted_answer_id#6525, view_count#6526, answer_count#6527, comment_count#6528, creation_date#6529, favorite_count#6530]
+- *Filter ((isnotnull(creation_date#6529) && (cast(cast(creation_date#6529 as date) as string) >= 2017-06-01)) && (cast(cast(creation_date#6529 as date) as string) <= 2017-07-01))
+- *FileScan parquet [id#6521,title#6522,tags#6523,owner_user_id#6524,accepted_answer_id#6525,view_count#6526,answer_count#6527,comment_count#6528,creation_date#6529,favorite_count#6530] Batched: false, Format: Parquet, Location: InMemoryFileIndex[wasb://data@cs4225.blob.core.windows.net/filtered/questions.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(creation_date)], ReadSchema: struct<id:string,title:string,tags:array<string>,owner_user_id:string,accepted_answer_id:string,v...

Best answer

Over-partitioning can actually degrade performance:

If a column has only a few rows matching each value, the number of directories to process can become a limiting factor, and the data file in each directory could be too small to take advantage of the Hadoop mechanism for transmitting data in multi-megabyte blocks.

This excerpt is from the documentation of a different Hadoop component, Impala, but the argument should hold for all components of the Hadoop stack.

I think that, regardless of the partitioning scheme, the advantages of partitioning will not become apparent until the table grows well beyond 900 MB.

Regarding "apache-spark - Why is my partitioned Parquet data slower than non-partitioned data?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49776752/
