
java - Spark : how does partitionBy (DataFrameWriter) actually work?

Reposted · Author: 行者123 · Updated: 2023-12-01 19:35:09

I have been using partitionBy, but I don't quite understand why we should use it.

I have CSV records like this:

+------+-----+--------------+----------+
| name | age | entranceDate | dropDate |
+------+-----+--------------+----------+
| Tom  | 12  | 2019-10-01   | null     |
| Mary | 15  | 2019-10-01   | null     |
+------+-----+--------------+----------+

What happens if I use:

String[] partitions = new String[] { "name", "entranceDate" };

df.write()
  .partitionBy(partitions)
  .mode(SaveMode.Append)
  .parquet(parquetPath);

And what happens if I partition on a column that is null:

String[] partitions = new String[] { "name", "dropDate" };

df.write()
  .partitionBy(partitions)
  .mode(SaveMode.Append)
  .parquet(parquetPath);

Could anyone explain how it works? Thanks.

Best Answer

The behavior of df.write.partitionBy is as follows:

  • For every partition of the dataframe, get the unique values of the columns in the partitionBy argument
  • Write the data for every unique combination to a different file
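The two steps above can be sketched without Spark at all. The plain-Java toy below (rows modeled as maps, with a hypothetical partitionDir helper, not a Spark API) only mimics how DataFrameWriter.partitionBy derives a Hive-style directory from each unique combination of partition-column values:

```java
import java.util.*;
import java.util.stream.*;

public class PartitionByDemo {
    // Builds the Hive-style directory segment for one row, e.g.
    // "name=Tom/entranceDate=2019-10-01". In Spark this mapping is
    // done internally by DataFrameWriter.partitionBy.
    static String partitionDir(Map<String, String> row, List<String> partitionCols) {
        return partitionCols.stream()
            .map(c -> c + "=" + row.get(c))
            .collect(Collectors.joining("/"));
    }

    public static void main(String[] args) {
        // The two rows of the sample CSV from the question.
        List<Map<String, String>> rows = List.of(
            Map.of("name", "Tom", "age", "12", "entranceDate", "2019-10-01"),
            Map.of("name", "Mary", "age", "15", "entranceDate", "2019-10-01"));

        // Each unique (name, entranceDate) combination becomes its own directory.
        Set<String> dirs = rows.stream()
            .map(r -> partitionDir(r, List.of("name", "entranceDate")))
            .collect(Collectors.toCollection(TreeSet::new));
        System.out.println(dirs);
        // [name=Mary/entranceDate=2019-10-01, name=Tom/entranceDate=2019-10-01]
    }
}
```

With two distinct (name, entranceDate) pairs in the input, the write produces two directories, one per combination, regardless of how many rows share a combination.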

In the example above, suppose your dataframe has 10 partitions, and assume partitions 1-5 each contain 5 unique combinations of name and entranceDate while partitions 6-10 each contain 10 unique combinations. Every combination of name and entranceDate is written to a different file, so partitions 1-5 each write 5 files and partitions 6-10 each write 10 files. The total number of files produced by the write operation is 5*5 + 5*10 = 75. partitionBy looks at the unique values of the column combination. From the API documentation:

Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like: - year=2016/month=01/ - year=2016/month=02/

Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.

This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
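The "coarse-grained index" the documentation mentions works in reverse on read: Spark only visits directories whose name=value segments satisfy the query predicate. A minimal plain-Java sketch of that pruning idea (the parse helper is hypothetical, not a Spark API):

```java
import java.util.*;
import java.util.stream.*;

public class PartitionPruning {
    // Parses a Hive-style path like "year=2016/month=01" into a column->value map.
    static Map<String, String> parse(String dir) {
        return Arrays.stream(dir.split("/"))
            .map(kv -> kv.split("=", 2))
            .collect(Collectors.toMap(kv -> kv[0], kv -> kv[1]));
    }

    public static void main(String[] args) {
        // Directory layout produced by partitionBy("year", "month").
        List<String> dirs = List.of(
            "year=2016/month=01", "year=2016/month=02", "year=2017/month=01");

        // Predicate year = 2016: only the matching directories are read;
        // files under year=2017 are skipped entirely.
        List<String> toRead = dirs.stream()
            .filter(d -> "2016".equals(parse(d).get("year")))
            .collect(Collectors.toList());
        System.out.println(toRead);
        // [year=2016/month=01, year=2016/month=02]
    }
}
```

This is why the documentation recommends keeping the number of distinct partition values modest: the speedup comes from skipping whole directories, but millions of tiny directories make listing the filesystem itself the bottleneck.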

If one of the columns in the partitionBy clause has the same value for all rows, the data is split according to the values of the other columns in the partitionBy argument.
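As for the dropDate column from the question, which is null in every row: Spark writes null partition values under the Hive placeholder directory __HIVE_DEFAULT_PARTITION__. A plain-Java sketch of the resulting path (no Spark dependency; NULL_DIR mirrors that placeholder, and partitionDir is a hypothetical helper):

```java
import java.util.*;
import java.util.stream.*;

public class NullPartitionDemo {
    // The directory name Spark (following Hive) uses for a null partition value.
    static final String NULL_DIR = "__HIVE_DEFAULT_PARTITION__";

    static String partitionDir(Map<String, String> row, List<String> cols) {
        return cols.stream()
            .map(c -> c + "=" + (row.get(c) == null ? NULL_DIR : row.get(c)))
            .collect(Collectors.joining("/"));
    }

    public static void main(String[] args) {
        // Tom's row: dropDate is null in the sample CSV.
        Map<String, String> tom = new HashMap<>();
        tom.put("name", "Tom");
        tom.put("dropDate", null);

        System.out.println(partitionDir(tom, List.of("name", "dropDate")));
        // name=Tom/dropDate=__HIVE_DEFAULT_PARTITION__
    }
}
```

So partitioning by a column that is always null is legal but useless: every row lands in the same placeholder directory, and only the other partition columns actually split the data.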

About java - Spark: how does partitionBy (DataFrameWriter) actually work?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58059462/
