
apache-spark - Do Spark/Parquet partitions preserve ordering?


If I partition a dataset, will the rows come back in the original order when I read it back? For example, consider the following pyspark code:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# read a csv
df = sql_context.read.csv(input_filename)

# add a hash column that buckets customer_id into 4 groups
hash_udf = udf(lambda customer_id: hash(customer_id) % 4, IntegerType())
df = df.withColumn('hash', hash_udf(df['customer_id']))

# write out to parquet, partitioned by the hash column
df.write.parquet(output_path, partitionBy=['hash'])

# read back the file
df2 = sql_context.read.parquet(output_path)

Here I am bucketing by customer_id. When I read the whole dataset back, are the partitions guaranteed to be merged back together in the original insertion order?

Right now I'm not sure, so I added a sequence column:

from pyspark.sql.functions import monotonically_increasing_id

df = df.withColumn('seq', monotonically_increasing_id())

I don't know whether this is redundant, though.
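For context, the intended read path would then sort on this column to restore the original order. A minimal sketch, assuming the seq column is persisted along with the data (note that monotonically_increasing_id() produces ids that are increasing but not necessarily consecutive):

# read back and restore the original order via the persisted sequence column
df2 = sql_context.read.parquet(output_path).orderBy('seq')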

Best Answer

No, there is no such guarantee. You can verify this even with a tiny dataset:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

df = spark.createDataFrame([(1,'a'),(2,'b'),(3,'c'),(4,'d')], ['customer_id', 'name'])

# add a hash column
hash_udf = udf(lambda customer_id: hash(customer_id) % 4, IntegerType())
df = df.withColumn('hash', hash_udf(df['customer_id']))

# write out to parquet
df.write.parquet("test", partitionBy=['hash'], mode="overwrite")

# read back the file
df2 = spark.read.parquet("test")
df.show()

+-----------+----+----+
|customer_id|name|hash|
+-----------+----+----+
| 1| a| 1|
| 2| b| 2|
| 3| c| 3|
| 4| d| 0|
+-----------+----+----+
df2.show()

+-----------+----+----+
|customer_id|name|hash|
+-----------+----+----+
| 2| b| 2|
| 1| a| 1|
| 4| d| 0|
| 3| c| 3|
+-----------+----+----+
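So the seq column from the question is not redundant: to recover a specific order, you have to sort explicitly after reading. A minimal sketch reusing the example above, sorting on customer_id (sorting on a persisted seq column works the same way):

# an explicit sort after the read yields a deterministic order
df2.orderBy('customer_id').show()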

Regarding apache-spark - Do Spark/Parquet partitions preserve ordering?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55054306/
