
apache-spark - SparkSQL : How to specify partitioning column while loading dataset from database


I am using Spark 2.3 and loading data from MySQL via JDBC, as follows:

  val dataSet: Dataset[Row] = _spark
    .read
    .format("jdbc")
    .options(Map(
      "url"      -> jdbcUrl,
      "user"     -> username,
      "password" -> password,
      "dbtable"  -> dataSourceTableName,
      "driver"   -> driver
    ))
    .load()

I would like to partition the dataset by a specific column in the table. How can I do that?

Best Answer

You need to specify the partitionColumn, upperBound, lowerBound, and numPartitions options.

These are described in the properties table of the JDBC documentation for Spark SQL.

These options must all be specified if any of them is specified. In addition, numPartitions must be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric, date, or timestamp column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned. This option applies only to reading.
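As an illustration only (the partition column name "id" and the bound values are assumptions, not from the question), the original read could be extended with these four options, all passed as strings:

  val partitionedDataSet: Dataset[Row] = _spark
    .read
    .format("jdbc")
    .options(Map(
      "url"             -> jdbcUrl,
      "user"            -> username,
      "password"        -> password,
      "dbtable"         -> dataSourceTableName,
      "driver"          -> driver,
      "partitionColumn" -> "id",  // assumed numeric, date, or timestamp column in the table
      "lowerBound"      -> "0",   // assumed bounds; used only to compute the partition stride
      "upperBound"      -> "500",
      "numPartitions"   -> "5"
    ))
    .load()

Spark then issues one query per partition in parallel, each covering a slice of the partition column's range, as in the example queries further down.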

A further explanation of the upperBound and lowerBound parameters can be found in @PIYUSH PASARI's answer.

He gives the following example of the queries generated with upperBound = 500, lowerBound = 0, and numPartitions = 5, which yield a partition stride of (500 - 0) / 5 = 100:

SELECT * FROM table WHERE partitionColumn < 100 or partitionColumn is null
SELECT * FROM table WHERE partitionColumn >= 100 AND partitionColumn < 200
SELECT * FROM table WHERE partitionColumn >= 200 AND partitionColumn < 300
SELECT * FROM table WHERE partitionColumn >= 300 AND partitionColumn < 400
...
SELECT * FROM table WHERE partitionColumn >= 400

This can be seen from the code in JDBCRelation.scala.

As you can see, all rows are fetched, but if your upper and lower bounds do not cover the whole data range, the first and last partitions may be larger than the others. If you cannot be sure of the upper and lower bounds, want evenly sized partitions, and do not care about fetching every row, you can always set the upper and lower bounds as conditions in the dbtable parameter, as sketched below.
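For instance (a sketch only; the table name mytable and column id are hypothetical), the bounds can be pushed into the dbtable option as an aliased subquery so that the database filters the rows, while the same values are reused for the partitioning options:

  // Hypothetical table "mytable" and numeric column "id"; the subquery must be aliased.
  val boundedTable = "(SELECT * FROM mytable WHERE id >= 0 AND id < 500) AS t"

  val dataSet: Dataset[Row] = _spark
    .read
    .format("jdbc")
    .options(Map(
      "url"             -> jdbcUrl,
      "user"            -> username,
      "password"        -> password,
      "dbtable"         -> boundedTable,
      "driver"          -> driver,
      "partitionColumn" -> "id",
      "lowerBound"      -> "0",
      "upperBound"      -> "500",
      "numPartitions"   -> "5"
    ))
    .load()

With the bounds enforced in the subquery, every partition covers an equal slice of the range and no rows outside it are returned.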

Regarding apache-spark - SparkSQL: How to specify partitioning column while loading dataset from database, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53928073/
