
apache-spark - SparkSQL : How to specify partitioning column while loading dataset from database


I am using Spark 2.3 and loading data from MySQL via JDBC, as follows:

  val dataSet: Dataset[Row] = _spark
    .read
    .format("jdbc")
    .options(Map(
      "url"      -> jdbcUrl,
      "user"     -> username,
      "password" -> password,
      "dbtable"  -> dataSourceTableName,
      "driver"   -> driver
    ))
    .load()

I would like to partition the dataset by a specific column in the table. How can I do that?

Best Answer

You need to specify the partitionColumn, upperBound, lowerBound, and numPartitions options.

These are described in the properties table of the JDBC documentation for Spark SQL.

These options must all be specified if any of them is specified. In addition, numPartitions must be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric, date, or timestamp column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned. This option applies only to reading.
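As an illustration only (the partition column name "id" and the bound values are assumptions, not from the question), the original read could be extended with these four options, all passed as strings:

  val partitionedDataSet: Dataset[Row] = _spark
    .read
    .format("jdbc")
    .options(Map(
      "url"             -> jdbcUrl,
      "user"            -> username,
      "password"        -> password,
      "dbtable"         -> dataSourceTableName,
      "driver"          -> driver,
      "partitionColumn" -> "id",  // assumed numeric, date, or timestamp column in the table
      "lowerBound"      -> "0",   // assumed bounds; used only to compute the partition stride
      "upperBound"      -> "500",
      "numPartitions"   -> "5"
    ))
    .load()

Spark then issues one query per partition in parallel, each covering a slice of the partition column's range, as in the example queries further down.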

A further explanation of the upperBound and lowerBound parameters can be found in @PIYUSH PASARI's answer.

He gives the following example of the queries generated with upperBound = 500, lowerBound = 0, and numPartitions = 5, which yield a partition stride of (500 - 0) / 5 = 100:

SELECT * FROM table WHERE partitionColumn < 100 or partitionColumn is null
SELECT * FROM table WHERE partitionColumn >= 100 AND partitionColumn < 200
SELECT * FROM table WHERE partitionColumn >= 200 AND partitionColumn < 300
SELECT * FROM table WHERE partitionColumn >= 300 AND partitionColumn < 400
...
SELECT * FROM table WHERE partitionColumn >= 400

This can be seen from the code in JDBCRelation.scala.

As you can see, all rows are fetched, but if your upper and lower bounds do not cover the whole data range, the first and last partitions may be larger than the others. If you cannot be sure of the upper and lower bounds, want evenly sized partitions, and do not care about fetching every row, you can always set the upper and lower bounds as conditions in the dbtable parameter, as sketched below.
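For instance (a sketch only; the table name mytable and column id are hypothetical), the bounds can be pushed into the dbtable option as an aliased subquery so that the database filters the rows, while the same values are reused for the partitioning options:

  // Hypothetical table "mytable" and numeric column "id"; the subquery must be aliased.
  val boundedTable = "(SELECT * FROM mytable WHERE id >= 0 AND id < 500) AS t"

  val dataSet: Dataset[Row] = _spark
    .read
    .format("jdbc")
    .options(Map(
      "url"             -> jdbcUrl,
      "user"            -> username,
      "password"        -> password,
      "dbtable"         -> boundedTable,
      "driver"          -> driver,
      "partitionColumn" -> "id",
      "lowerBound"      -> "0",
      "upperBound"      -> "500",
      "numPartitions"   -> "5"
    ))
    .load()

With the bounds enforced in the subquery, every partition covers an equal slice of the range and no rows outside it are returned.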

Regarding apache-spark - SparkSQL: How to specify partitioning column while loading dataset from database, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53928073/
