- html - 出于某种原因,IE8 对我的 Sass 文件中继承的 html5 CSS 不友好?
- JMeter 在响应断言中使用 span 标签的问题
- html - 在 :hover and :active? 上具有不同效果的 CSS 动画
- html - 相对于居中的 html 内容固定的 CSS 重复背景?
在我的应用程序中,我比较了两个不同的数据集(即来自 Hive 的源表和来自 RDBMS 的目标表)的重复和不匹配,它适用于较小的数据集但是当我尝试比较超过 1GB 的数据时(单独的源)它挂起并抛出 TIMEOUT ERROR
,我尝试了 .config("spark.network.timeout", "600s")
即使在增加网络超时后它抛出 java. lang.OutOfMemoryError:超出 GC 开销限制
。
val spark = SparkSession.builder().master("local")
.appName("spark remote")
.config("javax.jdo.option.ConnectionURL", "jdbc:mysql://192.168.175.160:3306/metastore?useSSL=false")
.config("javax.jdo.option.ConnectionUserName", "hiveroot")
.config("javax.jdo.option.ConnectionPassword", "hivepassword")
.config("hive.exec.scratchdir", "/tmp/hive/${user.name}")
.config("hive.metastore.uris", "thrift://192.168.175.160:9083")
.enableHiveSupport()
.getOrCreate()
import spark.implicits._
import spark.sql
val source = spark.sql("SELECT * from sample.source").rdd.map(_.mkString(","))
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
val sparkdestination = SparkSession.builder().master("local").appName("Database")
.config("spark.network.timeout", "600s")
.getOrCreate()
val jdbcUsername = "root"
val jdbcPassword = "root"
val url = "jdbc:mysql://192.168.175.35:3306/sample?useSSL=false"
val connectionProperties = new java.util.Properties()
connectionProperties.put("user", jdbcUsername)
connectionProperties.put("password", jdbcPassword)
val queryDestination = "(select * from destination) as dest"
val destination = sparkdestination.read.jdbc(url, queryDestination, connectionProperties).rdd.map(_.mkString(","))
我也尝试过 destination.persist(StorageLevel.MEMORY_AND_DISK_SER)
(MEMORY_AND_DISK,DISK_ONLY) 方法,但没有成功。
编辑:这是原始错误堆栈:
17/07/11 12:49:43 INFO DAGScheduler: Submitting 22 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[13] at map at stack.scala:76)
17/07/11 12:49:43 INFO TaskSchedulerImpl: Adding task set 1.0 with 22 tasks
17/07/11 12:49:43 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/07/11 12:51:38 INFO JDBCRDD: closed connection
17/07/11 12:51:38 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2210)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1989)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3410)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:470)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3112)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2341)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2736)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2490)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1966)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:301)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
17/07/11 12:51:38 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2210)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1989)
17/07/11 12:49:43 INFO DAGScheduler: Submitting 22 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[13] at map at stack.scala:76)
17/07/11 12:49:43 INFO TaskSchedulerImpl: Adding task set 1.0 with 22 tasks
17/07/11 12:49:43 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/07/11 12:51:38 INFO JDBCRDD: closed connection
17/07/11 12:51:38 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2210)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1989)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3410)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:470)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3112)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2341)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2736)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2490)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1966)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:301)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
17/07/11 12:51:38 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2210)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1989)
编辑 2:
我尝试使用:
val options = Map(
"url" -> "jdbc:mysql://192.168.175.35:3306/sample?useSSL=false",
"dbtable" -> queryDestination,
"user" -> "root",
"password" -> "root")
val destination = sparkdestination.read.options(options).jdbc(options("url"), options("dbtable"), "0", 1, 5, 4, new java.util.Properties()).rdd.map(_.mkString(","))
我用少量数据检查了它的工作情况,但在使用大型数据集时它抛出了下面的错误
错误
17/07/11 14:12:46 INFO DAGScheduler: looking for newly runnable stages
17/07/11 14:12:46 INFO DAGScheduler: running: Set(ShuffleMapStage 1)
17/07/11 14:12:46 INFO DAGScheduler: waiting: Set(ResultStage 2)
17/07/11 14:12:46 INFO DAGScheduler: failed: Set()
17/07/11 14:12:50 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.175.160:39913 in memory (size: 19.9 KB, free: 353.4 MB)
17/07/11 14:14:47 WARN ServerConnector:
17/07/11 14:15:32 WARN QueuedThreadPool:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.String.substring(String.java:1969)
17/07/11 14:15:32 ERROR Utils: uncaught error in thread Spark Context Cleaner, stopping SparkContext
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:179)
17/07/11 14:15:32 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver, [Lscala.Tuple2;@1e855db,BlockManagerId (driver, 192.168.175.160, 39913, None))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
17/07/11 14:15:32 ERROR Utils: throw uncaught fatal error in thread Spark Context Cleaner
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:179)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1245)
17/07/11 14:15:32 WARN QueuedThreadPool: Unexpected thread death: org.spark_project.jetty.util.thread.QueuedThreadPool$3@710104 in SparkUI{STARTED,8<=8<=200,i=5,q=0}
17/07/11 14:15:32 INFO JDBCRDD: closed connection
17/07/11 14:15:32 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 22)
java.lang.OutOfMemoryError: GC overhead limit exceeded
17/07/11 14:15:32 INFO SparkUI: Stopped Spark web UI at http://192.168.175.160:4040
17/07/11 14:15:32 INFO DAGScheduler: Job 0 failed: collect at stack.scala:93, took 294.365864 s
Exception in thread "main" org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:808)
17/07/11 14:15:32 INFO DAGScheduler: ShuffleMapStage 1 (map at stack.scala:85) failed in 294.165 s due to Stage cancelled because SparkContext was shut down
17/07/11 14:15:32 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@cfb906)
17/07/11 14:15:32 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(0,1499762732342,JobFailed(org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down))
17/07/11 14:15:32 ERROR SparkUncaughtExceptionHandler: [Container in shutdown] Uncaught exception in thread Thread[Executor task launch worker-1,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
17/07/11 14:15:32 INFO DiskBlockManager: Shutdown hook called
17/07/11 14:15:32 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/07/11 14:15:32 INFO ShutdownHookManager: Shutdown hook called
17/07/11 14:15:32 INFO MemoryStore: MemoryStore cleared
17/07/11 14:15:32 INFO BlockManager: BlockManager stopped
17/07/11 14:15:32 INFO BlockManagerMaster: BlockManagerMaster stopped
17/07/11 14:15:32 INFO ShutdownHookManager: Deleting directory /tmp/spark-0b2ea8bd-95c0-45e4-a1cc-bd62b3899b24
17/07/11 14:15:32 INFO ShutdownHookManager: Deleting directory /tmp/spark-0b2ea8bd-95c0-45e4-a1cc-bd62b3899b24/userFiles-194d73ba-fcfa-4616-ae17-78b0bba6b465
17/07/11 14:15:32 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
Spark 配置
虽然处于开发模式,但我正在使用 2g 内存和 1 个内核执行。我是 spark 的新手,很抱歉提出如此幼稚的问题。
谢谢!
最佳答案
首先,您正在启动两个非常无用的 SparkSession
,您只是在拆分 资源。所以不要那样做!
其次,这就是问题所在。关于 Apache Spark 的并行性和 jdbc
源存在误解(别担心,这是一个陷阱!)。
这主要是由于缺少文档。 (我最后一次检查过)
回到问题。实际发生的是以下行:
val destination = spark.read.jdbc(url, queryDestination, connectionProperties).rdd.map(_.mkString(","))
是它将读取委托(delegate)给单个工作人员。
所以主要是,如果您有足够的内存并且您成功读取了该数据。整个destination
数据将在一个分区 中。而一个分区就意味着麻烦!又名可能:
java.lang.OutOfMemoryError: GC overhead limit exceeded
所以发生的事情是,被选择用来获取数据的单个执行器不堪重负,它的 JVM 崩溃了。
现在让我们解决这个问题:
(免责声明:以下代码摘自 spark-gotchas ,我是其作者之一。)
所以让我们创建一些示例数据并将它们保存在我们的数据库中:
val options = Map(
"url" -> "jdbc:postgresql://127.0.0.1:5432/spark",
"dbtable" -> "data",
"driver" -> "org.postgresql.Driver",
"user" -> "spark",
"password" -> "spark"
)
val newData = spark.range(1000000)
.select($"id", lit(""), lit(true), current_timestamp())
.toDF("id", "name", "valid", "ts")
newData.write.format("jdbc").options(options).mode("append").save
Apache Spark 提供了两种用于通过 JDBC 加载分布式数据的方法。第一个使用整数列对数据进行分区:
val dfPartitionedWithRanges = spark.read.options(options)
.jdbc(options("url"), options("dbtable"), "id", 1, 5, 4, new java.util.Properties())
dfPartitionedWithRanges.rdd.partitions.size
// Int = 4
dfPartitionedWithRanges.rdd.glom.collect
// Array[Array[org.apache.spark.sql.Row]] = Array(
// Array([1,foo,true,2012-01-01 00:03:00.0]),
// Array([2,foo,false,2013-04-02 10:10:00.0]),
// Array([3,bar,true,2015-11-02 22:00:00.0]),
// Array([4,bar,false,2010-11-02 22:00:00.0]))
Partition column and bounds can provided using options as well:
val optionsWithBounds = options ++ Map(
"partitionColumn" -> "id",
"lowerBound" -> "1",
"upperBound" -> "5",
"numPartitions" -> "4"
)
spark.read.options(optionsWithBounds).format("jdbc").load
也可以使用选项提供分区列和边界:
val optionsWithBounds = options ++ Map(
"partitionColumn" -> "id",
"lowerBound" -> "1",
"upperBound" -> "5",
"numPartitions" -> "4"
)
spark.read.options(optionsWithBounds).format("jdbc").load
另一种选择是使用一系列谓词,但我不会在这里谈论它。
您可以阅读有关 Spark SQL 和 JDBC 源代码的更多信息 here以及其他一些问题。
希望对您有所帮助。
关于apache-spark - 读取大量数据集时 Spark 2.1 挂起,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/45027091/
目前正在学习 Spark 的类(class)并了解到执行者的定义: Each executor will hold a chunk of the data to be processed. Thisc
阅读了有关 http://spark.apache.org/docs/0.8.0/cluster-overview.html 的一些文档后,我有一些问题想要澄清。 以 Spark 为例: JavaSp
Spark核心中的调度器与以下Spark Stack(来自Learning Spark:Lightning-Fast Big Data Analysis一书)中的Standalone Schedule
我想在 spark-submit 或 start 处设置 spark.eventLog.enabled 和 spark.eventLog.dir -all level -- 不要求在 scala/ja
我有来自 SQL Server 的数据,需要在 Apache Spark (Databricks) 中进行操作。 在 SQL Server 中,此表的三个键列使用区分大小写的 COLLATION 选项
所有这些有什么区别和用途? spark.local.ip spark.driver.host spark.driver.bind地址 spark.driver.hostname 如何将机器修复为 Sp
我有大约 10 个 Spark 作业,每个作业都会进行一些转换并将数据加载到数据库中。必须为每个作业单独打开和关闭 Spark session ,每次初始化都会耗费时间。 是否可以只创建一次 Spar
/Downloads/spark-3.0.1-bin-hadoop2.7/bin$ ./spark-shell 20/09/23 10:58:45 WARN Utils: Your hostname,
我是 Spark 的完全新手,并且刚刚开始对此进行更多探索。我选择了更长的路径,不使用任何 CDH 发行版安装 hadoop,并且我从 Apache 网站安装了 Hadoop 并自己设置配置文件以了解
TL; 博士 Spark UI 显示的内核和内存数量与我在使用 spark-submit 时要求的数量不同 更多细节: 我在独立模式下运行 Spark 1.6。 当我运行 spark-submit 时
spark-submit 上的文档说明如下: The spark-submit script in Spark’s bin directory is used to launch applicatio
关闭。这个问题是opinion-based .它目前不接受答案。 想改善这个问题吗?更新问题,以便可以通过 editing this post 用事实和引文回答问题. 6 个月前关闭。 Improve
我想了解接收器如何在 Spark Streaming 中工作。根据我的理解,将有一个接收器任务在执行器中运行,用于收集数据并保存为 RDD。当调用 start() 时,接收器开始读取。需要澄清以下内容
有没有办法在不同线程中使用相同的 spark 上下文并行运行多个 spark 作业? 我尝试使用 Vertx 3,但看起来每个作业都在排队并按顺序启动。 如何让它在相同的 spark 上下文中同时运行
我们有一个 Spark 流应用程序,这是一项长期运行的任务。事件日志指向 hdfs 位置 hdfs://spark-history,当我们开始流式传输应用程序时正在其中创建 application_X
我们正在尝试找到一种加载 Spark (2.x) ML 训练模型的方法,以便根据请求(通过 REST 接口(interface))我们可以查询它并获得预测,例如http://predictor.com
Spark newb 问题:我在 spark-sql 中进行完全相同的 Spark SQL 查询并在 spark-shell . spark-shell版本大约需要 10 秒,而 spark-sql版
我正在使用 Spark 流。根据 Spark 编程指南(参见 http://spark.apache.org/docs/latest/programming-guide.html#accumulato
我正在使用 CDH 5.2。我可以使用 spark-shell 运行命令。 如何运行包含spark命令的文件(file.spark)。 有没有办法在不使用 sbt 的情况下在 CDH 5.2 中运行/
我使用 Elasticsearch 已经有一段时间了,但使用 Cassandra 的经验很少。 现在,我有一个项目想要使用 Spark 来处理数据,但我需要决定是否应该使用 Cassandra 还是
我是一名优秀的程序员,十分优秀!