
apache-spark - Scheduled Spark job on a Databricks cluster fails intermittently after a few runs

Reposted. Author: 行者123. Updated: 2023-12-05 07:17:36

Current setup: an Azure Data Factory pipeline is scheduled to run every 15 minutes and executes several Databricks notebooks on an always-on interactive Databricks cluster.

The problem: the pipeline fails after 4-5 runs due to a Spark driver issue. There are no collect() statements that could fill up driver memory. The error logs show the problem occurs while the driver is writing information to the internal metastore (managed automatically by Databricks). That thread exceeds the GC overhead limit and triggers Full GC; as a result the driver is killed and the notebook run fails.
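To make the driver's heap pressure visible before the crash, GC logging and a heap dump on OOM can be enabled on the cluster. A minimal sketch of cluster Spark config entries (set in the Databricks cluster's Spark config field); the flags target the Java 8 runtime shown in the logs, and the dump path is an illustrative assumption:

```
spark.driver.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/driver-oom.hprof
```

The resulting GC log lines appear in the driver's stdout, which should show heap occupancy climbing across the 4-5 pipeline runs if the driver is accumulating state.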

Here are the logs:

19/11/06 04:56:47 ERROR DatabricksMain$DBUncaughtExceptionHandler: Uncaught exception in thread db-atomic-read-worker-5095!
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
at java.lang.StringBuilder.append(StringBuilder.java:190)
at java.io.ObjectInputStream$BlockDataInputStream.readUTFSpan(ObjectInputStream.java:3506)
at java.io.ObjectInputStream$BlockDataInputStream.readUTFBody(ObjectInputStream.java:3414)
at java.io.ObjectInputStream$BlockDataInputStream.readUTF(ObjectInputStream.java:3226)
at java.io.ObjectInputStream.readString(ObjectInputStream.java:1905)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1564)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at java.util.Hashtable.readObject(Hashtable.java:1213)
at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1170)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2178)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at org.apache.commons.lang3.SerializationUtils.clone(SerializationUtils.java:94)
at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:370)
at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:366)
at java.lang.ThreadLocal$ThreadLocalMap.<init>(ThreadLocal.java:391)
at java.lang.ThreadLocal$ThreadLocalMap.<init>(ThreadLocal.java:298)
at java.lang.ThreadLocal.createInheritedMap(ThreadLocal.java:255)
at java.lang.Thread.init(Thread.java:420)
at java.lang.Thread.init(Thread.java:349)
at java.lang.Thread.<init>(Thread.java:511)
at sun.security.ssl.SSLSocketImpl$NotifyHandshakeThread.<init>(SSLSocketImpl.java:2675)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1096)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
19/11/06 04:56:47 ERROR DatabricksMain$DBUncaughtExceptionHandler: OutOfMemoryError in thread db-atomic-read-worker-5095! Killing thread now.
19/11/06 04:56:47 WARN TrapExitSecurityManager: Called "System.exit(15)" in db-atomic-read-worker-5095!
Stack Trace:
java.lang.Thread.getStackTrace(Thread.java:1559)
com.databricks.backend.daemon.driver.TrapExitSecurityManager.checkExit(DriverLocal.scala:686)
java.lang.Runtime.halt(Runtime.java:273)
com.databricks.DatabricksMain$DBUncaughtExceptionHandler.uncaughtException(DatabricksMain.scala:363)
java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1057)
java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1052)
java.lang.Thread.dispatchUncaughtException(Thread.java:1959)

19/11/06 04:56:47 WARN TrapExitSecurityManager: Allowed to exit because this is OOM!
19/11/06 04:56:52 INFO StaticConf$: DB_HOME: /databricks
19/11/06 04:56:53 INFO DriverDaemon$: ========== driver starting up ==========
19/11/06 04:56:53 INFO DriverDaemon$: Java: Private Build 1.8.0_222
19/11/06 04:56:53 INFO DriverDaemon$: OS: Linux/amd64 4.15.0-1050-azure
19/11/06 04:56:53 INFO DriverDaemon$: CWD: /databricks/driver

Connection issues with the unmanaged metastore:

Current allocation: Map(1414820437514047686 -> 1, 289483405015881873 -> 175)
Ideal allocation: Map(1414820437514047686 -> 88, 289483405015881873 -> 88)
Starved pools: Map(1414820437514047686 -> 98.420017518)
19/11/06 04:55:37 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 588 to 10.139.64.20:49530
19/11/06 04:55:29 ERROR BoneCP: Failed to acquire connection to jdbc:mariadb://consolidated-westeurope-prod-metastore-addl-1.mysql.database.azure.com:3306/organization4787651615040525?trustServerCertificate=true&useSSL=true. Sleeping for 7000 ms. Attempts left: 5
java.sql.SQLNonTransientConnectionException: Could not connect to consolidated-westeurope-prod-metastore-addl-1.mysql.database.azure.com:3306 : Connection reset
at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.get(ExceptionMapper.java:161)
at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.getException(ExceptionMapper.java:106)
at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connectWithoutProxy(AbstractConnectProtocol.java:1036)
at org.mariadb.jdbc.internal.util.Utils.retrieveProxy(Utils.java:490)
at org.mariadb.jdbc.MariaDbConnection.newConnection(MariaDbConnection.java:144)
at org.mariadb.jdbc.Driver.connect(Driver.java:90)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
at com.jolbox.bonecp.BoneCP.obtainInternalConnection(BoneCP.java:269)
at com.jolbox.bonecp.ConnectionHandle.<init>(ConnectionHandle.java:242)
at com.jolbox.bonecp.PoolWatchThread.fillConnections(PoolWatchThread.java:115)
at com.jolbox.bonecp.PoolWatchThread.run(PoolWatchThread.java:82)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.sql.SQLNonTransientConnectionException: Could not connect to consolidated-westeurope-prod-metastore-addl-1.mysql.database.azure.com:3306 : Connection reset
at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.get(ExceptionMapper.java:161)
at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.connException(ExceptionMapper.java:79)
at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.handleConnectionPhases(AbstractConnectProtocol.java:724)
at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connect(AbstractConnectProtocol.java:402)
at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connectWithoutProxy(AbstractConnectProtocol.java:1032)
... 13 more
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
at sun.security.ssl.SSLSocketImpl.waitForClose(SSLSocketImpl.java:1761)
at sun.security.ssl.HandshakeOutStream.flush(HandshakeOutStream.java:124)
at sun.security.ssl.Handshaker.kickstart(Handshaker.java:1079)
at sun.security.ssl.SSLSocketImpl.kickstartHandshake(SSLSocketImpl.java:1479)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1346)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.handleConnectionPhases(AbstractConnectProtocol.java:676)
... 15 more
19/11/06 04:55:37 WARN PreemptionMonitor: Preempted 43/43 tasks from 289483405015881873.
19/11/06 04:55:53 WARN PreemptionMonitor: Attempting to preempt 43 tasks from overallocated pools.
19/11/06 04:55:53 INFO PreemptionMonitor: Current allocation state:
Current max parallelism: 176

I would appreciate answers to any of these questions:

1- Are there any Spark job / Databricks cluster parameters I can tune to avoid this kind of driver failure?

2- How can I avoid the memory buildup caused by the daemon that connects to the metastore, e.g. by triggering a local GC to free memory after each job submission?

3- Where can I see/control this unmanaged metastore?

Best answer

I ran into a similar problem. In my case, the driver node was running out of memory.

In my Spark logs, the error was preceded by Full GC logging.

By the way, I am also using Azure Databricks.
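If the driver heap is the bottleneck, two levers can help. A sketch of cluster Spark config entries, with the caveat that the values are illustrative assumptions, not tested settings for this workload:

```
spark.driver.extraJavaOptions -XX:+UseG1GC
spark.cleaner.periodicGC.interval 15min
```

`spark.cleaner.periodicGC.interval` makes the Spark context cleaner trigger a GC on the driver periodically (the default is 30 minutes), which helps reclaim references accumulated across repeated runs on an always-on cluster and speaks to question 2 above. Note that on Databricks the driver's heap size is derived from the driver node type rather than from `spark.driver.memory`, so the usual way to get more driver memory is to choose a larger driver node type in the cluster configuration.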

Regarding apache-spark - Scheduled Spark job on a Databricks cluster fails intermittently after a few runs, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58725934/
