
apache-spark - Timeout error when writing to HDFS with .saveAsTable after the data has been written


I'm running Spark 2.3 on EMR and trying to write data to HDFS from Scala, as follows:

dataframe.write
  .partitionBy("column1")
  .bucketBy(1, "column2")
  .sortBy("column2")
  .mode("overwrite")
  .format("parquet")
  .option("path", "hdfs:///destination/")
  .saveAsTable("result")

The data is written and the tasks complete, but then I get a timeout error. After the error occurs, I can see the fully processed data in HDFS.

Why does this error occur, and does it mean anything?

It looks like the master node is trying to reach another IP on port 8020 (the HDFS NameNode RPC port) that doesn't match any node in the cluster, even though the data is already in HDFS.

Note that the error does not occur when writing with .save("hdfs:///location/") or .save("s3://bucket/folder/"); it happens only with the saveAsTable method. I need saveAsTable for bucketing and sorting.
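
For comparison, a minimal sketch of the path-only variant that succeeds (same dataframe and column names as above). It bypasses the Hive metastore entirely, which is why no timeout occurs, but Spark does not support bucketBy/sortBy together with .save, so it is not a substitute here:

// Sketch of the metastore-free write mentioned above: the same partitioned
// Parquet output to a path, but without bucketing or sorting, since Spark
// only supports bucketed output through saveAsTable.
dataframe.write
  .partitionBy("column1")
  .mode("overwrite")
  .format("parquet")
  .save("hdfs:///location/")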

A snippet of the error log follows:

18/07/23 16:33:31 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`result` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
18/07/23 16:35:32 ERROR log: Got exception: org.apache.hadoop.net.ConnectTimeoutException Call From ip-master_node_ip/master.node.ip to other_ip.ec2.internal:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=other_ip.ec2.internal/other_ip:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
org.apache.hadoop.net.ConnectTimeoutException: Call From ip-master_node_ip/master.node.ip to other_ip.ec2.internal:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=other_ip.ec2.internal/other_ip:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=other_ip.ec2.internal/other_ip:8020]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:685)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:788)
at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:410)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1550)
at org.apache.hadoop.ipc.Client.call(Client.java:1381)
... 110 more
18/07/23 16:35:32 ERROR log: Converting exception to MetaException
org.apache.hadoop.net.ConnectTimeoutException: Call From ip-master_node_ip/master.node.ip to other_ip.ec2.internal:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=other_ip.ec2.internal/other_ip:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout

... 49 elided
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=other_ip.ec2.internal/other_ip:8020]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:685)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:788)

For reference, I tried the solution posted here, but I still get the error when specifying the master node IP in the path (hdfs:///master_node_ip:8020/location/).

Best Answer

If your EMR cluster is configured to use the AWS Glue Data Catalog as its default metastore and the database does not exist there, you will see this timeout. Either remove that configuration or create the missing database, as in the sketch after the configuration block below.

Classification: hive-site
Property: hive.metastore.client.factory.class
Value: com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
Source: Cluster configuration
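
As a minimal sketch of the second option (assuming a live SparkSession named spark, and using `default` from the `default`.`result` table in the log above), you could confirm which metastore factory is in effect and then create the missing database before retrying the write:

// Hypothetical check from the Spark shell: is the Glue client factory from
// the classification above actually set on this cluster?
val factory = spark.sparkContext.hadoopConfiguration
  .get("hive.metastore.client.factory.class", "<not set: local Hive metastore>")
println(s"hive.metastore.client.factory.class = $factory")

// The failing table was `default`.`result`, so make sure that database
// exists in the catalog before calling saveAsTable again.
spark.sql("CREATE DATABASE IF NOT EXISTS default")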

Regarding apache-spark - Timeout error when writing to HDFS with .saveAsTable after the data has been written, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51484232/
