
apache-spark - I have a huge HQL query that I am running through PySpark SQL, but I am getting errors such as "Bad connect ack with firstBadLink"


I know this has been asked before, but I am asking again because I am not sure it is the same problem. I am using Spark SQL, and I first create a table:

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)

sqlContext.sql("""drop table if exists test_table""")

sqlContext.sql(""" create external table test_table
.
.
.
.
.
.)
partitioned by (column_name datatype)
stored as textfile
location '/home/..../test_table'
""")

This table has around 400-500 columns, possibly more.

Then I run an insert overwrite that pulls the data from several very large tables with union all:

sqlContext.sql("""
insert overwrite table table_name
partition(`column_name`)
select
col1,
col2,
col3,
..
..
from table1
left join ... table2 on ...
left join ... table3
left join ... table_4
union all
select col1,
col2,
..
..
..
from table5
left join ... table6

.
.
.
union all
select
..
..
from table19
left join table18 ...
""")

Please advise.

EDIT


18/09/26 22:18:57 INFO scheduler.TaskSetManager: Lost task 590.9 in stage 67.0 (TID 25051) on #####, executor 3: java.io.IOException (Bad connect ack with firstBadLink as *****:1004) [duplicate 15]
18/09/26 22:18:57 INFO scheduler.TaskSetManager: Starting task 590.10 in stage 67.0 (TID 25161, *.com, executor 3, partition 590,NODE_LOCAL, 2348 bytes)
18/09/26 22:18:57 INFO scheduler.TaskSetManager: Starting task 531.10 in stage 67.0 (TID 25162, *.com, executor 13, partition 531,NODE_LOCAL, 2348 bytes)
18/09/26 22:18:57 INFO scheduler.TaskSetManager: Lost task 431.8 in stage 67.0 (TID 25066) on ***, executor 13: java.io.IOException (Bad connect ack with firstBadLink as *******:1004) [duplicate 25]
18/09/26 22:18:57 INFO scheduler.TaskSetManager: Starting task 431.9 in stage 67.0 (TID 25163, ****, executor 13, partition 431,NODE_LOCAL, 2348 bytes)
18/09/26 22:18:57 INFO scheduler.TaskSetManager: Lost task 443.9 in stage 67.0 (TID 25076) on ****, executor 13: java.io.IOException (Bad connect ack with firstBadLink as *****:1004) [duplicate 24]
18/09/26 22:18:57 INFO scheduler.TaskSetManager: Lost task 774.9 in stage 67.0 (TID 25058) on ****, executor 3: java.io.IOException (Bad connect ack with firstBadLink as *****:1004) [duplicate 9]
18/09/26 22:18:57 INFO scheduler.TaskSetManager: Starting task 774.10 in stage 67.0 (TID 25164, ****, executor 15, partition 774,NODE_LOCAL, 2348 bytes)
18/09/26 22:18:57 INFO scheduler.TaskSetManager: Lost task 790.9 in stage 67.0 (TID 25053) on ****, executor 3: java.io.IOException (Bad connect ack with firstBadLink as ******:1004) [duplicate 16]
18/09/26 22:18:57 INFO scheduler.TaskSetManager: Starting task 790.10 in stage 67.0 (TID 25165, ****, executor 15, partition 790,NODE_LOCAL, 2348 bytes)
18/09/26 22:18:57 INFO scheduler.TaskSetManager: Lost task 574.9 in stage 67.0 (TID 25061) on ****, executor 15: java.io.IOException (Bad connect ack with firstBadLink as *****:1004) [duplicate 17]
18/09/26 22:18:57 INFO scheduler.TaskSetManager: Starting task 574.10 in stage 67.0 (TID 25166, ****, executor 3, partition 574,NODE_LOCAL, 2348 bytes)
18/09/26 22:18:57 INFO scheduler.TaskSetManager: Starting task 433.9 in stage 67.0 (TID 25167, ****, executor 14, partition 433,NODE_LOCAL, 2348 bytes)
18/09/26 22:18:57 INFO scheduler.TaskSetManager: Lost task 419.9 in stage 67.0 (TID 25075) on ****, executor 14: java.io.IOException (Bad connect ack with firstBadLink as *****:1004) [duplicate 26]
18/09/26 22:18:57 INFO scheduler.TaskSetManager: Lost task 474.9 in stage 67.0 (TID 25054) on ****, executor 15: java.io.IOException (Bad connect ack with firstBadLink as ****:1004) [duplicate 10]
18/09/26 22:18:57 INFO scheduler.TaskSetManager: Starting task 474.10 in stage 67.0 (TID 25168, ****, executor 3, partition 474,NODE_LOCAL, 2348 bytes)
18/09/26 22:18:57 INFO scheduler.TaskSetManager: Starting task 436.10 in stage 67.0 (TID 25169, ****, executor 19, partition 436,NODE_LOCAL, 2348 bytes)
18/09/26 22:18:57 WARN scheduler.TaskSetManager: Lost task 411.8 in stage 67.0 (TID 25056, ****, executor 19): java.io.IOException: Bad connect ack with firstBadLink as ****:1004
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1643)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1541)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:683)

File "/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 580, in sql
return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
File "/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
18/09/26 22:19:09 WARN scheduler.TaskSetManager: Lost task 1210.4 in stage 67.0 (TID 25307, ****.com, executor 8): TaskKilled (killed intentionally)
return f(*a, **kw)
File "/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
format(target_id, ".", name), value)
18/09/26 22:19:09 WARN scheduler.TaskSetManager: Lost task 449.12 in stage 67.0 (TID 25300, ***.com, executor 14): TaskKilled (killed intentionally)
Py4JJavaError: An error occurred while calling o61.sql.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 403 in stage 67.0 failed 14 times, most recent failure: Lost task 403.13 in stage 67.0 (TID 25227, *******, executor 7): java.io.IOException: Bad connect ack with firstBadLink as ******:1004
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1643)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1541)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:683)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1642)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1590)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1844)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1857)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1934)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:84)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:201)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Bad connect ack with firstBadLink as ******:1004
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1643)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1541)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:683)

Best Answer

When you are writing out big data, the files that store it should be compressed, which means using Parquet rather than a textfile. I had the same error and solved it by switching to Parquet, like this:

...
partitioned by (column_name datatype)
stored as parquet
location '/home/..../test_table'
...
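
A minimal sketch of the full change, assuming dynamic partitioning as in the question's partition(column_name) clause; the column names are placeholders:

# dynamic partition inserts need these two Hive settings enabled first
sqlContext.sql("set hive.exec.dynamic.partition=true")
sqlContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")

# same DDL as before, but stored as parquet instead of textfile
sqlContext.sql("""
create external table test_table (
    col1 string,
    col2 int
)
partitioned by (load_date string)
stored as parquet
location '/home/..../test_table'
""")

Parquet stores data in a compressed columnar layout, so the same rows produce far fewer bytes written to HDFS than a plain textfile, which means far fewer block-write pipelines that can fail with this error.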

Give it a try!

Regarding "apache-spark - I have a huge HQL query that I am running through PySpark SQL, but I am getting errors such as Bad connect ack with firstBadLink", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52514474/
