
hadoop - Reducer stuck because of a dead host

Reposted. Author: 可可西里. Updated: 2023-11-01 15:07:14

I noticed that my reducer is stuck because a host died. The log shows many retry messages. Is it possible to tell the jobtracker to give up on the dead node and resume the work? There are 323 mappers and only 1 reducer. I am on hadoop-1.0.3.

2012-08-08 11:52:19,903 INFO org.apache.hadoop.mapred.ReduceTask: 192.168.1.23 Will be considered after: 65 seconds.
2012-08-08 11:53:19,905 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Need another 63 map output(s) where 0 is already in progress
2012-08-08 11:53:19,905 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Scheduled 0 outputs (1 slow hosts and0 dup hosts)
2012-08-08 11:53:19,905 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts:
2012-08-08 11:53:19,905 INFO org.apache.hadoop.mapred.ReduceTask: 192.168.1.23 Will be considered after: 5 seconds.
2012-08-08 11:53:29,906 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-08-08 11:53:47,907 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 copy failed: attempt_201207191440_0203_m_000001_0 from 192.168.1.23
2012-08-08 11:53:47,907 WARN org.apache.hadoop.mapred.ReduceTask: java.net.NoRouteToHostException: No route to host
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
at java.net.Socket.connect(Socket.java:546)
at sun.net.NetworkClient.doConnect(NetworkClient.java:173)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:409)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:530)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:240)
at sun.net.www.http.HttpClient.New(HttpClient.java:321)
at sun.net.www.http.HttpClient.New(HttpClient.java:338)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:935)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:876)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:801)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1618)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1575)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1483)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1394)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1326)

2012-08-08 11:53:47,907 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201207191440_0203_r_000000_0: Failed fetch #18 from attempt_201207191440_0203_m_000001_0
2012-08-08 11:53:47,907 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 adding host 192.168.1.23 to penalty box, next contact in 1124 seconds
2012-08-08 11:53:47,907 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0: Got 1 map-outputs from previous failures
2012-08-08 11:54:22,909 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Need another 63 map output(s) where 0 is already in progress
2012-08-08 11:54:22,909 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Scheduled 0 outputs (1 slow hosts and0 dup hosts)
2012-08-08 11:54:22,909 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts:
2012-08-08 11:54:22,909 INFO org.apache.hadoop.mapred.ReduceTask: 192.168.1.23 Will be considered after: 1089 seconds.

I left it alone; it retried for a while, then gave up on the dead host, re-ran the mappers, and succeeded. The cause was that the host had two IPs, and I had deliberately shut down the one Hadoop was using.

My question is whether there is a way to tell Hadoop to give up on the dead host without retrying.

Best Answer

From your log you can see that one of the tasktrackers that ran a map task has become unreachable. The tasktracker running the reducer tries to retrieve the map's intermediate results over HTTP, but it fails because the tasktracker holding those results is dead.

The default behavior on a tasktracker failure is this:

Map tasks that ran and completed successfully on the failed tasktracker are re-scheduled by the jobtracker if they belong to a still-incomplete job, because their intermediate output resides on the failed tasktracker's local filesystem and may no longer be accessible to the reduce task. Any in-progress tasks are also re-scheduled.

The problem is that if a task (map or reduce) fails too many times (four, I believe), it is no longer re-scheduled and the job fails. In your case, the map appears to have completed successfully, but the reducer could not connect to the mapper to retrieve the intermediate results. It tried four times, after which the job failed.
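The retry limit mentioned above is configurable in Hadoop 1.x. As a sketch, the following mapred-site.xml fragment sets the per-task attempt limits explicitly (these are the Hadoop 1.x property names; the values shown are the defaults):

```xml
<!-- Maximum attempts per map task before the job is declared failed (default 4). -->
<property>
  <name>mapred.map.max.attempts</name>
  <value>4</value>
</property>
<!-- Maximum attempts per reduce task before the job is declared failed (default 4). -->
<property>
  <name>mapred.reduce.max.attempts</name>
  <value>4</value>
</property>
```

Raising these values buys a stuck job more retries; it does not by itself make the jobtracker abandon a dead host sooner.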

A failed task cannot simply be ignored, because it is part of the job, and the job itself does not succeed unless every task it contains succeeds.

Try to find the URL the reducer is trying to reach and paste it into a browser to see what error you get.
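As a sketch of how to reconstruct that URL by hand: in Hadoop 1.x the reducer fetches map output from the TaskTracker's HTTP servlet (port 50060 by default, set by mapred.task.tracker.http.address). The values below are hypothetical, taken from the log excerpt above:

```shell
# Assumed values from the log above; adjust to your own job and attempt IDs.
TT_HOST=192.168.1.23
JOB=job_201207191440_0203
MAP_ATTEMPT=attempt_201207191440_0203_m_000001_0
REDUCE_PARTITION=0

# The TaskTracker's mapOutput servlet; 50060 is the default HTTP port.
URL="http://${TT_HOST}:50060/mapOutput?job=${JOB}&map=${MAP_ATTEMPT}&reduce=${REDUCE_PARTITION}"
echo "$URL"
# curl -v "$URL"   # uncomment to attempt the fetch against a live cluster
```

If the host is truly dead, curl will report the same "No route to host" error seen in the reducer's log.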

You can also blacklist a node and exclude it entirely from the list of nodes Hadoop uses:

In conf/mapred-site.xml:

  <property>
    <name>mapred.hosts.exclude</name>
    <value>/full/path/of/host/exclude/file</value>
  </property>

Then tell the jobtracker to re-read the node list:

  bin/hadoop mradmin -refreshNodes
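Putting the two steps together, a minimal sketch of the blacklist workflow (the exclude-file path here is an assumption; it must match whatever you configured in mapred.hosts.exclude):

```shell
# Assumed path; must match the value of mapred.hosts.exclude in mapred-site.xml.
EXCLUDE_FILE="${EXCLUDE_FILE:-/tmp/mapred.exclude}"

# One hostname or IP per line.
echo "192.168.1.23" >> "$EXCLUDE_FILE"

# Then have the JobTracker re-read the list (run from the Hadoop home directory):
# bin/hadoop mradmin -refreshNodes
```

After the refresh, the jobtracker stops scheduling tasks on the excluded node, so no reducer will try to fetch from it.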

Regarding "hadoop - Reducer stuck because of a dead host", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/11871293/
