
hadoop - Pig reduce job stuck at 50% when running a GROUP command


I loaded a file containing about 6,000 rows of data with the following statements:

A = load '/home/hduser/hdfsdrive/piginput/data/airlines.dat' using PigStorage(',') as (Airline_ID:int, Name:chararray, Alias:chararray, IATA:chararray, ICAO:chararray, Callsign:chararray, Country:chararray, Active:chararray);
B = foreach A generate Country,Airline_ID;
C = group B by Country;
D = foreach C generate group,COUNT(B);
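
For reference, the GROUP plus COUNT pipeline should give one (country, count) tuple per country, so dumping D would print records like the following (counts invented purely for illustration):

(Canada,318)
(India,14)
(United States,1099)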

In the script above, the first three statements execute without any problem, but the fourth runs for a very long time. I also tried:

dump C;

and even this gets stuck at the same point. Here is the log:

2016-04-20 16:08:16,617 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2016-04-20 16:08:16,898 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
2016-04-20 16:08:17,125 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
2016-04-20 16:08:17,129 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1da9647b
2016-04-20 16:08:17,190 INFO org.apache.hadoop.mapred.ReduceTask: ShuffleRamManager: MemoryLimit=130652568, MaxSingleShuffleLimit=32663142
2016-04-20 16:08:17,195 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Thread started: Thread for merging on-disk files
2016-04-20 16:08:17,195 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Thread started: Thread for merging in memory files
2016-04-20 16:08:17,195 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Thread waiting: Thread for merging on-disk files
2016-04-20 16:08:17,196 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Need another 1 map output(s) where 0 is already in progress
2016-04-20 16:08:17,196 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Thread started: Thread for polling Map Completion Events
2016-04-20 16:08:17,196 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2016-04-20 16:08:22,197 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2016-04-20 16:09:18,202 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Need another 1 map output(s) where 1 is already in progress
2016-04-20 16:09:18,203 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2016-04-20 16:10:18,208 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Need another 1 map output(s) where 1 is already in progress
2016-04-20 16:10:18,208 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2016-04-20 16:11:18,214 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Need another 1 map output(s) where 1 is already in progress
2016-04-20 16:11:18,214 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2016-04-20 16:11:22,395 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 copy failed: attempt_201604201138_0003_m_000000_0 from ubuntu
2016-04-20 16:11:22,396 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
    at sun.net.www.http.HttpClient.New(HttpClient.java:308)
    at sun.net.www.http.HttpClient.New(HttpClient.java:326)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:998)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:934)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:852)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1636)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1593)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1493)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1401)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1333)
2016-04-20 16:11:22,398 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201604201138_0003_r_000000_0: Failed fetch #1 from attempt_201604201138_0003_m_000000_0
2016-04-20 16:11:22,398 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 adding host ubuntu to penalty box, next contact in 12 seconds
2016-04-20 16:11:22,398 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0: Got 1 map-outputs from previous failures
2016-04-20 16:11:37,399 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2016-04-20 16:12:19,403 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Need another 1 map output(s) where 1 is already in progress
2016-04-20 16:12:19,403 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)

I even stopped all the jobs and tried restarting them, but it did not help. My environment is Ubuntu / Hadoop 1.2.1 / Pig 0.15.0.

Please help.

Thanks, Sathish

Best Answer

I solved this issue. The problem was an incorrect IP address configured in /etc/hosts. I updated it to the IP address actually assigned to the Ubuntu machine and restarted the Hadoop services. I spotted the mismatch in hadoop-hduser-jobtracker-ubuntu.log, which reported:

STARTUP_MSG:   host = ubuntu/10.1.0.249

while hadoop-hduser-datanode-ubuntu.log kept throwing this error:

2016-04-25 12:23:05,738 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 10.1.6.173/10.1.6.173:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
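
So the JobTracker had resolved the hostname ubuntu to 10.1.0.249, while the DataNode was trying to reach the NameNode at 10.1.6.173:9000, meaning the same single-node machine was known under two different addresses. A quick way to spot this kind of mismatch (a sketch, assuming the hostname is ubuntu) is to compare what the resolver returns for the hostname with the address actually bound to the network interface:

getent hosts ubuntu   # what /etc/hosts resolves the hostname to
hostname -I           # the address(es) actually assigned to the machine

If the two outputs disagree, the /etc/hosts entry is stale.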

Based on these errors, I was able to trace the problem to the IP address, fix the entry in the /etc/hosts file, and restart the server. After that, all Hadoop jobs ran normally, and I could load the data and run the Pig script without any issue.
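
For illustration, the corrected /etc/hosts would look something like the following (a sketch: 10.1.6.173 is assumed here to be the machine's real address, as suggested by the DataNode log):

127.0.0.1    localhost
10.1.6.173   ubuntu

A related pitfall on Ubuntu is the default "127.0.1.1 <hostname>" line, which single-node Hadoop setup guides commonly recommend removing, since daemons bound to it can advertise an address that map-output fetches cannot reach.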

Thanks, Sathish.

Regarding "hadoop - Pig reduce job stuck at 50% when running a GROUP command", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36742833/
