
python - EMR slave bootstrap fails in the node provisioner after the bootstrap action succeeds

Reposted. Author: 太空宇宙. Updated: 2023-11-04 05:06:31

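I am trying to launch a cluster in AWS using EMR with Spark. I have a bash bootstrap script that installs some Python packages, downloads credentials, and applies some configuration. The bootstrap action succeeds on the master but fails on the slaves. The only hint about the error is "i-#####: failed to start. bootstrap action 2 failed with non-zero exit code". The message immediately before it is "i-#####: bootstrap action 1 completed". (In both cases the instance ID is that of the slave. The master also reports success for bootstrap action 1.)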

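So it appears that the last command executed in bootstrap action 2 hit an error and caused the bootstrap script to return a non-zero exit code. However, I have only configured one bootstrap action. Do non-master nodes automatically run an additional bootstrap action?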

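No log shows what the actual error is. I have looked at the bootstrap logs on S3 (which do not show up reliably) and tried tailing the logs in /var/log/bootstrap-actions/ on both a slave and the master during startup.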

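I am pretty sure the error is not in my script (said every developer ever...). I was able to create a vanilla EMR cluster with no bootstrap action, wait for it to come up, log in, and run my bootstrap script as user hadoop with no errors. I also checked the last few commands (a grep and an echo) and verified that they do not return a non-zero exit code and do not cause the script to return one.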
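
The exit-code check described above can be reproduced locally. What follows is a minimal sketch, not the script from this post: it assumes `sh` and `grep` are on the PATH, and `run_script` is a hypothetical helper name. It shows that a shell script's exit code is the exit code of its last command, which is why a trailing grep with no match would fail a bootstrap action while a trailing echo would not.

```python
import subprocess


def run_script(script_text):
    """Run a shell script body with sh and return its exit code (hypothetical helper)."""
    result = subprocess.run(
        ["sh", "-c", script_text],
        capture_output=True,
        text=True,
    )
    return result.returncode


# grep matches and echo is the last command, so the script exits with 0.
ok = run_script("printf 'spark\\n' | grep -q spark\necho done")

# grep is the last command and finds no match, so the script exits with 1.
bad = run_script("echo start\nprintf 'spark\\n' | grep -q hadoop")

print("ends with echo:", ok)
print("ends with failing grep:", bad)
```

If the last command can legitimately fail, ending the script with an explicit `exit 0` is one way to keep its status from leaking into the bootstrap result.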

I think the problem must lie in some mysterious second bootstrap action. Is that the case? How can I determine the error?

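Update
I logged into a slave node during startup. I found the bootstrap actions in /emr/instance-controller/lib/bootstrap-actions. There was only one subfolder, containing my bootstrap script. Then I ran
tail -f /emr/instance-controller/log/instance-controller.log. I verified that my script had started. After about 15 status-check cycles (15 minutes), I saw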

2017-06-02 13:44:30,173 INFO InstanceConfigurer: Script 1 - Execution succeeded


Then I saw another AWS script start, and this appears to be the one that fails.

2017-06-02 13:44:30,181 INFO InstanceConfigurer: Running provision-node, with id 5aed1c54-4210-4387-944a-4fdbbce6dc8d
2017-06-02 13:44:30,188 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - Fetching file '/var/lib/aws/emr/provision-node'
2017-06-02 13:44:30,188 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - startExec '/var/lib/aws/emr/provision-node'
2017-06-02 13:44:30,189 INFO InstanceConfigurer: startExec '/var/lib/aws/emr/provision-node'
2017-06-02 13:44:30,190 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - Environment:
...
2017-06-02 13:44:54,201 INFO InstanceConfigurer: Output from command '/var/lib/aws/emr/provision-node':
stdout:
stderr:

2017-06-02 13:44:54,202 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - waitProcessCompletion ended with exit code 255 : /var/lib/aws/emr/provision-node
2017-06-02 13:44:54,202 INFO InstanceConfigurer: waitProcessCompletion ended with exit code 255 : /var/lib/aws/emr/provision-node
2017-06-02 13:44:54,203 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - total process run time: 24 seconds
2017-06-02 13:44:54,203 INFO InstanceConfigurer: total process run time: 24 seconds
2017-06-02 13:44:54,217 ERROR InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - Execution for /var/lib/aws/emr/provision-node failed with code '255'
2017-06-02 13:44:54,219 ERROR InstanceConfigurer: Startup failed with
aws157.instancecontroller.common.model.InstanceConfiguratorException: Source: PROVISION_NODE | ErrorCode: SCRIPT_EXECUTION_FAILED_CODE | Execution for /var/lib/aws/emr/provision-node failed with code '255'
at aws157.instancecontroller.common.InstanceConfigurator.runScript(InstanceConfigurator.java:563)
at aws157.instancecontroller.common.InstanceConfigurator.provisionNode(InstanceConfigurator.java:225)
at aws157.instancecontroller.common.InstanceConfigurator.doDistributionConfigure(InstanceConfigurator.java:201)
at aws157.instancecontroller.common.InstanceConfigurator.access$200(InstanceConfigurator.java:70)
at aws157.instancecontroller.common.InstanceConfigurator$1.run(InstanceConfigurator.java:251)
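
To pick out which script failed without reading the whole log, the per-script outcome lines can be filtered. This is a rough sketch that assumes the line format seen in the excerpts above; `summarize` and the regular expression are illustrative names and patterns, not part of any EMR tooling.

```python
import re

# Assumed format, based on the excerpts above:
# <timestamp> <LEVEL> InstanceConfigurer: Script <id> - <message>
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) InstanceConfigurer: Script (?P<id>\S+) - (?P<msg>.*)$"
)
FAILED = re.compile(r"failed with code '(\d+)'")


def summarize(log_text):
    """Map each script id to its final outcome as reported in the log."""
    results = {}
    for raw in log_text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue
        msg = match["msg"]
        if msg == "Execution succeeded":
            results[match["id"]] = "succeeded"
        else:
            failed = FAILED.search(msg)
            if failed:
                results[match["id"]] = "failed with exit code " + failed.group(1)
    return results


SAMPLE = """\
2017-06-02 13:44:30,173 INFO InstanceConfigurer: Script 1 - Execution succeeded
2017-06-02 13:44:30,181 INFO InstanceConfigurer: Running provision-node, with id 5aed1c54-4210-4387-944a-4fdbbce6dc8d
2017-06-02 13:44:54,203 INFO InstanceConfigurer: total process run time: 24 seconds
2017-06-02 13:44:54,217 ERROR InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - Execution for /var/lib/aws/emr/provision-node failed with code '255'
"""

for script_id, outcome in summarize(SAMPLE).items():
    print(script_id, "->", outcome)
```

On the sample lines this reports script 1 as succeeded and the provision-node script as failed with exit code 255, which matches the reading that "bootstrap action 2" is the provision-node step rather than a user script.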


I am not familiar with that /var/lib/aws/emr/provision-node script, but its only contents are

#!/bin/bash
set -ex

sudo /usr/share/aws/emr/node-provisioner/bin/provision-node "$@"


Looking at /usr/share/aws/emr/node-provisioner/bin/provision-node, I can see that the script does a lot of work to determine the path of $EMR_NODE_PROVISIONER_HOME and then runs the following Java class from there

java -classpath '/usr/share/aws/emr/node-provisioner/lib/*' com.amazonaws.emr.node.provisioner.Program --phase hadoop _UUID_

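I worked this out by reading the source of the provision-node script and running it standalone. I was not able to capture the logs or the failure live to see what went wrong. When I ran it on its own, I got the exception below. But I assume that is because I passed garbage data instead of the UUID (I do not know where the UUID comes from, and it is different for each slave launch).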

2017-06-02 14:55:13,593 ERROR main: Encountered a problem while provisioning
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.amazonaws.emr.node.provisioner.http.JsonHttpClient.doRequest(JsonHttpClient.java:49)
at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:38)
at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:31)
at com.amazonaws.emr.node.provisioner.bigtop.config.PlatformContextProvider.provide(PlatformContextProvider.java:32)
at com.amazonaws.emr.node.provisioner.phase.PhaseWorkflow.work(PhaseWorkflow.java:51)
at com.amazonaws.emr.node.provisioner.phase.ProvisionHadoopPhase.perform(ProvisionHadoopPhase.java:21)
at com.amazonaws.emr.node.provisioner.Program.main(Program.java:20)


So my question now is: what is com.amazonaws.emr.node.provisioner.Program, and why is it failing (or how can I find out why)?

Update 2

I managed to tail the output of /usr/share/aws/emr/node-provisioner/bin/provision-node all the way to the failure, and the result is the same as in the standalone run above.

java -classpath '/usr/share/aws/emr/node-provisioner/lib/*' com.amazonaws.emr.node.provisioner.Program --phase hadoop
2017-06-02 17:05:37,869 ERROR main: Encountered a problem while provisioning
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.amazonaws.emr.node.provisioner.http.JsonHttpClient.doRequest(JsonHttpClient.java:49)
at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:38)
at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:31)
at com.amazonaws.emr.node.provisioner.bigtop.config.PlatformContextProvider.provide(PlatformContextProvider.java:32)
at com.amazonaws.emr.node.provisioner.phase.PhaseWorkflow.work(PhaseWorkflow.java:51)
at com.amazonaws.emr.node.provisioner.phase.ProvisionHadoopPhase.perform(ProvisionHadoopPhase.java:21)
at com.amazonaws.emr.node.provisioner.Program.main(Program.java:20)


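My guess is that this could be a firewall/security group issue, but I am using the default security groups generated by EMR, so I would expect the ports to be open. I am building this cluster in a private subnet of a VPC, so that could be a problem. However, when I build the cluster without the bootstrap action, this failure does not occur. My next debugging step is to build a vanilla cluster without the bootstrap action and watch this same command.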

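Update 3
Confirmed: with no network changes, a vanilla EMR cluster with Spark deploys successfully. There are no errors from /usr/share/aws/emr/node-provisioner/bin/provision-node. After the java command starts, the next line on stderr shows a JSON dump of the platform configuration parameters. stdout, however, shows yum packages being installed from the Bigtop repo. I do not see any yum commands in the script or in the stderr output (from set -xe), so I assume the yum commands must be issued inside that Java program. I am not sure why they succeed here but not with the bootstrap action.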

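My private VPC does have an S3 endpoint with subnet routing, plus firewall rules that allow access to the endpoint plist. My bootstrap script is able to install packages with yum (not from the Bigtop repo), copy files from S3, and download code from external git repos on the Internet.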

Best answer

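My bootstrap script was running yum update. When I commented that out, I got past the provision-node script and the cluster eventually reached the waiting state. One of the updates must have been causing some kind of conflict or other problem. I do not know which one. For now, I am simply avoiding yum update.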

Here is the yum log. I am guessing it is not one of the R or mysql packages. Maybe Java, the kernel, aws, or util-linux?

Installed:
kernel.x86_64 0:4.9.27-14.31.amzn1

Updated:
R.x86_64 0:3.3.3-1.51.amzn1
R-core.x86_64 0:3.3.3-1.51.amzn1
R-core-devel.x86_64 0:3.3.3-1.51.amzn1
R-devel.x86_64 0:3.3.3-1.51.amzn1
R-java.x86_64 0:3.3.3-1.51.amzn1
R-java-devel.x86_64 0:3.3.3-1.51.amzn1
aws-amitools-ec2.noarch 0:1.5.13-0.2.amzn1
aws-cli.noarch 0:1.11.83-1.46.amzn1
java-1.8.0-openjdk.x86_64 1:1.8.0.131-2.b11.30.amzn1
java-1.8.0-openjdk-devel.x86_64 1:1.8.0.131-2.b11.30.amzn1
java-1.8.0-openjdk-headless.x86_64 1:1.8.0.131-2.b11.30.amzn1
libRmath.x86_64 0:3.3.3-1.51.amzn1
libRmath-devel.x86_64 0:3.3.3-1.51.amzn1
libblkid.x86_64 0:2.23.2-33.28.amzn1
libmount.x86_64 0:2.23.2-33.28.amzn1
libuuid.x86_64 0:2.23.2-33.28.amzn1
mysql-config.x86_64 0:5.5.56-1.17.amzn1
mysql55.x86_64 0:5.5.56-1.17.amzn1
mysql55-devel.x86_64 0:5.5.56-1.17.amzn1
mysql55-libs.x86_64 0:5.5.56-1.17.amzn1
ntp.x86_64 0:4.2.6p5-44.34.amzn1
ntpdate.x86_64 0:4.2.6p5-44.34.amzn1
python27-botocore.noarch 0:1.5.46-1.63.amzn1
python27-jmespath.noarch 0:0.9.2-1.12.amzn1
util-linux.x86_64 0:2.23.2-33.28.amzn1
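
To narrow down the candidates, the yum summary can be parsed and filtered by package name. The sketch below assumes the "Installed:" / "Updated:" layout shown above; `parse_yum_summary`, `suspects`, and the prefix list (taken from the guess in this answer) are illustrative, and the result is only a shortlist for testing, not a diagnosis.

```python
# Prefixes taken from the guess above: Java, the kernel, aws, or util-linux.
SUSPECT_PREFIXES = ("java", "kernel", "aws", "util-linux")


def parse_yum_summary(text):
    """Group package names under their section header, e.g. 'Installed' or 'Updated'."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and " " not in line:
            # Section header such as "Updated:"
            current = line[:-1]
            sections[current] = []
        elif current is not None:
            # Package line such as "aws-cli.noarch 0:1.11.83-1.46.amzn1"
            name_arch = line.split()[0]
            sections[current].append(name_arch.rsplit(".", 1)[0])
    return sections


def suspects(sections):
    """Return the sorted package names that start with one of the suspect prefixes."""
    return sorted(
        name
        for names in sections.values()
        for name in names
        if name.startswith(SUSPECT_PREFIXES)
    )


SAMPLE = """\
Installed:
kernel.x86_64 0:4.9.27-14.31.amzn1

Updated:
R.x86_64 0:3.3.3-1.51.amzn1
aws-cli.noarch 0:1.11.83-1.46.amzn1
java-1.8.0-openjdk.x86_64 1:1.8.0.131-2.b11.30.amzn1
mysql55.x86_64 0:5.5.56-1.17.amzn1
util-linux.x86_64 0:2.23.2-33.28.amzn1
"""

print(suspects(parse_yum_summary(SAMPLE)))
```

A shortlist like this could then be tried one package at a time in the bootstrap script, for example by updating everything except a single suspect, to see which update breaks the provision-node step.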


Further insights are welcome. Otherwise, on to actually running my code.

Regarding "python - EMR slave bootstrap fails in the node provisioner after the bootstrap action succeeds", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44329998/
