
python - AWS-EMR error exit code 143


I am running an analysis on AWS EMR, and I am getting an unexpected SIGTERM error.

Some background:

I am running a script that reads in many csv files I have stored on S3 and then performs an analysis. A schematic of my script looks like this:

analysis_script.py

import pandas as pd
from pyspark.sql import SQLContext, DataFrame
from pyspark.sql.types import *
from pyspark import SparkContext
import boto3

#Spark context
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

df = sqlContext.read.csv("s3n://csv_files/*", header = True)

def analysis(df):
    # do a bunch of stuff; create the output dataframe
    return df_output

df_output = analysis(df)
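
The stderr log further down points at a toPandas call inside the analysis step (toPandas at analysis_script.py:267). As a rough sketch of that step (the aggregation here is hypothetical; only the toPandas call is confirmed by the logs), note that toPandas() collects the entire distributed result into driver memory, which makes it a common point of failure for large jobs:

def analysis(df):
    # hypothetical aggregation standing in for the real analysis logic
    aggregated = df.groupBy("some_column").count()
    # toPandas() pulls every row back to the driver as a pandas DataFrame;
    # this is the call the failed job (Job 1494) was executing
    df_output = aggregated.toPandas()
    return df_output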

I launch the cluster using:

aws emr create-cluster \
--release-label emr-5.5.0 \
--name "Analysis" \
--applications Name=Hadoop Name=Hive Name=Spark Name=Ganglia \
--ec2-attributes KeyName=EMRB,InstanceProfile=EMR_EC2_DefaultRole \
--service-role EMR_DefaultRole \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r3.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=r3.xlarge \
--region us-west-2 \
--log-uri s3://emr-logs/ \
--bootstrap-actions Name="Install Python Packages",Path="s3://emr-bootstraps/install_python_packages_custom.bash",Args=["numpy pandas boto3 tqdm"] \
--auto-terminate

I can see from the logs that the csv files are read in without issue, but then the job ends with an error. The following lines appear in the stderr file:

18/07/16 12:02:26 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
18/07/16 12:02:26 ERROR ApplicationMaster: User application exited with status 143
18/07/16 12:02:26 INFO ApplicationMaster: Final app status: FAILED, exitCode: 143, (reason: User application exited with status 143)
18/07/16 12:02:26 INFO SparkContext: Invoking stop() from shutdown hook
18/07/16 12:02:26 INFO SparkUI: Stopped Spark web UI at http://172.31.36.42:36169
18/07/16 12:02:26 INFO TaskSetManager: Starting task 908.0 in stage 1494.0 (TID 88112, ip-172-31-35-59.us-west-2.compute.internal, executor 27, partition 908, RACK_LOCAL, 7278 bytes)
18/07/16 12:02:26 INFO TaskSetManager: Finished task 874.0 in stage 1494.0 (TID 88078) in 16482 ms on ip-172-31-35-59.us-west-2.compute.internal (executor 27) (879/4805)
18/07/16 12:02:26 INFO BlockManagerInfo: Added broadcast_2328_piece0 in memory on ip-172-31-36-42.us-west-2.compute.internal:34133 (size: 28.8 KB, free: 2.8 GB)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(20, ip-172-31-36-42.us-west-2.compute.internal, 34133, None),broadcast_2328_piece0,StorageLevel(memory, 1 replicas),29537,0))
18/07/16 12:02:26 INFO BlockManagerInfo: Added broadcast_2328_piece0 in memory on ip-172-31-47-55.us-west-2.compute.internal:45758 (size: 28.8 KB, free: 2.8 GB)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(16, ip-172-31-47-55.us-west-2.compute.internal, 45758, None),broadcast_2328_piece0,StorageLevel(memory, 1 replicas),29537,0))
18/07/16 12:02:26 INFO DAGScheduler: Job 1494 failed: toPandas at analysis_script.py:267, took 479.895614 s
18/07/16 12:02:26 INFO DAGScheduler: ShuffleMapStage 1494 (toPandas at analysis_script.py:267) failed in 478.993 s due to Stage cancelled because SparkContext was shut down
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerSQLExecutionEnd(0,1531742546839)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@28e5b10c)
18/07/16 12:02:26 INFO DAGScheduler: ShuffleMapStage 1495 (toPandas at analysis_script.py:267) failed in 479.270 s due to Stage cancelled because SparkContext was shut down
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@6b68c419)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(1494,1531742546841,JobFailed(org.apache.spark.SparkException: Job 1494 cancelled because SparkContext was shut down))
18/07/16 12:02:26 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
18/07/16 12:02:26 INFO YarnClusterSchedulerBackend: Shutting down all executors
18/07/16 12:02:26 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
18/07/16 12:02:26 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices(serviceOption=None, services=List(),started=false)
18/07/16 12:02:26 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!

I can't find much useful information about exit code 143. Does anyone know why this error occurs? Thanks.

Best Answer

Spark passes through exit codes when they are over 128, which is often the case with JVM errors. Exit code 143 signifies that the JVM received a SIGTERM - essentially a unix termination signal (see this post for more exit codes and an explanation). Additional details about Spark exit codes can be found in this question.
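
The arithmetic behind the code: processes killed by a signal conventionally exit with 128 plus the signal number, and SIGTERM is signal 15, so 128 + 15 = 143. A quick check in Python:

import signal

# Exit statuses above 128 conventionally encode 128 + signal number.
# SIGTERM is signal 15, so a SIGTERM-killed JVM reports 128 + 15 = 143.
print(int(signal.SIGTERM))        # 15
print(128 + int(signal.SIGTERM))  # 143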

Since you are not killing the job yourself, I would start by suspecting something external did. Given that almost exactly 8 minutes elapse between the job starting and the SIGTERM being issued, it seems much more likely that EMR itself is enforcing a maximum job run time / cluster age. Try checking through your EMR settings to see whether any such timeout is set - there was one in my case (on AWS Glue, but the same concept).
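
One way to run that check programmatically is with boto3, which the script already imports. A minimal sketch (the cluster id is a placeholder you would substitute with your own):

import boto3

# Inspect the cluster's termination behaviour and final status.
# 'j-XXXXXXXXXXXX' is a placeholder - replace it with your real cluster id.
emr = boto3.client("emr", region_name="us-west-2")
cluster = emr.describe_cluster(ClusterId="j-XXXXXXXXXXXX")["Cluster"]

print(cluster["AutoTerminate"])                # True when --auto-terminate was used
print(cluster["Status"]["StateChangeReason"])  # why the cluster last changed state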

Regarding python - AWS-EMR error exit code 143, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51368393/
