On Airflow v2.6.3, I run daily Spark jobs. Every once in a while, a job that has already succeeded gets retried through an EMRSensor task. The retried task shows a "State of this instance has been externally set to up_for_retry. Terminating instance." error followed by "ERROR - Received SIGTERM. Terminating subprocesses", which causes the rest of the DAG to fail with upstream failures. The initial EMRSensor task doesn't log a "1 downstream tasks scheduled from follow-on schedule task" message after detecting the successful job; it simply retries it.
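Roughly, the sensor task looks like the sketch below (the DAG id, cluster/step ids, and retry timings are placeholders, not my real values):

```python
# Rough sketch of the sensor task (placeholder ids and timings, not the real values).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

with DAG(
    dag_id="daily_spark_jobs",           # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    # Sensor that waits for the EMR Spark step to finish. This is the task that
    # sporadically gets set to up_for_retry even though the step already succeeded.
    watch_step = EmrStepSensor(
        task_id="watch_step",
        job_flow_id="j-XXXXXXXXXXXXX",   # placeholder EMR cluster id
        step_id="s-XXXXXXXXXXXXX",       # placeholder; in practice pulled from the add-steps task via XCom
        aws_conn_id="aws_default",
        poke_interval=60,                # seconds between polls
        timeout=6 * 60 * 60,             # give up after 6 hours
    )

    # Downstream task that ends up with an upstream failure when the sensor retries.
    downstream = EmptyOperator(task_id="downstream")

    watch_step >> downstream
```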
Previously, on Airflow v2.2.2, this occurred around 5 times a week across the daily jobs. After upgrading from v2.2.2 to v2.6.3, the error didn't show up again until about a week later.
Something else I noticed: for the EMRSensor tasks that detect a successful job but don't go on to their downstream tasks, the Airflow logs are missing some starting and ending lines that the other successful, continuing runs have:
Similarly, at the end of the Airflow log, the last two lines are omitted:
Does anyone know what's going on? It seems like a heartbeat/Spark timeout issue, since it occurs sporadically.
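In case it matters, these are the heartbeat-related settings in my environment. The snippet below just prints the values in use; my assumption (unconfirmed) is that a missed heartbeat makes the scheduler treat the task as a zombie, which would explain the external up_for_retry followed by the SIGTERM:

```python
# Quick check of the heartbeat / zombie-detection settings in my environment.
# (Assumption on my part: a missed heartbeat causing the scheduler to mark the
# task instance as a zombie would explain the external up_for_retry + SIGTERM,
# but I haven't confirmed that this is what's happening.)
from airflow.configuration import conf

for section, key in [
    ("scheduler", "job_heartbeat_sec"),
    ("scheduler", "scheduler_heartbeat_sec"),
    ("scheduler", "scheduler_zombie_task_threshold"),
]:
    print(f"[{section}] {key} = {conf.get(section, key)}")
```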
The Airflow DAG is triggering spark-submit jobs on an EMR node on EC2.
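The Spark step itself is added to the cluster with something like the following (bucket, script path, and cluster id are placeholders, not the real values):

```python
# Sketch of the spark-submit step the DAG adds to the EMR cluster
# (bucket, script path, and cluster id are placeholders, not the real values).
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

SPARK_STEPS = [
    {
        "Name": "daily_spark_job",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://my-bucket/jobs/daily_job.py",  # placeholder script location
            ],
        },
    }
]

# Added to the same DAG as the sensor sketched earlier; the sensor reads the
# returned step id from XCom and polls it until it reaches COMPLETED.
add_step = EmrAddStepsOperator(
    task_id="add_step",
    job_flow_id="j-XXXXXXXXXXXXX",                   # placeholder EMR cluster id
    steps=SPARK_STEPS,
    aws_conn_id="aws_default",
)
```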