gpt4 book ai didi

airflow - 为什么我的 Airflow 任务是 "externally set to failed"?

转载 作者:行者123 更新时间:2023-12-04 12:18:01 26 4
gpt4 key购买 nike

我正在使用 Airflow 2.0.0,我的任务在运行几秒钟或几分钟后偶尔会被“外部”杀死。这些任务通常会成功运行(对于通过 airflow tasks test ... 启动的手动任务和计划的 DAG 运行),所以我相信这与我的 DAG 代码无关。
当任务失败时,这似乎是任务日志中的关键错误:

{local_task_job.py:170} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2020-12-20 11:26:11,448] {taskinstance.py:826} INFO - Dependencies all met for <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [queued]>
[2020-12-20 11:26:11,473] {taskinstance.py:826} INFO - Dependencies all met for <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [queued]>
[2020-12-20 11:26:11,473] {taskinstance.py:1017} INFO -
--------------------------------------------------------------------------------
[2020-12-20 11:26:11,473] {taskinstance.py:1018} INFO - Starting attempt 3 of 3
[2020-12-20 11:26:11,473] {taskinstance.py:1019} INFO -
--------------------------------------------------------------------------------
[2020-12-20 11:26:11,506] {taskinstance.py:1038} INFO - Executing <Task(PythonOperator): run_backupper> on 2020-12-19T02:00:00+00:00
[2020-12-20 11:26:11,509] {standard_task_runner.py:51} INFO - Started process 12059 to run task
[2020-12-20 11:26:11,515] {standard_task_runner.py:75} INFO - Running: ['airflow', 'tasks', 'run', 'daily_backups', 'run_backupper', '2020-12-19T02:00:00+00:00', '--job-id', '22', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/backupper/daily_backups.py', '--cfg-path', '/tmp/tmpnfmqtorg']
[2020-12-20 11:26:11,517] {standard_task_runner.py:76} INFO - Job 22: Subtask run_backupper
[2020-12-20 11:26:11,609] {logging_mixin.py:103} INFO - Running <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [running]> on host localhost
[2020-12-20 11:26:11,742] {taskinstance.py:1232} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=<user>
AIRFLOW_CTX_DAG_ID=daily_backups
AIRFLOW_CTX_TASK_ID=run_backupper
AIRFLOW_CTX_EXECUTION_DATE=2020-12-19T02:00:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2020-12-19T02:00:00+00:00
...
... my job's logs, indicating that the job is running healthily ...
...
[2020-12-20 11:26:16,587] {local_task_job.py:170} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2020-12-20 11:26:16,593] {process_utils.py:95} INFO - Sending Signals.SIGTERM to GPID 12059
[2020-12-20 11:27:16,609] {process_utils.py:108} WARNING - process psutil.Process(pid=12059, name='airflow task runner: daily_backups run_backupper 2020-12-19T02:00:00+00:00 22', status='sleeping', started='11:26:11') did not respond to SIGTERM. Trying SIGKILL
[2020-12-20 11:27:16,618] {process_utils.py:61} INFO - Process psutil.Process(pid=12059, name='airflow task runner: daily_backups run_backupper 2020-12-19T02:00:00+00:00 22', status='terminated', exitcode=<Negsignal.SIGKILL: -9>, started='11:26:11') (12059) terminated with exit code Negsignal.SIGKILL
[2020-12-20 11:27:16,618] {local_task_job.py:118} INFO - Task exited with return code Negsignal.SIGKILL
日志中的最后几行不一致。这是一个不同的版本,用于在早期尝试中失败的同一任务:
... same stuff as before ...
[2020-12-20 02:01:12,689] {local_task_job.py:170} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2020-12-20 02:01:12,695] {process_utils.py:95} INFO - Sending Signals.SIGTERM to GPID 24442
[2020-12-20 02:02:00,462] {taskinstance.py:1214} ERROR - Received SIGTERM. Terminating subprocesses.
[2020-12-20 02:02:00,498] {process_utils.py:61} INFO - Process psutil.Process(pid=24442, status='terminated', exitcode=0, started='02:00:10') (24442) terminated with exit code 0
[2020-12-20 02:02:00,499] {local_task_job.py:118} INFO - Task exited with return code 0
我怀疑在这种情况下,脚本能够及时响应 SIGTERM,而在前一种情况下,它在长时间运行的查询中被阻止并且无法干净地终止。

最佳答案

我相信问题在于 调度程序健康检查阈值设置为小于调度程序心跳间隔。
在我的配置中,我设置了 scheduler_health_check_threshold到 30 秒和 scheduler_heartbeat_sec到 60 秒。在检查孤立任务(本身由不同的参数 orphaned_tasks_check_interval 控制)期间,调度程序心跳被确定为超过 30 秒,这是有道理的,因为它每 60 秒才心跳一次。因此,调度程序被推断为不健康并因此被终止。
大约在失败的时候,我可以在 /var/log/syslog 中看到这样的消息

Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,368] {scheduler_job.py:1751} INFO - Resetting orphaned tasks for active dag runs
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,373] {scheduler_job.py:1764} INFO - Marked 1 SchedulerJob instances as failed
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,381] {scheduler_job.py:1805} INFO - Reset the following 1 orphaned TaskInstances:
Dec 20 11:26:14 localhost bash[11545]: #011<TaskInstance: daily_backups.run_backupper 2020-12-19 02:00:00+00:00 [running]>
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,571] {scheduler_job.py:938} INFO - 1 tasks up for execution:
Dec 20 11:26:14 localhost bash[11545]: #011<TaskInstance: daily_backups.run_backupper 2020-12-19 02:00:00+00:00 [scheduled]>
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,574] {scheduler_job.py:972} INFO - Figuring out tasks to run in Pool(name=default_pool) with 128 open slots and 1 task instances ready to be queued
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,575] {scheduler_job.py:999} INFO - DAG daily_backups has 0/16 running and queued tasks
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,575] {scheduler_job.py:1060} INFO - Setting the following tasks to queued state:
Dec 20 11:26:14 localhost bash[11545]: #011<TaskInstance: daily_backups.run_backupper 2020-12-19 02:00:00+00:00 [scheduled]>
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,578] {scheduler_job.py:1102} INFO - Sending TaskInstanceKey(dag_id='daily_backups', task_id='run_backupper', execution_date=datetime.datetime(2020, 12, 19, 2, 0, tzinfo=Timezone('UTC')), try_number=4) to executor with priority 2 and queue default
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,578] {base_executor.py:79} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'daily_backups', 'run_backupper', '2020-12-19T02:00:00+00:00', '--local', '--pool', 'default_pool', '--subdir', '/storage/airflow/dags/backupper/daily_backups.py']
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,581] {local_executor.py:81} INFO - QueuedLocalWorker running ['airflow', 'tasks', 'run', 'daily_backups', 'run_backupper', '2020-12-19T02:00:00+00:00', '--local', '--pool', 'default_pool', '--subdir', '/storage/airflow/dags/backupper/daily_backups.py']
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,707] {dagbag.py:440} INFO - Filling up the DagBag from /storage/airflow/dags/backupper/daily_backups.py
Dec 20 11:26:15 localhost bash[11545]: Running <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [queued]> on host localhost
并且时间戳与我的任务收到的​​ SIGTERM 非常吻合。我猜因为 SchedulerJob 被标记为失败,那么运行我的实际任务的 TaskInstance 被认为是孤立的,因此被标记为终止。同时它安排了新的尝试( try_number=4 )。
增加 scheduler_health_check_threshold到 120 秒并重新启动调度程序/网络服务器服务似乎解决了我的问题。

关于airflow - 为什么我的 Airflow 任务是 "externally set to failed"?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/65380492/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com