airflow - Why are my Airflow tasks "externally set to failed"?


Reposted by: 行者123. Last updated: 2023-12-04 12:18:01

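I'm using Airflow 2.0.0, and my tasks are occasionally killed "externally" after running for a few seconds or minutes. The tasks normally run successfully (both manual runs launched via airflow tasks test ... and scheduled DAG runs), so I believe this is unrelated to my DAG code.
When a task fails, this appears to be the key error in the task log: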

{local_task_job.py:170} WARNING - State of this instance has been externally set to failed. Terminating instance.

[2020-12-20 11:26:11,448] {taskinstance.py:826} INFO - Dependencies all met for <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [queued]>
[2020-12-20 11:26:11,473] {taskinstance.py:826} INFO - Dependencies all met for <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [queued]>
[2020-12-20 11:26:11,473] {taskinstance.py:1017} INFO - 
--------------------------------------------------------------------------------
[2020-12-20 11:26:11,473] {taskinstance.py:1018} INFO - Starting attempt 3 of 3
[2020-12-20 11:26:11,473] {taskinstance.py:1019} INFO - 
--------------------------------------------------------------------------------
[2020-12-20 11:26:11,506] {taskinstance.py:1038} INFO - Executing <Task(PythonOperator): run_backupper> on 2020-12-19T02:00:00+00:00
[2020-12-20 11:26:11,509] {standard_task_runner.py:51} INFO - Started process 12059 to run task
[2020-12-20 11:26:11,515] {standard_task_runner.py:75} INFO - Running: ['airflow', 'tasks', 'run', 'daily_backups', 'run_backupper', '2020-12-19T02:00:00+00:00', '--job-id', '22', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/backupper/daily_backups.py', '--cfg-path', '/tmp/tmpnfmqtorg']
[2020-12-20 11:26:11,517] {standard_task_runner.py:76} INFO - Job 22: Subtask run_backupper
[2020-12-20 11:26:11,609] {logging_mixin.py:103} INFO - Running <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [running]> on host localhost
[2020-12-20 11:26:11,742] {taskinstance.py:1232} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=<user>
AIRFLOW_CTX_DAG_ID=daily_backups
AIRFLOW_CTX_TASK_ID=run_backupper
AIRFLOW_CTX_EXECUTION_DATE=2020-12-19T02:00:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2020-12-19T02:00:00+00:00
...
... my job's logs, indicating that the job is running healthily ...
...
[2020-12-20 11:26:16,587] {local_task_job.py:170} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2020-12-20 11:26:16,593] {process_utils.py:95} INFO - Sending Signals.SIGTERM to GPID 12059
[2020-12-20 11:27:16,609] {process_utils.py:108} WARNING - process psutil.Process(pid=12059, name='airflow task runner: daily_backups run_backupper 2020-12-19T02:00:00+00:00 22', status='sleeping', started='11:26:11') did not respond to SIGTERM. Trying SIGKILL
[2020-12-20 11:27:16,618] {process_utils.py:61} INFO - Process psutil.Process(pid=12059, name='airflow task runner: daily_backups run_backupper 2020-12-19T02:00:00+00:00 22', status='terminated', exitcode=<Negsignal.SIGKILL: -9>, started='11:26:11') (12059) terminated with exit code Negsignal.SIGKILL
[2020-12-20 11:27:16,618] {local_task_job.py:118} INFO - Task exited with return code Negsignal.SIGKILL

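The last few lines of the log are not consistent between failures. Here is a different version, from an earlier failed attempt of the same task: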

... same stuff as before ...
[2020-12-20 02:01:12,689] {local_task_job.py:170} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2020-12-20 02:01:12,695] {process_utils.py:95} INFO - Sending Signals.SIGTERM to GPID 24442
[2020-12-20 02:02:00,462] {taskinstance.py:1214} ERROR - Received SIGTERM. Terminating subprocesses.
[2020-12-20 02:02:00,498] {process_utils.py:61} INFO - Process psutil.Process(pid=24442, status='terminated', exitcode=0, started='02:00:10') (24442) terminated with exit code 0
[2020-12-20 02:02:00,499] {local_task_job.py:118} INFO - Task exited with return code 0

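I suspect that in this case the script was able to respond to SIGTERM in time, whereas in the previous case it was blocked in a long-running query and could not terminate cleanly.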
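To illustrate that suspicion, here is a minimal sketch (not Airflow's own code) of a Python process that turns SIGTERM into an exception and unwinds cleanly between short steps. The names TaskTerminated and run_interruptible are hypothetical, the process signals itself only to simulate the supervisor, and a POSIX system is assumed.

```python
import os
import signal
import time


class TaskTerminated(Exception):
    """Hypothetical exception raised from the SIGTERM handler."""


def _on_sigterm(signum, frame):
    # Python runs this handler only when control is back in the interpreter,
    # so a long blocking call inside a C extension can delay it.
    raise TaskTerminated(f"received signal {signum}")


def run_interruptible(work_steps, signal_at_step=2, step_seconds=0.01):
    """Run short steps and return (completed_steps, cleaned_up)."""
    previous = signal.signal(signal.SIGTERM, _on_sigterm)
    completed = 0
    cleaned_up = False
    try:
        for step in range(work_steps):
            if step == signal_at_step:
                # Simulation only: stand in for the supervisor's SIGTERM.
                os.kill(os.getpid(), signal.SIGTERM)
            time.sleep(step_seconds)  # short step, so the handler runs promptly
            completed += 1
    except TaskTerminated:
        cleaned_up = True  # a real task would release its resources here
    finally:
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)  # restore the old handler
    return completed, cleaned_up
```

Because the handler only runs once the interpreter regains control, a task stuck inside one long blocking call may not react before the supervisor escalates to SIGKILL, which would be consistent with the first log above.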

Best answer

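I believe the problem was that the scheduler health check threshold was set lower than the scheduler heartbeat interval.
In my configuration, I had set scheduler_health_check_threshold to 30 seconds and scheduler_heartbeat_sec to 60 seconds. During the check for orphaned tasks (which is itself controlled by a different parameter, orphaned_tasks_check_interval), the scheduler's last heartbeat was found to be more than 30 seconds old, which makes sense because it only heartbeats every 60 seconds. The scheduler was therefore inferred to be unhealthy and was terminated.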
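The arithmetic behind that reasoning can be sketched with a deliberately simplified model. It assumes heartbeats land at exact multiples of the interval and that a check can happen at any whole second. The function names are hypothetical and this is not how Airflow implements the check.

```python
def heartbeat_age(now, heartbeat_interval):
    """Seconds since the latest heartbeat in this simplified model."""
    return now % heartbeat_interval


def looks_unhealthy(now, heartbeat_interval, threshold):
    """True when the latest heartbeat is older than the threshold."""
    return heartbeat_age(now, heartbeat_interval) > threshold


def unhealthy_fraction(heartbeat_interval, threshold, horizon=3600):
    """Fraction of whole-second check times that flag the scheduler."""
    flagged = sum(
        looks_unhealthy(t, heartbeat_interval, threshold)
        for t in range(horizon)
    )
    return flagged / horizon


# 60 s heartbeat with a 30 s threshold: roughly half of all checks fail.
print(unhealthy_fraction(60, 30))
# 60 s heartbeat with a 120 s threshold: no check fails in this model.
print(unhealthy_fraction(60, 120))
```

Under these assumptions a check flags the scheduler in the second half of each heartbeat cycle, so whether a given orphan check trips depends on its timing. That would fit failures that only show up occasionally.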
Around the time of the failures, I could see messages like these in /var/log/syslog:

Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,368] {scheduler_job.py:1751} INFO - Resetting orphaned tasks for active dag runs
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,373] {scheduler_job.py:1764} INFO - Marked 1 SchedulerJob instances as failed
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,381] {scheduler_job.py:1805} INFO - Reset the following 1 orphaned TaskInstances:
Dec 20 11:26:14 localhost bash[11545]: #011<TaskInstance: daily_backups.run_backupper 2020-12-19 02:00:00+00:00 [running]>
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,571] {scheduler_job.py:938} INFO - 1 tasks up for execution:
Dec 20 11:26:14 localhost bash[11545]: #011<TaskInstance: daily_backups.run_backupper 2020-12-19 02:00:00+00:00 [scheduled]>
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,574] {scheduler_job.py:972} INFO - Figuring out tasks to run in Pool(name=default_pool) with 128 open slots and 1 task instances ready to be queued
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,575] {scheduler_job.py:999} INFO - DAG daily_backups has 0/16 running and queued tasks
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,575] {scheduler_job.py:1060} INFO - Setting the following tasks to queued state:
Dec 20 11:26:14 localhost bash[11545]: #011<TaskInstance: daily_backups.run_backupper 2020-12-19 02:00:00+00:00 [scheduled]>
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,578] {scheduler_job.py:1102} INFO - Sending TaskInstanceKey(dag_id='daily_backups', task_id='run_backupper', execution_date=datetime.datetime(2020, 12, 19, 2, 0, tzinfo=Timezone('UTC')), try_number=4) to executor with priority 2 and queue default
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,578] {base_executor.py:79} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'daily_backups', 'run_backupper', '2020-12-19T02:00:00+00:00', '--local', '--pool', 'default_pool', '--subdir', '/storage/airflow/dags/backupper/daily_backups.py']
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,581] {local_executor.py:81} INFO - QueuedLocalWorker running ['airflow', 'tasks', 'run', 'daily_backups', 'run_backupper', '2020-12-19T02:00:00+00:00', '--local', '--pool', 'default_pool', '--subdir', '/storage/airflow/dags/backupper/daily_backups.py']
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,707] {dagbag.py:440} INFO - Filling up the DagBag from /storage/airflow/dags/backupper/daily_backups.py
Dec 20 11:26:15 localhost bash[11545]: Running <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [queued]> on host localhost

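The timestamps line up well with the SIGTERM that my task received. My guess is that because the SchedulerJob was marked as failed, the TaskInstance running my actual task was considered orphaned and was therefore marked for termination. At the same time, the scheduler queued a new attempt (try_number=4).
Increasing scheduler_health_check_threshold to 120 seconds and restarting the scheduler/webserver services seems to have resolved my problem.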
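A quick sanity check for this kind of misconfiguration could look like the sketch below. It parses configuration text with Python's standard configparser and assumes both keys sit in the [scheduler] section of airflow.cfg. The sample text and the function name check_scheduler_timing are hypothetical.

```python
import configparser

# Hypothetical excerpt reproducing the values described in the answer.
SAMPLE_CFG = """
[scheduler]
scheduler_heartbeat_sec = 60
scheduler_health_check_threshold = 30
"""


def check_scheduler_timing(cfg_text):
    """Report whether the health check threshold exceeds the heartbeat interval."""
    # Interpolation is disabled so literal '%' characters elsewhere in a
    # real config file cannot break parsing.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    heartbeat = parser.getint("scheduler", "scheduler_heartbeat_sec")
    threshold = parser.getint("scheduler", "scheduler_health_check_threshold")
    return {
        "heartbeat": heartbeat,
        "threshold": threshold,
        "ok": threshold > heartbeat,
    }


print(check_scheduler_timing(SAMPLE_CFG))
```

With the sample values the check reports a problem. Raising the threshold to 120, as described above, makes it pass. Against a real deployment, the file contents could be read from disk and passed in as cfg_text.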

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/65380492/
