
python - Checking whether a file exists in a Google Cloud Storage bucket via Apache Airflow?

Reposted — Author: 行者123 — Updated: 2023-12-05 01:58:17

I have a DAG that picks up the results of a script from a Google Cloud Storage bucket, loads them into a table in Google BigQuery, and then deletes the file from the bucket.

I would like the DAG to check every hour on weekends. Right now I am using GoogleCloudStorageToBigQueryOperator, and the DAG fails if the file does not exist. Is there a way to configure the DAG so that it does not fail when the file is missing? Perhaps a try/except?

Best Answer

You can use the GCSObjectExistenceSensor from the Google provider package to verify that the file exists before running the downstream tasks.

from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor

gcs_object_exists = GCSObjectExistenceSensor(
    task_id="gcs_object_exists_task",
    bucket=BUCKET_1,
    object=PATH_TO_UPLOAD_FILE,
    mode='poke',
)
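Wired into a DAG like the one described in the question, the sensor sits ahead of the load task. The following is an illustrative sketch only: the DAG id, bucket name, object path, and table name are made-up placeholders, not values from the original post. Setting soft_fail=True makes the sensor mark itself SKIPPED instead of failed when the file never appears, which is what the hourly weekend runs need:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="gcs_to_bq_hourly",          # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # Skip (rather than fail) the run if the file is not there yet
    wait_for_file = GCSObjectExistenceSensor(
        task_id="gcs_object_exists_task",
        bucket="my-bucket",              # placeholder bucket
        object="path/to/file.csv",       # placeholder object path
        soft_fail=True,
        timeout=60 * 10,                 # give up after 10 minutes
    )

    load_to_bq = GCSToBigQueryOperator(
        task_id="load_to_bq",
        bucket="my-bucket",
        source_objects=["path/to/file.csv"],
        destination_project_dataset_table="my_project.my_dataset.my_table",
        write_disposition="WRITE_APPEND",
    )

    wait_for_file >> load_to_bq
```

When the sensor is skipped via soft_fail, the downstream load task is skipped as well (with the default trigger rule), so a missing file simply ends that hourly run quietly.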

You can check the official example here. Keep in mind that this sensor extends BaseSensorOperator, so you can define parameters such as soft_fail, poke_interval, timeout, and mode to suit your needs:

  • soft_fail (bool) – Set to true to mark the task as SKIPPED on failure.
  • poke_interval (float) – Time in seconds that the job should wait in between each try.
  • timeout (float) – Time, in seconds, before the task times out and fails.
  • mode (str) – How the sensor operates. Options are: { poke | reschedule }; the default is poke. When set to poke, the sensor takes up a worker slot for its whole execution time and sleeps between pokes. Use this mode if the expected runtime of the sensor is short or if a short poke interval is required. Note that the sensor will hold a worker slot and a pool slot for the duration of its runtime in this mode. When set to reschedule, the sensor task frees the worker slot when the criteria are not yet met, and it is rescheduled at a later time. Use this mode if the time before the criteria are met is expected to be quite long. The poke interval should be more than one minute to prevent too much load on the scheduler.
  • exponential_backoff (bool) – Allow progressively longer waits between pokes by using an exponential backoff algorithm.
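To build intuition for how poke_interval, timeout, and exponential_backoff interact, here is a minimal pure-Python model of the kind of poke loop a sensor runs. This is a simplified sketch for illustration, not Airflow's actual implementation; the function name and signature are invented for the example (Airflow's real backoff also adds randomized jitter):

```python
def run_sensor(poke, poke_interval=60.0, timeout=60 * 60 * 24 * 7,
               exponential_backoff=False):
    """Simplified model of a sensor's poke loop.

    `poke` is a zero-argument callable that returns True once the
    criteria is met (e.g. the GCS object exists).  Returns the total
    simulated wait time in seconds, or raises TimeoutError.
    """
    elapsed = 0.0
    try_number = 1
    while not poke():
        if elapsed >= timeout:
            # in Airflow, soft_fail=True would mark the task SKIPPED here
            raise TimeoutError(f"sensor timed out after {elapsed:.0f}s")
        if exponential_backoff:
            # each failed poke doubles the wait
            delay = poke_interval * (2 ** (try_number - 1))
        else:
            delay = poke_interval
        elapsed += delay  # a real sensor would sleep(delay) here
        try_number += 1
    return elapsed
```

For example, with poke_interval=10 and a file that appears on the third check, the loop waits 10 + 10 = 20 seconds without backoff, but 10 + 20 = 30 or more with exponential backoff enabled.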

(Source: Airflow BaseSensorOperator documentation)

Regarding "python - checking whether a file exists in a Google Cloud Storage bucket via Apache Airflow?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/68628542/
