
databricks - Delta Live Tables for incremental batch processing

Reposted. Author: 行者123. Updated: 2023-12-05 05:38:09

Is it possible to use Delta Live Tables to perform incremental batch processing?

As it stands, I believe this code will always load all the data available in the directory whenever the pipeline runs:

CREATE LIVE TABLE lendingclub_raw
COMMENT "The raw loan risk dataset, ingested from /databricks-datasets."
TBLPROPERTIES ("quality" = "bronze")
AS SELECT * FROM parquet.`/databricks-datasets/samples/lending_club/parquet/`

But what if we do this instead:

CREATE LIVE TABLE lendingclub_raw
COMMENT "The raw loan risk dataset, ingested from /databricks-datasets."
TBLPROPERTIES ("quality" = "bronze")
AS SELECT * FROM cloud_files("/databricks-datasets/samples/lending_club/parquet/", "parquet")

If the pipeline runs in triggered mode, will it load only the incremental data on each run?

I know that with Auto Loader you can achieve incremental batch processing by using the .trigger(once=True) or .trigger(availableNow=True) trigger modes and running the pipeline on a schedule.
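For reference, the Auto Loader pattern described above looks roughly like this in plain Structured Streaming outside DLT (a sketch: the checkpoint location and target table name are illustrative, and a Databricks runtime providing `spark` is assumed):

```python
# Incremental batch ingestion with Auto Loader (cloudFiles source).
# Only files added since the last run are processed, tracked via the checkpoint.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "parquet")
      .load("/databricks-datasets/samples/lending_club/parquet/"))

(df.writeStream
   .option("checkpointLocation", "/tmp/checkpoints/lendingclub_raw")  # illustrative path
   .trigger(availableNow=True)  # process everything currently available, then stop
   .toTable("lendingclub_raw"))
```

Run on a schedule, each invocation picks up only the new files and then the stream terminates.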

Since you cannot define a trigger explicitly in DLT, how would this work?

Best Answer

You need to define your table as a streaming live table, so that it processes only the data that has arrived since the last invocation. From the docs:

A streaming live table or view processes data that has been added only since the last pipeline update.

This can then be combined with triggered execution, which behaves similarly to Trigger.AvailableNow. From the docs:

Triggered pipelines update each table with whatever data is currently available and then stop the cluster running the pipeline. Delta Live Tables automatically analyzes the dependencies between your tables and starts by computing those that read from external sources. Tables within the pipeline are updated after their dependent data sources have been updated.
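Putting the two together, the question's second snippet would be declared as a streaming live table (a sketch following DLT SQL syntax; table name and comment carried over from the question):

```sql
CREATE OR REFRESH STREAMING LIVE TABLE lendingclub_raw
COMMENT "The raw loan risk dataset, ingested from /databricks-datasets."
TBLPROPERTIES ("quality" = "bronze")
AS SELECT * FROM cloud_files("/databricks-datasets/samples/lending_club/parquet/", "parquet")
```

When the pipeline runs in triggered mode, this table ingests only the files that arrived since the previous update, and the cluster then shuts down.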

Regarding databricks - Delta Live Tables for incremental batch processing, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/73002747/
