
Any reason to set optimizeWrite = 'true' when autoCompact = 'auto' and no partitions are used?

Reposted. Author: bug小助手. Updated: 2023-10-28 22:04:46



I work with time series data, and ingestion-time clustering (no partitioning) has proven to work well. The Databricks docs state: "Optimized writes are most effective for partitioned tables, as they reduce the number of small files written to each partition."



Without any partitions, and with autoCompact = 'auto', I am wondering whether there is any benefit to setting optimizeWrite = 'true'. The smaller files written to disk directly by the executors will be compacted by autoCompact anyway. Is my understanding correct?

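For reference, the combination under discussion can be enabled per table via Delta table properties. This is a sketch assuming the property names documented for Databricks Delta tables; `events` is a placeholder table name:

```sql
-- Placeholder table 'events'; property names as documented for
-- Databricks Delta tables. 'auto' lets the runtime decide when
-- compaction is worthwhile.
ALTER TABLE events SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact'   = 'auto'
);
```

The question is whether the first property adds anything when the second is already set and the table has no partitions.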


Recommended answer


Yes, they could be compacted, but you need to take into account this statement from the docs:




Auto compaction occurs after a write to a table has succeeded and runs synchronously on the cluster that has performed the write.



So the write time could increase. Plus, the weakest point of both auto-compact and optimized writes is that they don't collocate the data the way OPTIMIZE ... ZORDER BY does.

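For comparison, explicit optimization with collocation looks like this. A sketch only; the table name and Z-order column are assumptions, not from the original question:

```sql
-- Rewrites files and collocates rows by the chosen column,
-- which auto-compact and optimized writes do not do.
-- 'events' and 'event_time' are placeholder names.
OPTIMIZE events
ZORDER BY (event_time);
```

This is typically run as a scheduled maintenance job rather than on every write.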


But right now I would recommend looking into Liquid Clustering (doc, blog post): from a performance standpoint it could be better than both auto-compact and automatic or explicit optimization.

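Liquid Clustering is declared with CLUSTER BY at table creation instead of partitioning or Z-ordering. A minimal sketch, assuming the SQL syntax from the Databricks clustering docs; the table and column names are placeholders:

```sql
-- Placeholder schema; CLUSTER BY replaces PARTITIONED BY
-- and ZORDER BY for a liquid-clustered table.
CREATE TABLE events (
  event_time TIMESTAMP,
  device_id  STRING,
  value      DOUBLE
) CLUSTER BY (event_time);
```

For time series data, clustering on the event timestamp would serve a similar role to the ingestion-time clustering the question relies on, while remaining robust to out-of-order data.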


Comments

Thanks. I am reading ordered time series data from a source DB in a batch process. Do you think that without optimized writes the files would not be saved to disk in sorted sequential order (if each executor has a different subset of the data)? Maybe optimized writes would preserve the order of the queried data, since everything is collected together before a write?


databricks.com/blog/2022/11/18/… may help if you have data clustered during ingestion.


Yes, ITC (ingestion-time clustering) is what I'm using, as per the original question.


@AlexOtt: Is it possible to use Liquid Clustering with the PySpark writeStream API? And if not, how does one actually create a liquid-clustered Delta table from a Python streaming DataFrame?


You need to set CLUSTER BY on the non-partitioned table: docs.databricks.com/en/delta/clustering.html

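One way to combine this with a streaming write is to create the clustered table up front and then have the stream target it by name, since CLUSTER BY is declared in DDL rather than on the writer. A sketch under that assumption; all names are placeholders:

```sql
-- Create the liquid-clustered target first
-- ('events_clustered' and its columns are placeholder names).
CREATE TABLE IF NOT EXISTS events_clustered (
  event_time TIMESTAMP,
  payload    STRING
) CLUSTER BY (event_time);
-- A PySpark writeStream can then write into this existing table
-- (e.g. with .toTable("events_clustered")); the clustering spec
-- comes from the DDL, not from the streaming writer.
```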
