clickhouse - Is a ClickHouse Buffer table appropriate for real-time ingestion of many small inserts?


Reposted by: 行者123. Updated: 2023-12-05 02:39:23


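I am writing an application that plots financial data and interacts with a real-time feed of such data. Due to the nature of the task, live market data may arrive very frequently, one trade at a time. I am using the database locally and I am the only user. Only one program (my middleware) will insert data into the database. My primary concern is latency: I want to minimize it as much as possible. For that reason, I would like to avoid having a queue (in a sense, I want the Buffer table to fill that role). Many of the analytics that ClickHouse computes for me are also expected to be real-time (as far as possible). I have three questions: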

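1) Clarify some of the limitations/caveats in the Buffer table documentation
2) Clarify how querying works (regular queries + materialized views)
3) What happens if I query the database while data is being flushed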

Question 1) Clarifying some of the limitations/caveats in the Buffer table documentation

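From the ClickHouse documentation, I understand that many small INSERTs are sub-optimal, to say the least. While researching the topic, I found that the Buffer engine [1] could be used as a solution. That makes sense to me, but when I read the Buffer documentation I found some caveats: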

Note that it does not make sense to insert data one row at a time, even for Buffer tables. This will only produce a speed of a few thousand rows per second, while inserting larger blocks of data can produce over a million rows per second (see the section “Performance”).

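A few thousand rows per second is absolutely fine for me, but I am concerned about other performance considerations: if I commit data to the Buffer table one row at a time, should I expect spikes in CPU/memory? If I understand correctly, committing one row at a time to a MergeTree table would cause a lot of additional work for the merge job, but that should not be a problem if a Buffer table is used, correct?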

If the server is restarted abnormally, the data in the buffer is lost.

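I understand that this refers to things like a power outage or the computer crashing. If I shut down the computer normally or stop the ClickHouse server normally, can I expect the buffer to flush its data to the target table?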

Question 2) Clarifying how querying works (regular queries + materialized views)

When reading from a Buffer table, data is processed both from the buffer and from the destination table (if there is one). Note that the Buffer table does not support an index. In other words, data in the buffer is fully scanned, which might be slow for large buffers. (For data in a subordinate table, the index that it supports will be used.)

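Does this mean I can query the target table and expect the Buffer table's data to be included automatically? Or is it the other way around: I query the Buffer table and the target table is included in the background? If either is true (and I do not need to combine the two tables manually), does that also mean materialized views would be populated? Which table should trigger the materialized view: the on-disk table or the Buffer table? Or both, in some way?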

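I rely heavily on materialized views and need them updated in real time (or as close to it as possible). What is the best strategy to achieve that?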

Question 3) What happens if I query the database while data is being flushed?

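My two main concerns are:

Running a query at the exact moment a flush happens: is there a risk of duplicated or missing records?
At which point are the materialized views of the target table populated (I guess it depends on whether the MV is triggered by the target table or by the Buffer table)? Does flushing the buffer matter for how I structure the MVs?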

Thank you for your time.

[1] https://clickhouse.tech/docs/en/engines/table-engines/special/buffer/

Best answer

A few thousand rows per second is absolutely fine for me, however I am concerned about other performance considerations - if I do commit data to the buffer table one row at a time, should I expect spikes in CPU/memory?

No, the Buffer table engine does not produce CPU/memory spikes.

If I understand correctly, committing one row at a time to a MergeTree table would cause a lot of additional work for the merging job, but it should not be a problem if Buffer Table is used, correct?

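The Buffer table engine works as an in-memory buffer that periodically flushes batches of rows to the underlying *MergeTree table. The Buffer table's parameters control the size and frequency of those flushes.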
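To make the "size and frequency" parameters concrete, here is a small Python model of the flush rule as the ClickHouse documentation describes it for `Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)`: data is flushed when all `min_*` conditions are met or when at least one `max_*` condition is met. This is a simplified sketch, not ClickHouse code. The names `BufferThresholds` and `should_flush` are made up for illustration, the threshold values are arbitrary examples, and the exact boundary comparisons are an assumption.

```python
from dataclasses import dataclass


@dataclass
class BufferThresholds:
    """Illustrative stand-in for the Buffer engine's flush parameters."""
    min_time: float  # seconds since the first write to the buffer
    max_time: float
    min_rows: int
    max_rows: int
    min_bytes: int
    max_bytes: int


def should_flush(t: BufferThresholds, age_seconds: float, rows: int, size_bytes: int) -> bool:
    """Return True when the modeled buffer would be flushed.

    Rule modeled: flush if ALL min_* thresholds are met,
    or if AT LEAST ONE max_* threshold is met.
    """
    all_min = (
        age_seconds >= t.min_time
        and rows >= t.min_rows
        and size_bytes >= t.min_bytes
    )
    any_max = (
        age_seconds >= t.max_time
        or rows >= t.max_rows
        or size_bytes >= t.max_bytes
    )
    return all_min or any_max


# Example values only; tune them for your own latency/throughput trade-off.
thresholds = BufferThresholds(
    min_time=1, max_time=5,
    min_rows=100, max_rows=10_000,
    min_bytes=10_000, max_bytes=1_000_000,
)

# A slow one-row-at-a-time feed: too few rows to satisfy min_rows,
# so only max_time eventually forces the flush.
print(should_flush(thresholds, age_seconds=2, rows=3, size_bytes=300))  # still buffered
print(should_flush(thresholds, age_seconds=5, rows=3, size_bytes=300))  # max_time reached
```

In this model, a one-trade-at-a-time feed that never reaches the row or byte thresholds is flushed roughly every `max_time` seconds, so that value approximately bounds how long a row stays only in memory.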

If I shutdown the computer normally or stop the clickhouse server normally, can I expect the buffer to flush data to the target table?

Yes, when the server is stopped normally, Buffer tables flush their data.

I query the buffer table and the target table is included in the background?

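Yes, that is the correct behavior. When you SELECT from the Buffer table, the SELECT is also passed to the underlying *MergeTree table, and data that has already been flushed is read from the *MergeTree table.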

does that also mean Materialized Views would be populated?

That is not clear: do you CREATE MATERIALIZED VIEW as a trigger FROM the *MergeTree table or as a trigger FROM the Buffer table, and which table engine do you use for the TO table clause?

I suggest creating the MATERIALIZED VIEW as a trigger on the underlying MergeTree table.
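A toy Python model may help show what this choice implies for freshness. It assumes that a materialized view fires on inserts into its source table and that a Buffer flush reaches the destination table as one inserted block; both are assumptions about typical ClickHouse behavior, not something stated in this answer. The class `ToyBufferSetup` and its methods are hypothetical and are not a ClickHouse API.

```python
class ToyBufferSetup:
    """Toy model: a Buffer in front of a destination table,
    with one materialized view attached to the destination."""

    def __init__(self):
        self.buffer = []        # rows still held in memory
        self.destination = []   # rows already flushed to the MergeTree-like table
        self.mv_total = 0       # running SUM(volume) maintained by the view

    def insert(self, volume):
        """An INSERT into the Buffer table only touches the in-memory buffer."""
        self.buffer.append(volume)

    def flush(self):
        """A flush writes the buffered rows to the destination as one block;
        that is the moment the view on the destination sees them."""
        block, self.buffer = self.buffer, []
        self.destination.extend(block)
        self.mv_total += sum(block)

    def select_via_buffer(self):
        """Reading through the Buffer table covers flushed and buffered rows."""
        return self.destination + self.buffer


setup = ToyBufferSetup()
for volume in (10, 20, 30):
    setup.insert(volume)

# Buffered rows are visible when reading through the Buffer table,
# but the view has not been updated yet.
print(sum(setup.select_via_buffer()), setup.mv_total)

setup.flush()

# After the flush, the raw data and the view agree.
print(sum(setup.select_via_buffer()), setup.mv_total)
```

Under these assumptions, a view triggered from the MergeTree table lags the raw data by at most one flush interval, so the Buffer table's time thresholds decide how close to real time the view can be.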

Regarding "clickhouse - Is a ClickHouse Buffer table appropriate for real-time ingestion of many small inserts?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/69147028/


