google-bigquery - BigQuery: cost of querying tables partitioned by ingestion time vs date/timestamp partitioned

Reposted by: 行者123. Updated: 2023-12-02 13:54:07

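We are trying to build (or, better said, rebuild) our DWH in the cloud on top of BigQuery. For our raw data we decided to use tables partitioned by a date field (such as a "created_date" field) instead of ingestion-time partitioning, because with this feature we can easily load the data and then query it with "group by" on the partitioning date column, build data marts, bla bla bla. We assumed this partitioning approach would increase query speed and reduce cost (compared with non-partitioned tables it does), but we discovered that when you query a table with a WHERE on the partitioning field (for example "select count(*) from table where created_date = current_date"), it costs money.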

Our old-style queries against ingestion-time partitioned tables with WHERE _PARTITIONTIME = '' were free! (For example "select count(*) from table where _PARTITIONTIME = current_date".)

For example:

1) select value1 from table1 where _PARTITIONTIME = current_date

2) select value1 from table1 where created_date = current_date

3) select count(*) from table1 where _PARTITIONTIME = current_date

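The second query costs more than the first because it scans 2 columns (value1 and created_date) instead of 1. That is logical. But unfair ((( The third query is absolutely free, by the way!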
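The difference between the three queries can be illustrated with a rough model of bytes scanned. This is only a sketch under stated assumptions, not BigQuery's actual billing code: it assumes value1 is an INT64 column (8 bytes per value), that created_date is a DATE column (8 bytes per value), that the _PARTITIONTIME pseudo column contributes no bytes, and that today's partition holds a hypothetical 1,000,000 rows.

```python
# Assumed logical size in bytes per value for each real column.
# value1 being INT64 is an assumption; the question does not state its type.
BYTES_PER_VALUE = {"value1": 8, "created_date": 8}


def bytes_scanned(columns_read, rows_in_partition):
    """Estimate bytes scanned by a query already pruned to one partition.

    Only real columns referenced by the query count; the _PARTITIONTIME
    pseudo column is modelled as costing nothing.
    """
    per_row = sum(BYTES_PER_VALUE[name] for name in columns_read)
    return per_row * rows_in_partition


rows_today = 1_000_000  # hypothetical size of today's partition

# 1) select value1 ... where _PARTITIONTIME = current_date
q1 = bytes_scanned(["value1"], rows_today)
# 2) select value1 ... where created_date = current_date
q2 = bytes_scanned(["value1", "created_date"], rows_today)
# 3) select count(*) ... where _PARTITIONTIME = current_date
q3 = bytes_scanned([], rows_today)

print(q1, q2, q3)  # 8000000 16000000 0
```

Under these assumptions the second query scans twice as many bytes as the first, and the third reads no real column at all, which matches the behaviour described above.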

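This is a very sad situation, because the documentation contains no warning about this "side effect". The feature is meant to make a database developer's life easier (I guess); it is positioned as a best practice and is strongly recommended by Google. But nobody says it will also cost you extra money!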

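So the question is: can we somehow query date-field partitioned tables using the partition key for free? Is there any other pseudocolumn or method of filtering by partition key available if you use date/timestamp field based partitioning?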

(P.S. You guys from Google must add some pseudocolumn for the date/timestamp partition method if it does not exist.)

Thanks!

Best answer

So the question is can we somehow query date-field partitioned tables using partition key for free?

The answer is no: querying partitions is not free of charge.

Is there any other pseudocolumn or method of filtering by partition key available if you use date/timestamp field based partitioning?

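If you want to partition by date, you can only do so in one of two ways: with ingestion-time partitioning, which exposes the _PARTITIONTIME pseudo column, or with the date values in a chosen DATE/TIMESTAMP column. There are currently no alternative options available. Keep in mind that one of the main goals of partitioning is to reduce the amount of data scanned, mainly by reducing the number of rows scanned.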
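The row-reduction effect mentioned in the answer can be sketched as follows. This is a toy model, not how BigQuery is implemented: the partition dates and row counts are hypothetical, and the table is represented as a plain mapping from partition key to row count.

```python
from datetime import date

# Hypothetical table: partition key (a created_date value) -> rows stored there.
partitions = {
    date(2019, 10, 9): 900_000,
    date(2019, 10, 10): 1_100_000,
    date(2019, 10, 11): 1_000_000,
}


def rows_scanned(table, wanted=None):
    """Rows read by a query.

    Without a filter on the partition key every partition is read;
    with one, only the matching partitions are.
    """
    if wanted is None:
        return sum(table.values())
    return sum(count for day, count in table.items() if day in wanted)


full_scan = rows_scanned(partitions)
pruned = rows_scanned(partitions, wanted={date(2019, 10, 11)})

print(full_scan, pruned)  # 3000000 1000000
```

In this model a filter on the partitioning column still cuts the rows read to a single partition; what remains billable is the data of the real columns referenced within that partition.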

You guys from google must add some pseudocolumn for the date/timestamp partition method if it does not exist

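I understand that you would like a pseudocolumn for the column-based partitioning method, but could you elaborate in your original post on what values you would expect to see in it?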

Edit: A feature request has been filed on your behalf. You can follow it here.

Regarding "google-bigquery - BigQuery: cost of querying tables partitioned by ingestion time vs date/timestamp partitioned", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58343216/
