
sql - Fast way to count distinct column values (using an index?)

Reposted by 行者123. Updated: 2023-11-29 11:56:04

Problem: the query takes too long.

I have a new table with 3e6 rows that looks like this:

CREATE TABLE everything_crowberry (
    id            SERIAL PRIMARY KEY,
    group_id      INTEGER,
    group_type    group_type_name,
    epub_id       TEXT,
    reg_user_id   INTEGER,
    device_id     TEXT,
    campaign_id   INTEGER,
    category_name TEXT,
    instance_name TEXT,
    protobuf      TEXT,
    UNIQUE (group_id, group_type, reg_user_id, category_name, instance_name)
);

This generally makes sense for my context, and most queries are acceptably fast.

What is not fast is a query like this:

analytics_staging=> explain analyze select count(distinct group_id) from everything_crowberry;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=392177.29..392177.30 rows=1 width=4) (actual time=8909.698..8909.699 rows=1 loops=1)
   ->  Seq Scan on everything_crowberry (cost=0.00..384180.83 rows=3198583 width=4) (actual time=0.461..6347.272 rows=3198583 loops=1)
Planning time: 0.063 ms
Execution time: 8909.730 ms
(4 rows)

Time: 8910.110 ms
analytics_staging=> select count(distinct group_id) from everything_crowberry;
count
-------
481

Time: 8736.364 ms

I did create an index on group_id. While that index is used for WHERE clauses, it is not used above, so I conclude I've misunderstood how Postgres uses indexes. Note (per the query result above) that there are fewer than 500 distinct group_ids.

CREATE INDEX everything_crowberry_group_id ON everything_crowberry(group_id);

Any pointers on what I'm misunderstanding, or on how to make this particular query faster?

Update

To address questions raised in the comments, I've added the suggested changes here. For future readers, I've included the details to better show how this was debugged.

I notice that most of the time is spent in the initial aggregate.

Sequential scan

Turning off seqscan makes it much worse:

analytics_staging=> set enable_seqscan = false;

analytics_staging=> explain analyze select count(distinct group_id) from everything_crowberry;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=444062.28..444062.29 rows=1 width=4) (actual time=38927.323..38927.323 rows=1 loops=1)
   ->  Bitmap Heap Scan on everything_crowberry (cost=51884.99..436065.82 rows=3198583 width=4) (actual time=458.252..36167.789 rows=3198583 loops=1)
         Heap Blocks: exact=35734 lossy=316446
         ->  Bitmap Index Scan on everything_crowberry_group (cost=0.00..51085.35 rows=3198583 width=0) (actual time=448.537..448.537 rows=3198583 loops=1)
Planning time: 0.064 ms
Execution time: 38927.971 ms

Time: 38930.328 ms

WHERE makes it worse

Restricting the query to a very small set of group_ids makes it worse, even though I would have thought counting over a smaller set would be easier.

analytics_staging=> explain analyze select count(distinct group_id) from everything_crowberry WHERE group_id > 380;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=385954.43..385954.44 rows=1 width=4) (actual time=13438.422..13438.422 rows=1 loops=1)
   ->  Bitmap Heap Scan on everything_crowberry (cost=18742.95..383451.68 rows=1001099 width=4) (actual time=132.571..12673.233 rows=986572 loops=1)
         Recheck Cond: (group_id > 380)
         Rows Removed by Index Recheck: 70816
         Heap Blocks: exact=49632 lossy=79167
         ->  Bitmap Index Scan on everything_crowberry_group (cost=0.00..18492.67 rows=1001099 width=0) (actual time=120.816..120.816 rows=986572 loops=1)
               Index Cond: (group_id > 380)
Planning time: 1.294 ms
Execution time: 13439.017 ms
(9 rows)

Time: 13442.603 ms

explain(analyze, buffers)

analytics_staging=> explain(analyze, buffers) select count(distinct group_id) from everything_crowberry;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=392177.29..392177.30 rows=1 width=4) (actual time=7329.775..7329.775 rows=1 loops=1)
   Buffers: shared hit=16283 read=335912, temp read=4693 written=4693
   ->  Seq Scan on everything_crowberry (cost=0.00..384180.83 rows=3198583 width=4) (actual time=0.224..4615.015 rows=3198583 loops=1)
         Buffers: shared hit=16283 read=335912
Planning time: 0.089 ms
Execution time: 7329.818 ms

Time: 7331.084 ms

work_mem is too small (see the explain(analyze, buffers) output above)

Increasing it from the default 4 MB to 10 MB helps some, bringing the time down from about 7300 ms to about 5500 ms.
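For reference, the setting can be changed per session without touching the server config; a minimal sketch (the 10 MB value is just the one tried above, tune it to your workload; ALTER SYSTEM requires Postgres 9.4+ if you want to persist it):

```sql
-- per-session change, as used in the experiment above
SET work_mem = '10MB';

-- to persist across sessions (Postgres 9.4+), then reload the config:
-- ALTER SYSTEM SET work_mem = '10MB';
-- SELECT pg_reload_conf();
```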

Changing the SQL also helps a bit.

analytics_staging=> EXPLAIN(analyze, buffers) SELECT group_id FROM everything_crowberry GROUP BY group_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=392177.29..392181.56 rows=427 width=4) (actual time=4686.525..4686.612 rows=481 loops=1)
   Group Key: group_id
   Buffers: shared hit=96 read=352099
   ->  Seq Scan on everything_crowberry (cost=0.00..384180.83 rows=3198583 width=4) (actual time=0.034..4017.122 rows=3198583 loops=1)
         Buffers: shared hit=96 read=352099
Planning time: 0.094 ms
Execution time: 4686.686 ms

Time: 4687.461 ms

analytics_staging=> EXPLAIN(analyze, buffers) SELECT distinct group_id FROM everything_crowberry;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=392177.29..392181.56 rows=427 width=4) (actual time=5536.151..5536.262 rows=481 loops=1)
   Group Key: group_id
   Buffers: shared hit=128 read=352067
   ->  Seq Scan on everything_crowberry (cost=0.00..384180.83 rows=3198583 width=4) (actual time=0.030..4946.024 rows=3198583 loops=1)
         Buffers: shared hit=128 read=352067
Planning time: 0.074 ms
Execution time: 5536.321 ms

Time: 5537.380 ms

analytics_staging=> SELECT count(*) FROM (SELECT 1 FROM everything_crowberry GROUP BY group_id) ec;
count
-------
481

Time: 4927.671 ms

Creating a view is a big win, but it may create performance problems elsewhere.

analytics_production=> CREATE VIEW everything_crowberry_group_view AS select distinct group_id, group_type FROM everything_crowberry;
CREATE VIEW
analytics_production=> EXPLAIN(analyze, buffers) SELECT distinct group_id FROM everything_crowberry_group_view;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.56..357898.89 rows=200 width=4) (actual time=0.046..1976.882 rows=447 loops=1)
   Buffers: shared hit=667230 read=109291 dirtied=108 written=988
   ->  Subquery Scan on everything_crowberry_group_view (cost=0.56..357897.19 rows=680 width=4) (actual time=0.046..1976.616 rows=475 loops=1)
         Buffers: shared hit=667230 read=109291 dirtied=108 written=988
         ->  Unique (cost=0.56..357890.39 rows=680 width=8) (actual time=0.044..1976.378 rows=475 loops=1)
               Buffers: shared hit=667230 read=109291 dirtied=108 written=988
               ->  Index Only Scan using everything_crowberry_group_id_group_type_reg_user_id_catego_key on everything_crowberry (cost=0.56..343330.63 rows=2911953 width=8) (actual time=0.043..1656.409 rows=2912005 loops=1)
                     Heap Fetches: 290488
                     Buffers: shared hit=667230 read=109291 dirtied=108 written=988
Planning time: 1.842 ms
Execution time: 1977.086 ms

Best answer

For relatively few distinct values in group_id (many rows per group) - which seems to be your case:

3e6 rows / under 500 distinct group_id's

To make this fast, you need an index skip scan (a.k.a. loose index scan). That is not implemented up to Postgres 12, but you can work around the limitation with a recursive query:

Replace:

select count(distinct group_id) from everything_crowberry;

with:

WITH RECURSIVE cte AS (
   (SELECT group_id FROM everything_crowberry ORDER BY group_id LIMIT 1)
   UNION ALL
   SELECT (SELECT group_id FROM everything_crowberry
           WHERE  group_id > t.group_id ORDER BY group_id LIMIT 1)
   FROM   cte t
   WHERE  t.group_id IS NOT NULL
   )
SELECT count(group_id) FROM cte;

I use count(group_id) instead of the slightly faster count(*) to conveniently eliminate the NULL value produced by the final recursion step, since count(<expression>) only counts non-null values.

Also, it does not matter whether group_id can be NULL, since your query doesn't count those anyway.

This can use the index you already have:

CREATE INDEX everything_crowberry_group_id ON everything_crowberry(group_id);


For relatively many distinct values in group_id (few rows per group) - or for small tables - plain DISTINCT is faster. It's typically fastest when done in a subquery, rather than with a clause inside count():

SELECT count(group_id) -- or just count(*) to include possible NULL value
FROM (SELECT DISTINCT group_id FROM everything_crowberry) sub;
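As a side note for the WHERE group_id > 380 case from the question, which got slower with a plain count(distinct ...): the same recursive emulation can absorb that filter. A sketch, assuming the same table and index as above (the filter only needs to appear in the first leg, since the recursive leg already demands group_id > t.group_id):

```sql
WITH RECURSIVE cte AS (
   (SELECT group_id FROM everything_crowberry
    WHERE  group_id > 380 ORDER BY group_id LIMIT 1)  -- seed: smallest qualifying group_id
   UNION ALL
   SELECT (SELECT group_id FROM everything_crowberry
           WHERE  group_id > t.group_id ORDER BY group_id LIMIT 1)
   FROM   cte t
   WHERE  t.group_id IS NOT NULL
   )
SELECT count(group_id) FROM cte;
```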

Regarding "sql - Fast way to count distinct column values (using an index?)", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/57558942/
