mysql - Optimizing a slow COUNT(DISTINCT) even with a covering index

Reposted. Author: 可可西里. Updated: 2023-11-01 06:32:36

We have a MySQL table with roughly 30 million rows. Its structure is as follows:

CREATE TABLE `campaign_logs` (
  `domain` varchar(50) DEFAULT NULL,
  `campaign_id` varchar(50) DEFAULT NULL,
  `subscriber_id` varchar(50) DEFAULT NULL,
  `message` varchar(21000) DEFAULT NULL,
  `log_time` datetime DEFAULT NULL,
  `log_type` varchar(50) DEFAULT NULL,
  `level` varchar(50) DEFAULT NULL,
  `campaign_name` varchar(500) DEFAULT NULL,
  KEY `subscriber_id_index` (`subscriber_id`),
  KEY `log_type_index` (`log_type`),
  KEY `log_time_index` (`log_time`),
  KEY `campid_domain_logtype_logtime_subid_index` (`campaign_id`,`domain`,`log_type`,`log_time`,`subscriber_id`),
  KEY `domain_logtype_logtime_index` (`domain`,`log_type`,`log_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Here is my query.

I am using UNION ALL instead of an IN condition.

SELECT log_type,
       DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
       COUNT(DISTINCT subscriber_id) AS COUNT,
       COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
  AND campaign_id='123'
  AND log_type = 'EMAIL_OPENED'
  AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date

UNION ALL

SELECT log_type,
       DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
       COUNT(DISTINCT subscriber_id) AS COUNT,
       COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
  AND campaign_id='123'
  AND log_type = 'EMAIL_SENT'
  AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date

UNION ALL

SELECT log_type,
       DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
       COUNT(DISTINCT subscriber_id) AS COUNT,
       COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
  AND campaign_id='123'
  AND log_type = 'EMAIL_CLICKED'
  AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date;

Here is the EXPLAIN output:

+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
| 1 | PRIMARY | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468 | NULL | 55074 | Using where; Using index; Using filesort |
| 2 | UNION | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468 | NULL | 330578 | Using where; Using index; Using filesort |
| 3 | UNION | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468 | NULL | 1589 | Using where; Using index; Using filesort |
| NULL | UNION RESULT | <union1,2,3> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
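1. I changed COUNT(subscriber_id) to COUNT(*), but saw no performance gain.

2. I removed COUNT(DISTINCT subscriber_id) from the query and got a huge performance gain: results come back in about 1.5 seconds, where it previously took 50 seconds to 1 minute. However, I need the distinct count of subscriber_id from the query.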

Here is the EXPLAIN output when I remove COUNT(DISTINCT subscriber_id) from the query:

+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
| 1 | PRIMARY | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468 | NULL | 55074 | Using where; Using index; Using temporary; Using filesort |
| 2 | UNION | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468 | NULL | 330578 | Using where; Using index; Using temporary; Using filesort |
| 3 | UNION | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468 | NULL | 1589 | Using where; Using index; Using temporary; Using filesort |
| NULL | UNION RESULT | <union1,2,3> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
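3. I ran the three queries separately, without UNION ALL. One query took 32 seconds and each of the others took 1.5 seconds, but the first query processes about 350K rows while the others process only about 2K rows.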

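I could solve my performance problem by leaving out COUNT(DISTINCT ...), but I need those values. Is there a way to restructure the query, add an index, or do something else so that I get the COUNT(DISTINCT ...) values faster?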

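Update: the following describes the data distribution in the table above.

For 1 domain: 1 campaign, 20 log types, 1k-200k subscribers.

The query above is run for a domain that has more than 180,000 subscribers.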

Best answer

If the query is faster without COUNT(DISTINCT), perhaps you can use nested aggregation:

SELECT log_type, log_date,
       COUNT(*) AS COUNT, SUM(cnt) AS total
FROM (SELECT log_type,
             DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
             subscriber_id, COUNT(*) AS cnt
      FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
      WHERE DOMAIN = 'xxx' AND
            campaign_id = '123' AND
            log_type IN ('EMAIL_SENT', 'EMAIL_OPENED', 'EMAIL_CLICKED') AND
            log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND
                             CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
      GROUP BY log_type, log_date, subscriber_id
     ) l
GROUP BY log_type, log_date;

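With luck, this will take 2-3 seconds rather than 50. However, you may need to break it up into subqueries to get the full performance. So if this does not give a significant improvement, change the IN back to = for a single log type. If that works, then UNION ALL may be necessary.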
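The equivalence behind the nested aggregation can be checked on a toy table. The following is a minimal sketch, not the original setup: it uses Python's built-in `sqlite3` rather than MySQL, the rows are invented, and `log_date` is stored as a precomputed column because SQLite has no `CONVERT_TZ` or `DATE_FORMAT`. It only shows that grouping by subscriber first and then counting groups gives the same numbers as `COUNT(DISTINCT ...)`; it says nothing about timing.

```python
import sqlite3

# Invented sample rows: (log_type, log_date, subscriber_id).
# log_date is precomputed here; in the MySQL query it is derived from log_time.
rows = [
    ("EMAIL_SENT", "1 PM", "s1"),
    ("EMAIL_SENT", "1 PM", "s1"),
    ("EMAIL_SENT", "1 PM", "s2"),
    ("EMAIL_SENT", "2 PM", "s1"),
    ("EMAIL_OPENED", "1 PM", "s2"),
    ("EMAIL_OPENED", "1 PM", "s2"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE campaign_logs (log_type TEXT, log_date TEXT, subscriber_id TEXT)"
)
conn.executemany("INSERT INTO campaign_logs VALUES (?, ?, ?)", rows)

# Direct form: one aggregation with COUNT(DISTINCT ...).
direct = conn.execute("""
    SELECT log_type, log_date,
           COUNT(DISTINCT subscriber_id) AS distinct_subs,
           COUNT(subscriber_id) AS total
    FROM campaign_logs
    GROUP BY log_type, log_date
    ORDER BY log_type, log_date
""").fetchall()

# Nested form: group by subscriber first, then count the groups.
nested = conn.execute("""
    SELECT log_type, log_date,
           COUNT(*) AS distinct_subs,
           SUM(cnt) AS total
    FROM (SELECT log_type, log_date, subscriber_id, COUNT(*) AS cnt
          FROM campaign_logs
          GROUP BY log_type, log_date, subscriber_id) AS l
    GROUP BY log_type, log_date
    ORDER BY log_type, log_date
""").fetchall()

assert direct == nested
```

One caveat: the two forms agree only when `subscriber_id` is never NULL in the selected rows. `COUNT(DISTINCT subscriber_id)` ignores NULLs, whereas the nested form counts the NULL group as one subscriber, so a `subscriber_id IS NOT NULL` filter may be needed if such rows exist.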

Edit:

Another thing to try is using variables to enumerate the values before the GROUP BY:

SELECT log_type, log_date, COUNT(*) AS cnt,
       SUM(rn = 1) AS sub_cnt
FROM (SELECT log_type,
             DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
             subscriber_id,
             -- rn restarts at 1 whenever (log_type, log_date, subscriber_id) changes,
             -- so SUM(rn = 1) counts each subscriber once per group.
             -- Caveat: MySQL does not guarantee the evaluation order of user variables.
             (@rn := IF(@clt = CONCAT_WS(':', log_type, DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p'), subscriber_id), @rn + 1,
                        IF(@clt := CONCAT_WS(':', log_type, DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p'), subscriber_id), 1, 1)
                       )
             ) AS rn
      FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index) CROSS JOIN
           (SELECT @rn := 0, @clt := '') params -- a derived table must have an alias
      WHERE DOMAIN = 'xxx' AND
            campaign_id = '123' AND
            log_type IN ('EMAIL_SENT', 'EMAIL_OPENED', 'EMAIL_CLICKED') AND
            log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00', '+00:00', '+05:30') AND
                             CONVERT_TZ('2015-03-01 23:59:58', '+00:00', '+05:30')
      ORDER BY log_type, log_date, subscriber_id
     ) t
GROUP BY log_type, log_date;

This still requires another sort of the data, but it might help.
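The enumeration idea can be traced outside the database. The sketch below is a plain-Python illustration under assumed inputs, not the MySQL implementation: the hypothetical function `count_with_row_numbers` sorts invented `(log_type, log_date, subscriber_id)` tuples, numbers consecutive repeats of the same tuple, and treats the rows numbered 1 as the distinct subscribers of each `(log_type, log_date)` group.

```python
from collections import defaultdict


def count_with_row_numbers(rows):
    """Return {(log_type, log_date): (total_rows, distinct_subscribers)}.

    rows is an iterable of (log_type, log_date, subscriber_id) tuples.
    """
    totals = defaultdict(int)
    distinct = defaultdict(int)
    prev_key = None
    rn = 0
    # The explicit sort plays the role of ORDER BY log_type, log_date, subscriber_id.
    for row in sorted(rows):
        # Same key as the previous row: keep numbering; otherwise restart at 1.
        rn = rn + 1 if row == prev_key else 1
        prev_key = row
        group = row[:2]
        totals[group] += 1
        if rn == 1:  # first occurrence of this subscriber within the group
            distinct[group] += 1
    return {group: (totals[group], distinct[group]) for group in totals}


# Invented sample data for illustration only.
sample = [
    ("EMAIL_SENT", "1 PM", "s1"),
    ("EMAIL_SENT", "1 PM", "s2"),
    ("EMAIL_SENT", "1 PM", "s1"),
    ("EMAIL_OPENED", "1 PM", "s2"),
    ("EMAIL_OPENED", "1 PM", "s2"),
]
result = count_with_row_numbers(sample)
```

The sort is the costly step here as well, which matches the remark that this approach still has to sort the data once.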

Regarding mysql - Optimizing a slow COUNT(DISTINCT) even with a covering index, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29076679/
