database - Optimizing a large IN condition in a Redshift query

Reposted. Author: 太空狗. Updated: 2023-10-30 01:44:53

I have a ~2 TB, fully vacuumed Redshift table with distkey phash (high cardinality, hundreds of millions of values) and a compound sortkey (phash, last_seen).

When I run a query like this:

SELECT DISTINCT ret_field
FROM table
WHERE phash IN (
    '5c8615fa967576019f846b55f11b6e41',
    '8719c8caa9740bec10f914fc2434ccfd',
    '9b657c9f6bf7c5bbd04b5baf94e61dae'
)
AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'

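it returns very quickly. However, once I increase the number of hashes beyond 10, Redshift converts the IN condition from a series of ORs into an array, as described at http://docs.aws.amazon.com/redshift/latest/dg/r_in_condition.html#r_in_condition-optimization-for-large-in-lists

The problem is that with a few dozen phash values, the response time of the "optimized" query goes from under a second to over half an hour. In other words, it stops using the sortkey and does a full table scan.

Any idea how to prevent this behavior and keep using the sortkey so the query stays fast?

Here is the difference in EXPLAIN output between <10 hashes and >10 hashes:

Fewer than 10 hashes (0.4 seconds):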

XN Unique  (cost=0.00..157253450.20 rows=43 width=27)
-> XN Seq Scan on table (cost=0.00..157253393.92 rows=22510 width=27)
Filter: ((((phash)::text = '394e9a527f93377912cbdcf6789787f1'::text) OR ((phash)::text = '4534f9f8f68cc937f66b50760790c795'::text) OR ((phash)::text = '5c8615fa967576019f846b55f11b6e61'::text) OR ((phash)::text = '5d5743a86b5ff3d60b133c6475e7dce0'::text) OR ((phash)::text = '8719c8caa9740bec10f914fc2434cced'::text) OR ((phash)::text = '9b657c9f6bf7c5bbd04b5baf94e61d9e'::text) OR ((phash)::text = 'd7337d324be519abf6dbfd3612aad0c0'::text) OR ((phash)::text = 'ea43b04ac2f84710dd1f775efcd5ab40'::text)) AND (last_seen >= '2015-10-01 00:00:00'::timestamp without time zone) AND (last_seen <= '2015-10-31 23:59:59'::timestamp without time zone))

More than 10 hashes (45-60 minutes):

XN Unique  (cost=0.00..181985241.25 rows=1717530 width=27)
-> XN Seq Scan on table (cost=0.00..179718164.48 rows=906830708 width=27)
Filter: ((last_seen >= '2015-10-01 00:00:00'::timestamp without time zone) AND (last_seen <= '2015-10-31 23:59:59'::timestamp without time zone) AND ((phash)::text = ANY ('{33b84c5775b6862df965a0e00478840e,394e9a527f93377912cbdcf6789787f1,3d27b96948b6905ffae503d48d75f3d1,4534f9f8f68cc937f66b50760790c795,5a63cd6686f7c7ed07a614e245da60c2,5c8615fa967576019f846b55f11b6e61,5d5743a86b5ff3d60b133c6475e7dce0,8719c8caa9740bec10f914fc2434cced,9b657c9f6bf7c5bbd04b5baf94e61d9e,d7337d324be519abf6dbfd3612aad0c0,dbf4c743832c72e9c8c3cc3b17bfae5f,ea43b04ac2f84710dd1f775efcd5ab40,fb4b83121cad6d23e6da6c7b14d2724c}'::text[])))

Best answer

You can try using a temporary table or a subquery:

SELECT DISTINCT t.ret_field
FROM table t
JOIN (
    SELECT '5c8615fa967576019f846b55f11b6e41' AS phash
    UNION ALL
    SELECT '8719c8caa9740bec10f914fc2434ccfd' AS phash
    UNION ALL
    SELECT '9b657c9f6bf7c5bbd04b5baf94e61dae' AS phash
    -- add one more UNION ALL branch per additional hash
) AS sub
ON t.phash = sub.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';
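
With dozens of hashes, writing one UNION ALL branch per value by hand gets tedious. Here is a minimal sketch of generating the same join on the client side. The function name `build_join_query` is hypothetical, `table`, `ret_field`, `phash` and `last_seen` are the placeholders from the question, and the `%s` placeholders assume a DB-API driver that uses that parameter style (psycopg2 does).

```python
def build_join_query(hashes, start, end):
    """Return (sql, params) joining `table` to an inline list of hashes.

    Hypothetical helper: the values are passed as bound parameters rather
    than being pasted into the SQL text.
    """
    hashes = list(hashes)
    if not hashes:
        raise ValueError("need at least one hash")
    # One "SELECT %s AS phash" branch per hash, glued together with UNION ALL.
    branches = "\n    UNION ALL\n".join("    SELECT %s AS phash" for _ in hashes)
    sql = (
        "SELECT DISTINCT t.ret_field\n"
        "FROM table t\n"
        "JOIN (\n"
        f"{branches}\n"
        ") AS sub\n"
        "ON t.phash = sub.phash\n"
        "WHERE t.last_seen BETWEEN %s AND %s;"
    )
    # Parameter order matches placeholder order: hashes first, then the dates.
    return sql, hashes + [start, end]
```

The result would be run with something like `cursor.execute(sql, params)`. Whether the planner treats this join better than the array form is not guaranteed, so it is worth checking the EXPLAIN output for the generated query.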

Or run the search in chunks (if the query optimizer merges them back into one, use an auxiliary table to store the intermediate results):

SELECT ret_field
FROM table
WHERE phash IN (
    '5c8615fa967576019f846b55f11b6e41',
    '8719c8caa9740bec10f914fc2434ccfd',
    '9b657c9f6bf7c5bbd04b5baf94e61dae'
)
AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION
SELECT ret_field
FROM table
WHERE phash IN (
    -- more hashes
)
AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION
-- ...

If the query optimizer merges these into a single query, you can try using a temporary table to hold the intermediate results.

Edit:
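
The chunked search can also be generated rather than written by hand. The sketch below splits the hash list into groups of at most 10 values, the threshold mentioned in the question, and combines one SELECT per group with UNION. `chunked` and `build_chunked_query` are hypothetical names, and the `%s` placeholders again assume a driver such as psycopg2.

```python
def chunked(values, size):
    """Yield consecutive slices of `values` with at most `size` items each."""
    for i in range(0, len(values), size):
        yield values[i:i + size]


def build_chunked_query(hashes, start, end, chunk_size=10):
    """Return (sql, params): one SELECT per chunk of hashes, combined with UNION.

    Assumption: chunk_size=10 keeps each IN list at or below the size after
    which the question reports the switch to the array form.
    """
    hashes = list(hashes)
    if not hashes:
        raise ValueError("need at least one hash")
    parts, params = [], []
    for chunk in chunked(hashes, chunk_size):
        placeholders = ", ".join(["%s"] * len(chunk))
        parts.append(
            "SELECT ret_field\n"
            "FROM table\n"
            f"WHERE phash IN ({placeholders})\n"
            "AND last_seen BETWEEN %s AND %s"
        )
        # Keep parameters in the same order as the placeholders above.
        params.extend(chunk)
        params.extend([start, end])
    # UNION (not UNION ALL) removes duplicates, matching the original DISTINCT.
    return "\nUNION\n".join(parts) + ";", params
```

If EXPLAIN shows that the optimizer still collapses the branches into one array filter, the same chunks could instead be run one at a time and their results collected in a temporary table, as suggested above.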

SELECT DISTINCT t.ret_field
FROM table t
JOIN (SELECT ... AS phash
FROM ...
) AS sub
ON t.phash = sub.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';

This question, "database - Optimizing a large IN condition in a Redshift query", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/33764635/
