gpt4 book ai didi

sql - 在 WHERE 语句中使用 OR 进行慢速 JOIN 查询

转载 作者:太空狗 更新时间:2023-10-30 01:47:12 29 4
gpt4 key购买 nike

这是我的问题的一个简单示例:

CREATE TABLE test1 (id SERIAL, key TEXT UNIQUE, value TEXT);
CREATE TABLE test2 (id SERIAL, key TEXT UNIQUE, value TEXT);

INSERT INTO test1 (key, value)
SELECT i::TEXT, 'ABC' || i::TEXT
FROM generate_series(0, 1000000) AS i;

INSERT INTO test2 (key, value)
SELECT i::TEXT, 'ABC' || (i+1000)::TEXT
FROM generate_series(0, 600000) AS i;

INSERT INTO test2 (key, value)
SELECT i::TEXT, 'ABC' || (i+1000)::TEXT
FROM generate_series(1000000, 1200000) AS i;

CREATE INDEX test1_key ON test1 (key);
CREATE INDEX test1_value ON test1 (value);
CREATE INDEX test2_key ON test2 (key);
CREATE INDEX test2_value ON test2 (value);

VACUUM FULL VERBOSE ANALYZE test1;
VACUUM FULL VERBOSE ANALYZE test2;

这是我当前正在使用的查询,但需要超过 6 秒。

EXPLAIN ANALYZE 
SELECT test1.key AS key1, test1.value AS value1,
test2.key AS key2, test2.value AS value2
FROM test1
LEFT OUTER JOIN test2 ON (test1.key = test2.key)
WHERE test1.value = 'ABC1234' OR test2.value = 'ABC1234';

key1 | value1 | key2 | value2
------+---------+------+---------
234 | ABC234 | 234 | ABC1234
1234 | ABC1234 | 1234 | ABC2234
(2 rows)

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=27344.05..79728.10 rows=2 width=32) (actual time=5428.635..6097.098 rows=2 loops=1)
Hash Cond: (test1.key = test2.key)
Filter: ((test1.value = 'ABC1234'::text) OR (test2.value = 'ABC1234'::text))
-> Seq Scan on test1 (cost=0.00..16321.01 rows=1000001 width=15) (actual time=0.009..1057.315 rows=1000001 loops=1)
-> Hash (cost=13047.02..13047.02 rows=800002 width=17) (actual time=2231.964..2231.964 rows=800002 loops=1)
Buckets: 65536 Batches: 2 Memory Usage: 14551kB
-> Seq Scan on test2 (cost=0.00..13047.02 rows=800002 width=17) (actual time=0.010..980.232 rows=800002 loops=1)
Total runtime: 6109.042 ms
(8 rows)

在这两个表中,只有极少数数据集符合要求,但似乎没有观察到这一事实。我可以改为使用这样的查询:

EXPLAIN ANALYZE 
SELECT coalesce(test1.key, test3.key1) AS key1, coalesce(test1.value, test3.value1) AS value1,
coalesce(test2.key, test3.key2) AS key2, coalesce(test2.value, test3.value2) AS value2
FROM (SELECT test1.key AS key1, test1.value AS value1,
test2.key AS key2, test2.value AS value2
FROM (SELECT key, value FROM test1 WHERE value = 'ABC1234') AS test1
FULL JOIN (SELECT key, value FROM test2 WHERE value = 'ABC1234') AS test2
ON (test1.key = test2.key)) AS test3
LEFT OUTER JOIN test1 ON (test1.key = test3.key2)
LEFT OUTER JOIN test2 ON (test2.key = test3.key1)
WHERE test1.key IS NOT NULL;

key1 | value1 | key2 | value2
------+---------+------+---------
1234 | ABC1234 | 1234 | ABC2234
234 | ABC234 | 234 | ABC1234
(2 rows)

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Left Join (cost=0.00..33.56 rows=1 width=64) (actual time=0.075..0.083 rows=1 loops=1)
-> Nested Loop (cost=0.00..25.19 rows=1 width=47) (actual time=0.066..0.072 rows=1 loops=1)
-> Nested Loop Left Join (cost=0.00..16.80 rows=1 width=32) (actual time=0.051..0.054 rows=1 loops=1)
-> Index Scan using test2_value_key on test2 (cost=0.00..8.41 rows=1 width=17) (actual time=0.026..0.027 rows=1 loops=1)
Index Cond: (value = 'ABC1234'::text)
-> Index Scan using test1_key on test1 (cost=0.00..8.38 rows=1 width=15) (actual time=0.020..0.020 rows=0 loops=1)
Index Cond: (public.test1.key = public.test2.key)
Filter: (public.test1.value = 'ABC1234'::text)
-> Index Scan using test1_key on test1 (cost=0.00..8.38 rows=1 width=15) (actual time=0.011..0.013 rows=1 loops=1)
Index Cond: ((public.test1.key IS NOT NULL) AND (public.test1.key = public.test2.key))
-> Index Scan using test2_key on test2 (cost=0.00..8.36 rows=1 width=17) (actual time=0.001..0.001 rows=0 loops=1)
Index Cond: (public.test2.key = public.test1.key)
Total runtime: 0.139 ms

下面的查询更简单,但仍然太慢:

EXPLAIN ANALYZE
SELECT test1.key AS key1, test1.value AS value1,
test2.key AS key2, test2.value AS value2
FROM test1
LEFT OUTER JOIN test2 ON (test1.key = test2.key)
WHERE test1.value = 'ABC1234'
OR EXISTS (SELECT 1 FROM test2 t WHERE t.key = test1.key AND t.value = 'ABC1234');

key1 | value1 | key2 | value2
------+---------+------+---------
1234 | ABC1234 | 1234 | ABC2234
234 | ABC234 | 234 | ABC1234
(2 rows)

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Merge Left Join (cost=0.00..8446826.32 rows=500001 width=32) (actual time=615.706..1651.370 rows=2 loops=1)
Merge Cond: (test1.key = test2.key)
-> Index Scan using test1_key on test1 (cost=0.00..8398983.25 rows=500001 width=15) (actual time=28.449..734.567 rows=2 loops=1)
Filter: ((value = 'ABC1234'::text) OR (alternatives: SubPlan 1 or hashed SubPlan 2))
SubPlan 1
-> Index Scan using test2_key on test2 t (cost=0.00..8.36 rows=1 width=0) (never executed)
Index Cond: (key = $0)
Filter: (value = 'ABC1234'::text)
SubPlan 2
-> Index Scan using test2_value on test2 t (cost=0.00..8.37 rows=1 width=7) (actual time=0.376..0.380 rows=1 loops=1)
Index Cond: (value = 'ABC1234'::text)
-> Index Scan using test2_key on test2 (cost=0.00..39593.05 rows=800002 width=17) (actual time=0.019..498.456 rows=348894 loops=1)
Total runtime: 1651.453 ms
(13 rows)


所以我的问题是:是否有一个简单的查询会导致类似于第二个查询的快速执行计划,或者可能是一个索引或某种类型的计划提示。

(我知道对于那个例子,只有一个表包含两个值是合理的。但实际上表更复杂,表方案不能那么容易改变。)


PostgreSQL Version: 9.0.3
shared_buffers = 64MB
effective_cache_size = 32MB
work_mem = 16MB
maintenance_work_mem = 32MB
temp_buffers = 8MB
wal_buffers= 1MB


编辑:根据 Kipotlov 的建议,这里是 UNION 版本。为什么普通的OR查询不选择这么好的方案?

EXPLAIN ANALYZE
SELECT test1.key AS key1, test1.value AS value1,
test2.key AS key2, test2.value AS value2
FROM test1
LEFT OUTER JOIN test2 ON (test1.key = test2.key)
WHERE test1.value = 'ABC1234'
UNION
SELECT test1.key AS key1, test1.value AS value1,
test2.key AS key2, test2.value AS value2
FROM test1
LEFT OUTER JOIN test2 ON (test1.key = test2.key)
WHERE test2.value = 'ABC1234';

key1 | value1 | key2 | value2
------+---------+------+---------
1234 | ABC1234 | 1234 | ABC2234
234 | ABC234 | 234 | ABC1234
(2 rows)

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=33.64..33.66 rows=2 width=32) (actual time=0.114..0.119 rows=2 loops=1)
-> Sort (cost=33.64..33.64 rows=2 width=32) (actual time=0.111..0.113 rows=2 loops=1)
Sort Key: public.test1.key, public.test1.value, public.test2.key, public.test2.value
Sort Method: quicksort Memory: 17kB
-> Append (cost=0.00..33.63 rows=2 width=32) (actual time=0.046..0.097 rows=2 loops=1)
-> Nested Loop Left Join (cost=0.00..16.81 rows=1 width=32) (actual time=0.044..0.050 rows=1 loops=1)
-> Index Scan using test1_value_key on test1 (cost=0.00..8.44 rows=1 width=15) (actual time=0.023..0.024 rows=1 loops=1)
Index Cond: (value = 'ABC1234'::text)
-> Index Scan using test2_key on test2 (cost=0.00..8.36 rows=1 width=17) (actual time=0.014..0.016 rows=1 loops=1)
Index Cond: (public.test1.key = public.test2.key)
-> Nested Loop (cost=0.00..16.80 rows=1 width=32) (actual time=0.036..0.041 rows=1 loops=1)
-> Index Scan using test2_value_key on test2 (cost=0.00..8.41 rows=1 width=17) (actual time=0.019..0.020 rows=1 loops=1)
Index Cond: (value = 'ABC1234'::text)
-> Index Scan using test1_key on test1 (cost=0.00..8.38 rows=1 width=15) (actual time=0.013..0.015 rows=1 loops=1)
Index Cond: (public.test1.key = public.test2.key)
Total runtime: 0.173 ms
(16 rows)

最佳答案

首先,感谢您提出非常详细的问题。很少有人会在提问之前对他们的问题进行如此详细的研究。

我一直在考虑这个问题,问题似乎是 PostgreSQL 想要连接所有行,因为来自 test1 的每个不匹配的行都可能在 test2 中匹配——反之亦然.

解决方案是强制计划器分两步执行查询。一种方法是您已经尝试过的大型 UNION 查询——强制它在单独的查询中考虑每个表达式。

另一种方法是强制规划器首先找到匹配的键,然后执行连接,这样就不会出现歧义:

EXPLAIN ANALYZE
SELECT test1.key AS key1, test1.value AS value1,
test2.key AS key2, test2.value AS value2
FROM (
SELECT key FROM test1 WHERE value='ABC1234'
UNION SELECT key FROM test2 WHERE value='ABC1234'
) AS matching_keys
INNER JOIN test1 USING (key)
LEFT OUTER JOIN test2 USING (key);

Nested Loop Left Join (cost=16.84..34.44 rows=2 width=32) (actual time=0.211..0.280 rows=2 loops=1)
-> Nested Loop (cost=16.84..33.65 rows=2 width=15) (actual time=0.175..0.212 rows=2 loops=1)
-> Unique (cost=16.84..16.85 rows=2 width=6) (actual time=0.132..0.136 rows=2 loops=1)
-> Sort (cost=16.84..16.85 rows=2 width=6) (actual time=0.131..0.132 rows=2 loops=1)
Sort Key: public.test1.key
Sort Method: quicksort Memory: 25kB
-> Append (cost=0.00..16.83 rows=2 width=6) (actual time=0.058..0.110 rows=2 loops=1)
-> Index Scan using test1_value on test1 (cost=0.00..8.42 rows=1 width=6) (actual time=0.056..0.058 rows=1 loops=1)
Index Cond: (value = 'ABC1234'::text)
-> Index Scan using test2_value on test2 (cost=0.00..8.39 rows=1 width=7) (actual time=0.046..0.047 rows=1 loops=1)
Index Cond: (value = 'ABC1234'::text)
-> Index Scan using test1_key on test1 (cost=0.00..8.38 rows=1 width=15) (actual time=0.032..0.033 rows=1 loops=2)
Index Cond: (key = public.test1.key)
-> Index Scan using test2_key on test2 (cost=0.00..0.38 rows=1 width=17) (actual time=0.028..0.029 rows=1 loops=2)
Index Cond: (public.test1.key = key)
Total runtime: 0.390 ms
(16 rows)

UNION 再次充当 OR 的角色。不幸的是,对于像 value>'ABC1234' 这样的查询,这种方法仍然表现不佳。您可以通过提高 work_mem 来稍微改进它。我在这里不知所措。


关于你的最后一个问题:

Why does the normal OR query not choose such a good plan?

因为 PostgreSQL 规划器目前缺乏将 OR 表达式拆分为单独的 UNION 查询的能力。有一些注意事项使它比看起来更难。

PostgreSQL 规划器已经非常精细,但到目前为止,利用手动重写 SQL 已经可以实现的优化还不是一个重要的优先事项。

关于sql - 在 WHERE 语句中使用 OR 进行慢速 JOIN 查询,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/5353363/

29 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com