
Does Oracle always use a HASH JOIN even when both tables are very large?


My understanding is that a HASH JOIN only makes sense if one of the two tables is small enough to fit in memory as a hash table.

But when I ran a query against Oracle where both tables had several hundred million rows, Oracle still came up with a hash join explain plan. Even when I tricked it with OPT_ESTIMATE(rows = ....) hints, it always decided to use a HASH JOIN instead of a sort merge join.
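For reference, this is roughly how I was looking at the plan (just a minimal sketch; big_table1, big_table2 and the id column are placeholders for my real tables):

explain plan for
select count(*)
from big_table1 t1
join big_table2 t2 on t1.id = t2.id;

--The plan keeps showing a HASH JOIN for the two large tables
select * from table(dbms_xplan.display);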

So I would like to know how a HASH JOIN is even possible when both tables are very large?

Thanks

Best Answer

Hash joins obviously work best when everything can fit in memory. But that does not mean they stop being the best join method when the tables cannot fit in memory. I think the only other realistic join method here is a sort merge join.
If the hash table cannot fit in memory, then sorting the tables for a sort merge join cannot fit in memory either. And a merge join needs to sort both tables. In my experience, hashing is always faster than sorting, for joining and for grouping.
But there are some exceptions. From the Oracle® Database Performance Tuning Guide, The Query Optimizer:

Hash joins generally perform better than sort merge joins. However, sort merge joins can perform better than hash joins if both of the following conditions exist:

  The row sources are sorted already.
  A sort operation does not have to be done.
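
As a hedged illustration of those two conditions (the orders and customers tables, their indexes, and the column names below are hypothetical, not part of the test that follows): if both row sources already arrive in join-key order, for example from index range scans, a merge join may not need separate sort steps. Whether the sorts were really skipped can be checked by looking for SORT JOIN operations in the plan.

--Hypothetical setup: indexes on the join key can deliver both inputs already ordered
create index orders_cust_idx on orders(customer_id);
create index customers_cust_idx on customers(customer_id);

explain plan for
select /*+ use_merge(o c) */ o.order_id, c.name
from orders o
join customers c on o.customer_id = c.customer_id
where o.customer_id between 100 and 200;

select * from table(dbms_xplan.display);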


Test
Rather than creating hundreds of millions of rows, it is easier to force Oracle to only use a very small amount of memory.
This chart shows that hash joins outperform merge joins, even when the tables are too large to fit into the (artificially limited) memory:
[Chart: Hash vs Merge join times]

Notes
For performance tuning it is usually better to work with bytes than with number of rows. But the "real" size of a table is hard to measure, which is why the chart displays rows. The sizes go roughly from 0.375 MB up to 14 MB. To double-check that these queries really do write to disk, you can run them with /*+ gather_plan_statistics */ and then query v$sql_plan_statistics_all.
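As a minimal sketch of that double-check (the &sql_id value is a placeholder you would look up in v$sql for your own statement):

select /*+ gather_plan_statistics */ count(*)
from test1 join test2 on test1.a = test2.a;

--A non-zero last_tempseg_size means the work area spilled to disk
select operation, last_memory_used, last_execution, last_tempseg_size
from v$sql_plan_statistics_all
where sql_id = '&sql_id'
order by id;
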
I only tested hash joins versus sort merge joins. I did not fully test nested loops, because that join method is always incredibly slow with large amounts of data. As a sanity check, I did compare it once with the last data size, and it ran for at least several minutes before I killed it.
I also tested different _area_sizes, sorted and unsorted data, and different distinctness of the join columns (more matches is more CPU-bound, fewer matches is more IO-bound), and got relatively similar results.
However, the results were different when the amount of memory was ridiculously small. With only a 32K sort|hash_area_size, the sort merge join was significantly faster. But if you have that little memory, you probably have more significant problems to worry about.
There are many other variables to consider, such as parallelism, hardware, bloom filters, and so on. People have probably written books on this subject; I have not tested even a small fraction of the possibilities. But hopefully this is enough to confirm the general consensus that hash joins are best for large data.

Code
Here are the scripts I used:
--Drop objects if they already exist
drop table test_10k_rows purge;
drop table test1 purge;
drop table test2 purge;

--Create a small table to hold rows to be added.
--("connect by" would run out of memory later when _area_sizes are small.)
--VARIABLE: More or less distinct values can change results. Changing
--"level" to something like "mod(level,100)" will result in more joins, which
--seems to favor hash joins even more.
create table test_10k_rows(a number, b number, c number, d number, e number);
insert /*+ append */ into test_10k_rows
select level a, 12345 b, 12345 c, 12345 d, 12345 e
from dual connect by level <= 10000;
commit;

--Restrict memory size to simulate running out of memory.
alter session set workarea_size_policy=manual;

--1 MB for hashing and sorting
--VARIABLE: Changing this may change the results. Setting it very low,
--such as 32K, will make merge sort joins faster.
alter session set hash_area_size = 1048576;
alter session set sort_area_size = 1048576;

--Tables to be joined
create table test1(a number, b number, c number, d number, e number);
create table test2(a number, b number, c number, d number, e number);

--Type to hold results
create or replace type number_table is table of number;

set serveroutput on;

--
--Compare hash and merge joins for different data sizes.
--
declare
v_hash_seconds number_table := number_table();
v_average_hash_seconds number;
v_merge_seconds number_table := number_table();
v_average_merge_seconds number;

v_size_in_mb number;
v_rows number;
v_begin_time number;
v_throwaway number;

--Increase the size of the table this many times
c_number_of_steps number := 40;
--Join the tables this many times
c_number_of_tests number := 5;

begin
--Clear existing data
execute immediate 'truncate table test1';
execute immediate 'truncate table test2';

--Print headings. Use tabs for easy import into spreadsheet.
dbms_output.put_line('Rows'||chr(9)||'Size in MB'
||chr(9)||'Hash'||chr(9)||'Merge');

--Run the test for many different steps
for i in 1 .. c_number_of_steps loop
v_hash_seconds.delete;
v_merge_seconds.delete;
--Add about 0.375 MB of data (roughly - depends on lots of factors)
--The order by will store the data randomly.
insert /*+ append */ into test1
select * from test_10k_rows order by dbms_random.value;

insert /*+ append */ into test2
select * from test_10k_rows order by dbms_random.value;

commit;

--Get the new size
--(Sizes may not increment uniformly)
select bytes/1024/1024 into v_size_in_mb
from user_segments where segment_name = 'TEST1';

--Get the rows. (select from both tables so they are equally cached)
select count(*) into v_rows from test1;
select count(*) into v_rows from test2;

--Perform the joins several times
for i in 1 .. c_number_of_tests loop
--Hash join
v_begin_time := dbms_utility.get_time;
select /*+ use_hash(test1 test2) */ count(*) into v_throwaway
from test1 join test2 on test1.a = test2.a;
v_hash_seconds.extend;
v_hash_seconds(i) := (dbms_utility.get_time - v_begin_time) / 100;

--Merge join
v_begin_time := dbms_utility.get_time;
select /*+ use_merge(test1 test2) */ count(*) into v_throwaway
from test1 join test2 on test1.a = test2.a;
v_merge_seconds.extend;
v_merge_seconds(i) := (dbms_utility.get_time - v_begin_time) / 100;
end loop;

--Get average times. Throw out the high and low results.
select ( sum(column_value) - max(column_value) - min(column_value) )
/ (count(*) - 2)
into v_average_hash_seconds
from table(v_hash_seconds);

select ( sum(column_value) - max(column_value) - min(column_value) )
/ (count(*) - 2)
into v_average_merge_seconds
from table(v_merge_seconds);

--Display size and times
dbms_output.put_line(v_rows||chr(9)||v_size_in_mb||chr(9)
||v_average_hash_seconds||chr(9)||v_average_merge_seconds);

end loop;
end;
/

Regarding "Does Oracle always use a HASH JOIN even when both tables are very large?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/8188093/
