
sql - How to speed up a slow multi-join query in a small Vertica database (~120K rows total, 10 minutes)

Reposted. Author: 行者123. Updated: 2023-12-02 04:11:29

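I'd love your help understanding why this heavily joined query takes about 10 minutes to run on a small database of 7 tables (< 120K rows total), and ideally your suggestions on how to speed it up on our small four-node cluster. I've put supporting information here: https://gist.github.com/anonymous/8862796 (the list of tables, the list of fields per table, and the table sizes); the query and its EXPLAIN VERBOSE output follow below. I ran ANALYZE_WORKLOAD() on this query and then, following its recommendation, ran ANALYZE_STATISTICS on all tables. This gave no improvement. I then followed its second recommendation and ran the Database Designer, which made performance slower. I would greatly appreciate your help.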

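PROFILE information

Thanks for the tip below about PROFILE. I ran it and put the results here: https://gist.github.com/anonymous/8935190 . The output is 8K lines long, so maybe I didn't run it correctly (details are in the gist). Question: how do I start analyzing it?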
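One possible way to start on the PROFILE output, sketched under assumptions: rather than reading all 8K lines, aggregate the counters in `v_monitor.execution_engine_profiles` by `path_id`, so that each row lines up with a PATH ID in the EXPLAIN output. The `transaction_id` and `statement_id` values below are placeholders; the real ones are printed in the hint Vertica emits when a statement is profiled. The `SELECT COUNT(*)` is only a stand-in for the real query.

```sql
-- Step 1: profile the statement. Vertica prints a NOTICE/HINT that contains
-- the transaction_id and statement_id of the profiled run.
PROFILE SELECT COUNT(*) FROM movielens_test.rates;  -- stand-in; substitute the real query

-- Step 2: total the execution time per plan path, slowest first.
-- The two ids are placeholders: copy them from the hint printed in step 1.
SELECT path_id,
       operator_name,
       SUM(counter_value) AS total_us
FROM v_monitor.execution_engine_profiles
WHERE transaction_id = 45035996273705982   -- placeholder
  AND statement_id   = 1                   -- placeholder
  AND counter_name   = 'execution time (us)'
GROUP BY path_id, operator_name
ORDER BY total_us DESC
LIMIT 20;

-- Step 3: compare the rows each path actually produced with the
-- "Rows:" estimates in the EXPLAIN output.
SELECT path_id,
       operator_name,
       SUM(counter_value) AS rows_produced
FROM v_monitor.execution_engine_profiles
WHERE transaction_id = 45035996273705982   -- placeholder
  AND statement_id   = 1                   -- placeholder
  AND counter_name   = 'rows produced'
GROUP BY path_id, operator_name
ORDER BY rows_produced DESC
LIMIT 20;
```

If a path produces far more rows than its estimate (75,575 for every join in the plan below), that join is a likely place to look first.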

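Query backstory

The query is convoluted mainly because it is generated dynamically for each run of our machine learning research software, which has to apply various conditions while traversing the E-R tables involved as a graph. In this case the path is [rates, movie, rates, ml_user, rates, movie, rates]. The query is built up incrementally as the program explores the solution space, which is why it (currently) has none of the optimizations that @wumpz and @Bohemian kindly and correctly suggest below, such as eliminating the sub-selects. This means I'm somewhat stuck with its current form in the short term :-/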
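For reference, here is a sketch of what eliminating the sub-selects could look like for this particular path. It is not necessarily what the commenters proposed, and it is only equivalent to the generated query under assumptions the source does not confirm: `rates.id`, `movie.id` and `ml_user.id` are unique, and every non-NULL `rates.movie_id` and `rates.ml_user_id` has a matching row in `movie` and `ml_user`. Under those assumptions the `movie` and `ml_user` tables only relay equality between `rates` rows, so they can be dropped from the join, and the two outer sub-selects collapse into a single scan of `rates`.

```sql
-- Sketch only: assumes unique ids and intact foreign keys (see lead-in).
SELECT r.id     AS id,
       v.val    AS avg_val,   -- was relVarTable1.val
       r.rating AS val        -- was relVarTable2.val (NULL when rating IS NULL)
FROM rates r
LEFT JOIN (
    SELECT rates1.id AS id, AVG(rates4.rating) AS val
    FROM rates rates1
    JOIN rates rates2                       -- same movie, different rating row
      ON rates2.movie_id = rates1.movie_id
     AND rates2.id <> rates1.id
    JOIN rates rates3                       -- same user, different rating row
      ON rates3.ml_user_id = rates2.ml_user_id
     AND rates3.id <> rates2.id
     AND rates3.movie_id <> rates1.movie_id -- was movie1.id <> movie2.id
    JOIN rates rates4                       -- same second movie, different row
      ON rates4.movie_id = rates3.movie_id
     AND rates4.id <> rates3.id
    WHERE rates4.rating IS NOT NULL
    GROUP BY rates1.id
) v ON r.id = v.id;
```

This removes three of the six inner joins and both outer merge joins against `rates`, but the result should be checked against the generated query on real data before relying on it.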

 ------------------------------ 
QUERY PLAN DESCRIPTION:
------------------------------

Opt Vertica Options
--------------------
PLAN_OUTPUT_SUPER_VERBOSE


EXPLAIN VERBOSE
SELECT relVarTable0.id AS id, relVarTable1.val, relVarTable2.val
FROM (SELECT id FROM rates) relVarTable0
LEFT JOIN
(SELECT rates1.id AS id, AVG(rates4.rating) AS val
FROM rates rates1, movie movie1, rates rates2, ml_user ml_user1, rates rates3, movie movie2, rates rates4
WHERE movie1.id = rates1.movie_id AND movie1.id = rates2.movie_id AND ml_user1.id = rates2.ml_user_id AND ml_user1.id = rates3.ml_user_id AND movie2.id = rates3.movie_id AND movie2.id = rates4.movie_id AND movie1.id <> movie2.id AND rates1.id <> rates2.id AND rates2.id <> rates3.id AND rates3.id <> rates4.id AND rates4.rating IS NOT NULL
GROUP BY rates1.id) relVarTable1
ON relVarTable0.id = relVarTable1.id
LEFT JOIN
(SELECT rates1.id AS id, rates1.rating AS val
FROM rates rates1
WHERE rates1.rating IS NOT NULL ) relVarTable2
ON relVarTable0.id = relVarTable2.id;

Access Path:
Sort Key: (V(1,1))
LDISTRIB_UNSEGMENTED
+-JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 4489.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 5441368.000000 Memory(B): 1209184.000000 Netwrk(B): 1209184.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 40] (PATH ID: 1) Inner (RESEGMENT)
| Join Cond: (relVarTable0.id = relVarTable2.id)
| Execute on: All Nodes
| Sort Key: (V(1,1))
| LDISTRIB_UNSEGMENTED
| +-- Outer -> JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 4197.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 1369200.000000 Memory(B): 0.000000 Netwrk(B): 604600.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 2) Outer (RESEGMENT)
| | Join Cond: (relVarTable0.id = relVarTable1.id)
| | Execute on: All Nodes
| | Sort Key: (V(1,1))
| | LDISTRIB_UNSEGMENTED
| | +-- Outer -> SELECT [Cost: 20.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 3)
| | | Execute on: All Nodes
| | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | LDISTRIB_UNSEGMENTED
| | | +---> STORAGE ACCESS for rates [Cost: 20.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 4)
| | | | Column Cost Aspects: [ Disk(B): 196608.000000 CPU(B): 0.000000 Memory(B): 604600.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
| | | | Projection: movielens_test.rates_b0
| | | | Materialize: rates.id
| | | | Execute on: All Nodes
| | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | LDISTRIB_SEGMENTED
| | +-- Inner -> SELECT [Cost: 4067.000000, Rows: 10000.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 5)
| | | Execute on: All Nodes
| | | Sort Key: (rates.id)
| | | LDISTRIB_UNSEGMENTED
| | | +---> GROUPBY HASH (SORT OUTPUT) (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 4067.000000, Rows: 10000.000000 Disk(B): 0.000000 CPU(B): 6650600.000000 Memory(B): 640000.000000 Netwrk(B): 6890600.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 6)
| | | | Aggregates: sum_float(<SVAR>), count(<SVAR>)
| | | | Group By: rates1.id
| | | | Execute on: All Nodes
| | | | Sort Key: (rates.id)
| | | | LDISTRIB_SEGMENTED
| | | | +---> JOIN HASH [Cost: 2869.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 12091944.000000 Memory(B): 3022960.000000 Netwrk(B): 1813776.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 88] (PATH ID: 7) Inner (RESEGMENT)
| | | | | Join Cond: (movie2.id = rates4.movie_id)
| | | | | Join Filter: (rates3.id <> rates4.id)
| | | | | Execute on: All Nodes
| | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | LDISTRIB_UNSEGMENTED
| | | | | +-- Outer -> JOIN HASH [Cost: 2395.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 9110592.000000 Memory(B): 41592.000000 Netwrk(B): 4246064.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 64] (PATH ID: 8) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
| | | | | | Join Cond: (movie2.id = rates3.movie_id)
| | | | | | Join Filter: (movie1.id <> movie2.id)
| | | | | | Execute on: All Nodes
| | | | | | Runtime Filter: (SIP1(HashJoin): movie2.id)
| | | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | | LDISTRIB_SEGMENTED
| | | | | | +-- Outer -> JOIN HASH [Cost: 1625.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 10278200.000000 Memory(B): 3023000.000000 Netwrk(B): 1813800.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 56] (PATH ID: 9) Inner (RESEGMENT)
| | | | | | | Join Cond: (ml_user1.id = rates3.ml_user_id)
| | | | | | | Join Filter: (rates2.id <> rates3.id)
| | | | | | | Execute on: All Nodes
| | | | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | | | LDISTRIB_UNSEGMENTED
| | | | | | | +-- Outer -> JOIN HASH [Cost: 1163.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 5582544.000000 Memory(B): 141144.000000 Netwrk(B): 2465448.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 40] (PATH ID: 10) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
| | | | | | | | Join Cond: (ml_user1.id = rates2.ml_user_id)
| | | | | | | | Execute on: All Nodes
| | | | | | | | Runtime Filter: (SIP2(HashJoin): ml_user1.id)
| | | | | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | | | | LDISTRIB_SEGMENTED
| | | | | | | | +-- Outer -> JOIN HASH [Cost: 711.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 8464400.000000 Memory(B): 2418400.000000 Netwrk(B): 1813800.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 32] (PATH ID: 11) Outer (RESEGMENT)(LOCAL ROUND ROBIN)
| | | | | | | | | Join Cond: (movie1.id = rates2.movie_id)
| | | | | | | | | Join Filter: (rates1.id <> rates2.id)
| | | | | | | | | Execute on: All Nodes
| | | | | | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | | | | | LDISTRIB_SEGMENTED
| | | | | | | | | +-- Outer -> STORAGE ACCESS for rates2 [Cost: 59.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 12)
| | | | | | | | | | Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 0.000000 Memory(B): 1813800.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
| | | | | | | | | | Projection: movielens_test.rates_b0
| | | | | | | | | | Materialize: rates2.id, rates2.ml_user_id, rates2.movie_id
| | | | | | | | | | Execute on: All Nodes
| | | | | | | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | | | | | | LDISTRIB_SEGMENTED
| | | | | | | | | +-- Inner -> JOIN HASH [Cost: 268.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 3064592.000000 Memory(B): 41592.000000 Netwrk(B): 1223064.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 13) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
| | | | | | | | | | Join Cond: (movie1.id = rates1.movie_id)
| | | | | | | | | | Execute on: All Nodes
| | | | | | | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | | | | | | LDISTRIB_SEGMENTED
| | | | | | | | | | +-- Outer -> STORAGE ACCESS for rates1 [Cost: 39.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 14)
| | | | | | | | | | | Column Cost Aspects: [ Disk(B): 393216.000000 CPU(B): 0.000000 Memory(B): 1209200.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
| | | | | | | | | | | Projection: movielens_test.rates_b0
| | | | | | | | | | | Materialize: rates1.id, rates1.movie_id
| | | | | | | | | | | Execute on: All Nodes
| | | | | | | | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | | | | | | | LDISTRIB_SEGMENTED
| | | | | | | | | | +-- Inner -> STORAGE ACCESS for movie1 [Cost: 5.000000, Rows: 1733.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 15)
| | | | | | | | | | | Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 13864.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
| | | | | | | | | | | Projection: movielens_test.movie_b0
| | | | | | | | | | | Materialize: movie1.id
| | | | | | | | | | | Execute on: All Nodes
| | | | | | | | | | | Sort Key: (movie.id, movie.title, movie.year, movie.imdb_id, movie.rotten_tomatoes_id, movie.rotten_tomatoes_critic_score, movie.rotten_tomatoes_audience_score, movie.budget, movie.gross, movie.mpaa_rating, movie.runtime, movie.action, movie.adventure, movie.animation, movie.childrens, movie.comedy, movie.crime, movie.documentary, movie.drama, movie.fantasy, movie.film_noir, movie.horror, movie.musical, movie.mystery, movie.romance, movie.sci_fi, movie.thriller, movie.war, movie.western, movie.is_usa, movie.num_actors, movie.num_ratings)
| | | | | | | | | | | LDISTRIB_SEGMENTED
| | | | | | | | +-- Inner -> STORAGE ACCESS for ml_user1 [Cost: 5.000000, Rows: 5881.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 16)
| | | | | | | | | Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 47048.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
| | | | | | | | | Projection: movielens_test.ml_user_b0
| | | | | | | | | Materialize: ml_user1.id
| | | | | | | | | Execute on: All Nodes
| | | | | | | | | Sort Key: (ml_user.id, ml_user.gender, ml_user.age_range, ml_user.occupation, ml_user.zipcode, ml_user.num_ratings)
| | | | | | | | | LDISTRIB_SEGMENTED
| | | | | | | +-- Inner -> STORAGE ACCESS for rates3 [Cost: 59.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 17)
| | | | | | | | Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 0.000000 Memory(B): 1813800.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
| | | | | | | | Projection: movielens_test.rates_b0
| | | | | | | | Materialize: rates3.id, rates3.ml_user_id, rates3.movie_id
| | | | | | | | Execute on: All Nodes
| | | | | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | | | | LDISTRIB_SEGMENTED
| | | | | | +-- Inner -> STORAGE ACCESS for movie2 [Cost: 5.000000, Rows: 1733.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 18)
| | | | | | | Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 13864.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
| | | | | | | Projection: movielens_test.movie_b0
| | | | | | | Materialize: movie2.id
| | | | | | | Execute on: All Nodes
| | | | | | | Sort Key: (movie.id, movie.title, movie.year, movie.imdb_id, movie.rotten_tomatoes_id, movie.rotten_tomatoes_critic_score, movie.rotten_tomatoes_audience_score, movie.budget, movie.gross, movie.mpaa_rating, movie.runtime, movie.action, movie.adventure, movie.animation, movie.childrens, movie.comedy, movie.crime, movie.documentary, movie.drama, movie.fantasy, movie.film_noir, movie.horror, movie.musical, movie.mystery, movie.romance, movie.sci_fi, movie.thriller, movie.war, movie.western, movie.is_usa, movie.num_actors, movie.num_ratings)
| | | | | | | LDISTRIB_SEGMENTED
| | | | | +-- Inner -> STORAGE ACCESS for rates4 [Cost: 60.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PUSHED GROUPING) Partial GroupBy: rates4.movie_id,rates4.id Partial Aggs: sum_float(<SVAR>),count(<SVAR>) (PATH ID: 19)
| | | | | | Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 196608.000000 Memory(B): 1813784.000212 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
| | | | | | Projection: movielens_test.rates_b0
| | | | | | Materialize: rates4.rating, rates4.id, rates4.movie_id
| | | | | | Filter: (rates4.rating IS NOT NULL)/* sel=0.999974 ndv= 500 */
| | | | | | Execute on: All Nodes
| | | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | | LDISTRIB_SEGMENTED
| +-- Inner -> SELECT [Cost: 41.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 20)
| | Execute on: All Nodes
| | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | LDISTRIB_UNSEGMENTED
| | +---> STORAGE ACCESS for rates1 [Cost: 41.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 21)
| | | Column Cost Aspects: [ Disk(B): 393216.000000 CPU(B): 196608.000000 Memory(B): 1209184.000212 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
| | | Projection: movielens_test.rates_b0
| | | Materialize: rates1.rating, rates1.id
| | | Filter: (rates1.rating IS NOT NULL)/* sel=0.999974 ndv= 500 */
| | | Execute on: All Nodes
| | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | LDISTRIB_SEGMENTED


------------------------------

Best answer

First of all, I see far too many NO STATISTICS markers in your explain plan. That is a bad sign, and you should fix it.
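A minimal sketch of collecting statistics and then checking whether they were actually recorded. The schema name `movielens_test` is taken from the projection names in the plan; the three table names come from the query. The check reads `v_catalog.projection_columns`, where `statistics_type` shows what kind of statistics each column has.

```sql
-- Collect statistics for the tables used by the query.
SELECT ANALYZE_STATISTICS('movielens_test.rates');
SELECT ANALYZE_STATISTICS('movielens_test.movie');
SELECT ANALYZE_STATISTICS('movielens_test.ml_user');

-- Verify what the catalog now holds for each projection column.
SELECT projection_name,
       table_column_name,
       statistics_type,
       statistics_updated_timestamp
FROM v_catalog.projection_columns
WHERE table_schema = 'movielens_test'
  AND table_name IN ('rates', 'movie', 'ml_user')
ORDER BY projection_name, table_column_name;
```

If the plan still reports NO STATISTICS after this, it is worth confirming that the projection named in the plan (for example `rates_b0`) is the one listed here with up-to-date statistics.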

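Do you see the order of the tables in the joins? Hash joins are being created, and you are doing a full table scan on your biggest table. Fix this by making each hash join work as small table joined to big table rather than big table joined to small table.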
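A small check for the join order, offered as a sketch: run EXPLAIN on a single join taken from the query and look at which input is labelled `Inner`. In a Vertica hash join the hash table is built from the inner input, so the smaller table (here `movie`, about 1,733 rows according to the plan) is the one that should ideally appear there.

```sql
-- In the output, look for the "Inner -> STORAGE ACCESS for m" line:
-- the Inner side is the one the hash table is built from.
EXPLAIN
SELECT r.id
FROM movielens_test.rates r
JOIN movielens_test.movie m
  ON m.id = r.movie_id;
```

In the full plan above, PATH ID 13 already has `movie1` as the inner input, whereas PATH ID 11 uses the result of another join (estimated at 75,575 rows) as its inner input.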

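  1. Run the DBD (Database Designer)
  2. Run ANALYZE_STATISTICS
  3. Run EXPLAIN on the query and make sure the projections you expect are the ones used to answer it
  4. Check whether your movielens_test.rates table can be partitioned
    • If you run on a single node, MPP will not be used, and MPP is a big win
    • Run PROFILE against your query and post the result
    • Make sure the right data encoding is applied in your DDL, as well as the right ORDER BY columns, so that the projections better fit the query you run (I believe the DBD does this, but I always check)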
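The projection-related steps above could be sketched as follows. The projection names `rates_by_movie` and `movie_replicated` are hypothetical, and the choice of sort order, RLE encoding and segmentation key is an assumption made for illustration, not a recommendation from the answer. `KSAFE 1` assumes the cluster runs at K-safety 1, which the `_b0` suffix of the existing projections suggests but does not prove.

```sql
-- List the projections that exist and whether they are usable.
SELECT projection_name,
       anchor_table_name,
       is_segmented,
       is_up_to_date,
       has_statistics
FROM v_catalog.projections
WHERE projection_schema = 'movielens_test';

-- Hypothetical projection on rates, sorted on the join columns,
-- with run-length encoding on the leading, low-cardinality sort column.
CREATE PROJECTION movielens_test.rates_by_movie
(
    movie_id ENCODING RLE,
    ml_user_id,
    id,
    rating
)
AS SELECT movie_id, ml_user_id, id, rating
   FROM movielens_test.rates
   ORDER BY movie_id, ml_user_id
   SEGMENTED BY HASH(movie_id) ALL NODES KSAFE 1;

-- Hypothetical replicated projection for the small movie table, so that
-- joins against it need no resegmentation across nodes.
CREATE PROJECTION movielens_test.movie_replicated
AS SELECT id
   FROM movielens_test.movie
   ORDER BY id
   UNSEGMENTED ALL NODES;

-- Populate the new projections, then re-run EXPLAIN to see whether
-- the optimizer picks them.
SELECT REFRESH('movielens_test.rates, movielens_test.movie');
```

After the refresh, the `Projection:` lines in the EXPLAIN output show whether the new projections are actually chosen and whether the RESEGMENT annotations on the joins have gone away.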

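One last point, something I always do:

Open the database log and watch it while the query runs. If your data is spilling to disk, that may be your problem, because the data being sorted is larger than the memory allocated to the query.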
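Besides reading the log, spills can also be looked up in the `v_monitor.query_events` system table. A minimal sketch, assuming the query was run recently enough for its events to still be retained:

```sql
-- Recent events where a hash join or GROUP BY ran out of memory
-- and spilled to disk.
SELECT event_timestamp,
       node_name,
       event_type,
       event_description,
       suggested_action
FROM v_monitor.query_events
WHERE event_type IN ('GROUP_BY_SPILLED', 'JOIN_SPILLED')
ORDER BY event_timestamp DESC
LIMIT 50;
```

An empty result suggests that spilling is probably not the cause and that the time is going elsewhere, for example into the size of the intermediate join results.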

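Another option is to create a pre-join projection for your first subquery. Do this only if your data does not change much, because pre-join projections perform very badly when data is loaded.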

Source: the original question, "sql - How to speed up a slow multi-join query in a small Vertica database (~120K rows total, 10 minutes)", is on Stack Overflow: https://stackoverflow.com/questions/21684140/
