gpt4 book ai didi

mysql - 有没有办法优化这个mysql查询(更新,多重连接)?

转载 作者:行者123 更新时间:2023-11-29 10:52:02 26 4
gpt4 key购买 nike

我有一个查询正在截断的数据集上执行我想要的操作,但是当我在完整数据集(数百万行)上运行它时,它需要永远运行。

我有两个表 - microsat_table 和coverage_table。

微型卫星表:

+----+----------+-----------+---------+-------------------------------------------------+
| id | Seq_Name | SSR_Start | SSR_End | Sequence |
+----+----------+-----------+---------+-------------------------------------------------+
| 2 | chr2L | 11050 | 11067 | TTTAATTTAATTTAATTT |
| 3 | chr2L | 44173 | 44187 | TATGTATGTATGTAT |
| 5 | chr2L | 54431 | 54477 | ATAATAATATAATATAATATAATATAATATATAATAATATAATAATA |
| 6 | chr2L | 57571 | 57594 | ATATATATATATATATATATATAT |
| 7 | chr2L | 72439 | 72453 | CATACATACATACAT |
| 8 | chr2L | 74028 | 74042 | ATACATACATACATA |
| 9 | chr2L | 85573 | 85587 | ATTTTATTTTATTTT |
| 10 | chr2L | 92429 | 92443 | ACATACATACATACA |
| 11 | chr2L | 138132 | 138166 | TATATAGATATATAAATATATATATATATATATAT |
| 13 | chr2L | 162245 | 162259 | ATACATACATACATA |
+----+----------+-----------+---------+-------------------------------------------------+

覆盖率表:

| Seq_Name | Start | Stop  | Coverage |
+----------+-------+-------+----------+
| chr2L | 5716 | 5771 | 1 |
| chr2L | 8730 | 8824 | 1 |
| chr2L | 9894 | 9948 | 1 |
| chr2L | 19391 | 19491 | 1 |
| chr2L | 19575 | 19675 | 1 |
| chr2L | 19773 | 19776 | 1 |
| chr2L | 19776 | 19872 | 2 |
| chr2L | 21920 | 21959 | 1 |
| chr2L | 21959 | 22020 | 2 |
| chr2L | 22020 | 22059 | 1 |
+----------+-------+-------+----------+

我想向 microsat_table 添加一列,用于计算覆盖表中的 Start 和 Stop 值落在 microsat_table 中的 SSR_Start 和 SSR_End 值范围内的所有行的平均覆盖范围(来自coverage_table)。

结果示例:

+-----+----------+-----------+---------+--------------------------------+---------+
| id | Seq_Name | SSR_Start | SSR_End | Sequence | avg |
+-----+----------+-----------+---------+--------------------------------+---------+
| 53 | chr2L | 402489 | 402503 | AAAACAAAACAAAAC | 3.0000 |
| 64 | chr2L | 447214 | 447233 | CAGCAGCAGCAGCAGCAGCA | 8.0000 |
| 66 | chr2L | 457839 | 457868 | CAGCAGCAGCAACAGCAGCAGCAGGCAGCA | 2.0000 |
| 105 | chr2L | 579589 | 579603 | TCGAATCGAATCGAA | 11.0000 |
| 123 | chr2L | 628484 | 628501 | TAATGTTAATGTTAATGT | 6.0000 |
+-----+----------+-----------+---------+--------------------------------+---------+

我的查询是:

UPDATE microsat_table
JOIN
(SELECT m.id, SUM(p.Coverage)/count(p.Start)
AS avg FROM microsat_table m
LEFT OUTER JOIN coverage_table p
ON m.Seq_Name LIKE p.Seq_Name
WHERE m.Seq_Name LIKE p.Seq_Name GROUP BY m.id) AS qt
ON microsat_table.id = qt.id
SET microsat_table.avg = qt.avg;

解释截断表的结果:

+----+-------------+----------------------+------------+-------+---------------------------------------------------+-------------+---------+--------------------------------+--------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------------------+------------+-------+---------------------------------------------------+-------------+---------+--------------------------------+--------+----------+----------------------------------------------------+
| 1 | UPDATE | microsat_table_short | NULL | ALL | PRIMARY | NULL | NULL | NULL | 40356 | 100.00 | NULL |
| 1 | PRIMARY | <derived2> | NULL | ref | <auto_key0> | <auto_key0> | 4 | testdb.microsat_table_short.id | 1236 | 100.00 | NULL |
| 2 | DERIVED | m | NULL | index | PRIMARY,Sequence,Seq_Name,Motif,SSR_Start,SSR_End | Seq_Name | 53 | NULL | 40356 | 100.00 | Using index; Using temporary; Using filesort |
| 2 | DERIVED | p | NULL | ALL | NULL | NULL | NULL | NULL | 100163 | 1.23 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+----------------------+------------+-------+---------------------------------------------------+-------------+---------+--------------------------------+--------+----------+----------------------------------------------------+

我添加了索引(包括尝试 HASH 和 BTREE 索引),这大大加快了速度,但我已经让它在更大的数据集上运行了 1.5 天,但它仍然没有完成。

有人对如何让它运行得更快有什么建议吗?

谢谢!!

最佳答案

您的代码中有一些相对较小的缺陷。然而,最大的问题是,虽然您说您想要计算“覆盖表中的 Start 和 Stop 值落在 microsat_table 中的 SSR_Start 和 SSR_End 值范围内的所有行的平均覆盖范围(来自coverage_table)”,但您并没有这样做实际上似乎限制了查询这样做。相反,您只在 Seq_Name 上编码了匹配项.

下面的代码尝试修复这个问题(我使用了 >=<= 这可能不是您所需要的)以及其他更多的小问题:

UPDATE microsat_table
JOIN
(
SELECT
m.id,
AVG(p.Coverage) AS avg -- MySQL has it's own average function
FROM
microsat_table m
INNER JOIN coverage_table p ON -- Change to INNER JOIN, your old WHERE clause had this effect anyway
m.Seq_Name = p.Seq_Name -- Use '=' not 'Like' when looking for an exact match
WHERE
p.Start >= m.SSR_Start -- This WHERE clause is the most important change
AND p.End <= m.SSR_End -- You omitted it in your version
GROUP BY
m.id) AS qt
ON microsat_table.id = qt.id
SET microsat_table.avg = qt.avg;

关于mysql - 有没有办法优化这个mysql查询(更新,多重连接)?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/43555262/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com