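I have been trying to speed up a query for about a week now, and have asked several questions about it here (How can I speed up fetching the results after running an sqlite query?, Is it normal that sqlite.fetchall() is so slow?, How to use min() and max() in an efficient way?).
With the very useful help from the answers given there, I got the time down to 100.95 seconds for the sqlite query and 1485.43 seconds for fetchall. This was still not enough, so after trying out some different indexes I got the query time down to 0.08 seconds for one sample and the fetchall time down to 54.97 seconds. So I thought I had finally managed to speed things up enough.
Then the query ran for the next sample, taking 0.58 seconds, with fetchall taking 3952.80 seconds. For the third sample the query took 1.01 seconds and fetchall took 1970.67 seconds.
The first sample fetched 12951 rows, the second 24972 rows, and the third 6470 rows. I'm very curious why the first sample fetched its rows so much faster when it had only about half as many rows to fetch as the second.
Code (spectrumFeature_InputValues is (1,), (2,) or (3,), one for each of the 3 samples used):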
self.cursor.execute('begin')
self.cursor.execute("EXPLAIN QUERY PLAN "+
"SELECT precursor_id, feature_table_id "+
"FROM `MSMS_precursor` "+
"INNER JOIN `spectrum` ON spectrum.spectrum_id = MSMS_precursor.spectrum_spectrum_id "+
"INNER JOIN `feature` ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
"WHERE spectrum.scan_start_time BETWEEN feature.rtMin AND feature.rtMax "+
"AND MSMS_precursor.ion_mz BETWEEN feature.mzMin AND feature.mzMax "+
"AND feature.msrun_msrun_id = ?", spectrumFeature_InputValues)
print 'EXPLAIN QUERY PLAN: '
print self.cursor.fetchall()
import time
time0 = time.time()
self.cursor.execute("SELECT precursor_id, feature_table_id "+
"FROM `MSMS_precursor` "+
"INNER JOIN `spectrum` ON spectrum.spectrum_id = MSMS_precursor.spectrum_spectrum_id "+
"INNER JOIN `feature` ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
"WHERE spectrum.scan_start_time BETWEEN feature.rtMin AND feature.rtMax "+
"AND MSMS_precursor.ion_mz BETWEEN feature.mzMin AND feature.mzMax "+
"AND feature.msrun_msrun_id = ?", spectrumFeature_InputValues)
print 'query took:',time.time()-time0,'seconds'
time0 = time.time()
precursorFeatureIds = self.cursor.fetchall()
print 'it fetched:',len(precursorFeatureIds),'rows'
print 'fetchall took',time.time()-time0,'seconds'
time0 = time.time()
for precursorAndFeatureID in precursorFeatureIds:
    feature_has_MSMS_precursor_inputValues = (precursorAndFeatureID[0], precursorAndFeatureID[1])
    self.cursor.execute("INSERT INTO `feature_has_MSMS_precursor` VALUES(?,?)", feature_has_MSMS_precursor_inputValues)
print 'inserting took',time.time()-time0,'seconds'
self.connection.commit()
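The insert loop above can also be written as a single executemany() call. The following is a minimal sketch, assuming Python 3 and the standard sqlite3 module; the in-memory table is a trimmed-down stand-in and the sample tuples are made up for illustration. Inserting is not the bottleneck in the timings reported here, so this only tidies the code.

```python
import sqlite3

# Hypothetical, trimmed-down copy of the link table from the question.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE feature_has_MSMS_precursor ("
    "MSMS_precursor_precursor_id INT NOT NULL, "
    "feature_feature_table_id INT NOT NULL)"
)

# Stand-in for the (precursor_id, feature_table_id) tuples from fetchall().
precursor_feature_ids = [(1, 10), (2, 10), (3, 11)]

# One executemany() call replaces the explicit loop of execute() calls.
with connection:  # commits on success, rolls back if an exception is raised
    connection.executemany(
        "INSERT INTO feature_has_MSMS_precursor VALUES (?, ?)",
        precursor_feature_ids,
    )

inserted = connection.execute(
    "SELECT COUNT(*) FROM feature_has_MSMS_precursor"
).fetchone()[0]
print("inserted", inserted, "rows")
```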
Results:
EXPLAIN QUERY PLAN:
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.0754859447479 seconds
it fetched: 12951 rows
fetchall took 54.2855291367 seconds
inserting took 0.602859973907 seconds
It took 54.9704811573 seconds
EXPLAIN QUERY PLAN:
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.579694032669 seconds
it fetched: 24972 rows
fetchall took 3950.08093309 seconds
inserting took 2.11575508118 seconds
It took 3952.80745602 seconds
EXPLAIN QUERY PLAN:
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 1.01185703278 seconds
it fetched: 6470 rows
fetchall took 1970.622962 seconds
inserting took 0.673867940903 seconds
It took 1972.31343699 seconds
SQLite create statements:
-- -----------------------------------------------------
-- Table `feature`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `feature` (
`feature_table_id` INT PRIMARY KEY NOT NULL ,
`feature_id` VARCHAR(40) NOT NULL ,
`intensity` DOUBLE NOT NULL ,
`overallquality` DOUBLE NOT NULL ,
`charge` INT NOT NULL ,
`content` VARCHAR(45) NOT NULL ,
`intensity_cutoff` DOUBLE NOT NULL,
`mzMin` DOUBLE NULL ,
`mzMax` DOUBLE NULL ,
`rtMin` DOUBLE NULL ,
`rtMax` DOUBLE NULL ,
`msrun_msrun_id` INT NOT NULL ,
CONSTRAINT `fk_feature_msrun1`
FOREIGN KEY (`msrun_msrun_id` )
REFERENCES `msrun` (`msrun_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION);
CREATE INDEX `fk_mzMin_feature` ON `feature` (`mzMin` ASC);
CREATE INDEX `fk_mzMax_feature` ON `feature` (`mzMax` ASC);
CREATE INDEX `fk_rtMin_feature` ON `feature` (`rtMin` ASC);
CREATE INDEX `fk_rtMax_feature` ON `feature` (`rtMax` ASC);
DROP TABLE IF EXISTS `spectrum`;
-- -----------------------------------------------------
-- Table `spectrum`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `spectrum` (
`spectrum_id` INT PRIMARY KEY NOT NULL ,
`spectrum_index` INT NOT NULL ,
`ms_level` INT NOT NULL ,
`base_peak_mz` DOUBLE NOT NULL ,
`base_peak_intensity` DOUBLE NOT NULL ,
`total_ion_current` DOUBLE NOT NULL ,
`lowest_observes_mz` DOUBLE NOT NULL ,
`highest_observed_mz` DOUBLE NOT NULL ,
`scan_start_time` DOUBLE NOT NULL ,
`ion_injection_time` DOUBLE,
`binary_data_mz` BLOB NOT NULL,
`binary_data_rt` BLOB NOT NULL,
`msrun_msrun_id` INT NOT NULL ,
CONSTRAINT `fk_spectrum_msrun1`
FOREIGN KEY (`msrun_msrun_id` )
REFERENCES `msrun` (`msrun_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION);
CREATE INDEX `fk_spectrum_spectrum_id_1` ON `spectrum` (`spectrum_id` ASC);
CREATE INDEX `fk_spectrum_scahn_start_time_1` ON `spectrum` (`scan_start_time` ASC);
DROP TABLE IF EXISTS `feature_has_MSMS_precursor`;
-- -----------------------------------------------------
-- Table `feature_has_MSMS_precursor`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `feature_has_MSMS_precursor` (
`MSMS_precursor_precursor_id` INT NOT NULL ,
`feature_feature_table_id` INT NOT NULL ,
CONSTRAINT `fk_spectrum_has_feature_spectrum1`
FOREIGN KEY (`MSMS_precursor_precursor_id` )
REFERENCES `MSMS_precursor` (`precursor_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION,
CONSTRAINT `fk_spectrum_has_feature_feature1`
FOREIGN KEY (`feature_feature_table_id` )
REFERENCES `feature` (`feature_table_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION);
CREATE INDEX `fk_feature_has_MSMS_precursor_feature1` ON `feature_has_MSMS_precursor` (`feature_feature_table_id` ASC);
CREATE INDEX `fk_feature_has_MSMS_precursor_precursor1` ON `feature_has_MSMS_precursor` (`MSMS_precursor_precursor_id` ASC);
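As you can see, I have created indexes on the mz and rt values in both spectrum and feature, because I figured that most of the time is spent comparing those numbers.
So why is the first sample so much faster than the second and third? How does the query time relate to the fetchall time? Most importantly, is there a way I can speed this up?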
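On the relation between query time and fetchall time: SQLite computes result rows lazily as they are stepped through, so execute() can return long before the join has been evaluated, and most of the join cost tends to show up in fetchall(). The following is a minimal sketch of that effect, assuming Python 3 and the standard sqlite3 module; the points and ranges tables are hypothetical stand-ins, not the schema above, and the exact timings depend on the machine.

```python
import sqlite3
import time

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE points (x REAL)")
connection.execute("CREATE TABLE ranges (lo REAL, hi REAL)")
connection.executemany(
    "INSERT INTO points VALUES (?)", [(float(i),) for i in range(2000)]
)
connection.executemany(
    "INSERT INTO ranges VALUES (?, ?)",
    [(float(i), float(i) + 1.0) for i in range(2000)],
)

cursor = connection.cursor()

# execute() prepares the statement and produces at most the first row.
start = time.time()
cursor.execute("SELECT x, lo FROM points JOIN ranges ON x BETWEEN lo AND hi")
execute_seconds = time.time() - start

# fetchall() steps through the remaining rows, so the unindexed range join
# (2000 x 2000 candidate pairs) is paid for here.
start = time.time()
rows = cursor.fetchall()
fetch_seconds = time.time() - start

print("execute took", execute_seconds, "seconds")
print("fetchall took", fetch_seconds, "seconds for", len(rows), "rows")
```

Seen this way, a short query time together with a long fetchall time points at the cost of the join itself rather than at the cost of moving rows into Python.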
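After talking with a colleague: it is probably because comparing a point against two dimensions (rtMin, rtMax, mzMin, mzMax) takes n^2 time. This roughly corresponds to the second fetchall taking a bit more than 60^2 seconds (60 seconds being approximately what the first fetchall took) while retrieving a little less than twice as many rows. This doesn't answer any of my questions, though.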
I tried using an R*Tree as advised in the comments. I made a new table:
CREATE VIRTUAL TABLE convexhull_edges USING rtree(
    feature_feature_table_id,
    rtMin, rtMax,
    mzMin, mzMax
);
and changed my query to:
self.cursor.execute("SELECT precursor_id, feature_table_id "+
"FROM `MSMS_precursor` "+
"INNER JOIN `spectrum` ON spectrum.spectrum_id = MSMS_precursor.spectrum_spectrum_id "+
"INNER JOIN `feature` ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
"INNER JOIN `convexhull_edges` ON convexhull_edges.feature_feature_table_id = feature.feature_table_id "
"WHERE spectrum.scan_start_time BETWEEN convexhull_edges.rtMin AND convexhull_edges.rtMax "+
"AND MSMS_precursor.ion_mz BETWEEN convexhull_edges.mzMin AND convexhull_edges.mzMax "+
"AND feature.msrun_msrun_id = ?", spectrumFeature_InputValues)
This gave the following results:
EXPLAIN QUERY PLAN:
[(0, 0, 3, u'SCAN TABLE convexhull_edges VIRTUAL TABLE INDEX 2: (~0 rows)'), (0, 1, 2, u'SEARCH TABLE feature USING INDEX sqlite_autoindex_feature_1 (feature_table_id=?) (~1 rows)'), (0, 2, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 3, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.0572800636292 seconds
it fetched: 13140 rows
fetchall took 34.4445540905 seconds
EXPLAIN QUERY PLAN:
[(0, 0, 3, u'SCAN TABLE convexhull_edges VIRTUAL TABLE INDEX 2: (~0 rows)'), (0, 1, 2, u'SEARCH TABLE feature USING INDEX sqlite_autoindex_feature_1 (feature_table_id=?) (~1 rows)'), (0, 2, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 3, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.819370031357 seconds
it fetched: 25402 rows
fetchall took 3625.72873998 seconds
EXPLAIN QUERY PLAN:
[(0, 0, 3, u'SCAN TABLE convexhull_edges VIRTUAL TABLE INDEX 2: (~0 rows)'), (0, 1, 2, u'SEARCH TABLE feature USING INDEX sqlite_autoindex_feature_1 (feature_table_id=?) (~1 rows)'), (0, 2, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 3, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.878498077393 seconds
it fetched: 6761 rows
fetchall took 1419.34246588 seconds
inserting took 0.340960025787 seconds
It took 1420.56637716 seconds
So it's a bit faster than my previous approach, but still not fast enough. Next I'm going to try web_bod's solution.
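In the plans above, convexhull_edges is scanned as the outermost loop, so the R*Tree is never probed with a specific point. The following is a minimal sketch of the other direction, looking up the rectangles that contain one (rt, mz) point. It assumes Python 3, an SQLite build with the R*Tree module enabled, and made-up rectangle values; features_containing is a hypothetical helper, and whether this reshaping helps the full query here is untested.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
# Requires an SQLite build that includes the R*Tree module.
connection.execute(
    "CREATE VIRTUAL TABLE convexhull_edges USING rtree("
    "feature_feature_table_id, rtMin, rtMax, mzMin, mzMax)"
)
# Made-up rectangles: (id, rtMin, rtMax, mzMin, mzMax).
connection.executemany(
    "INSERT INTO convexhull_edges VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10.0, 20.0, 400.0, 401.0),
        (2, 15.0, 25.0, 400.5, 402.0),
        (3, 30.0, 40.0, 500.0, 501.0),
    ],
)


def features_containing(rt, mz):
    """Return the ids of the rectangles that contain the point (rt, mz)."""
    rows = connection.execute(
        "SELECT feature_feature_table_id FROM convexhull_edges "
        "WHERE rtMin <= ? AND rtMax >= ? AND mzMin <= ? AND mzMax >= ? "
        "ORDER BY feature_feature_table_id",
        (rt, rt, mz, mz),
    ).fetchall()
    return [row[0] for row in rows]


matches = features_containing(18.0, 400.75)
print(matches)
```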
Using web_bod's solution I got the following timings:
EXPLAIN QUERY PLAN:
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.0521960258484 seconds
it fetched: 13052 rows
fetchall took 90.5810132027 seconds
EXPLAIN QUERY PLAN:
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.278959989548 seconds
it fetched: 25195 rows
fetchall took 4310.6012361 seconds
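Sadly, the third one did not finish because of a reboot. So this is a bit faster than my first solution, but slower than using the R*Tree.
While working on a different query that was running very slowly, I saw that it was going into uninterruptible sleep (see this question). So I checked top while running this query, and it switches between the R and D states, lowering the CPU usage from 100% to 50%. This could be why it runs so slowly with all of the solutions provided.
I migrated to MySQL but got the same results.
Best answer
Execution time grows geometrically, not arithmetically, with the number of rows in each table, e.g.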
3 tables with 10 rows each => 1,000 comparisons
3 tables with 10, 10 and 40 rows => 4,000 comparisons
3 tables with 20 rows each => 8,000 comparisons
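The arithmetic behind these three figures is simply the product of the row counts, which is the worst case for a nested-loop join when no index narrows any of the loops. A minimal sketch, assuming Python 3.8 or later for math.prod; worst_case_comparisons is a hypothetical helper name:

```python
import math


def worst_case_comparisons(row_counts):
    """Row combinations a nested-loop join visits when no index helps."""
    return math.prod(row_counts)


for counts in [(10, 10, 10), (10, 10, 40), (20, 20, 20)]:
    print(counts, "=>", worst_case_comparisons(counts), "comparisons")
```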
You could probably refactor the query to avoid some of the joins/cursors. When do you need the answer?
Could you do something like this:
SELECT precursor_id, feature_table_id
FROM MSMS_precursor
INNER JOIN
(
    -- spectrum.msrun_msrun_id is qualified because both joined sources expose that column
    SELECT mzMin, mzMax, rtMin, rtMax, spectrum_id, feature_table_id, spectrum.msrun_msrun_id
    FROM spectrum
    INNER JOIN
    (select feature_table_id, mzMin, mzMax, rtMin, rtMax, msrun_msrun_id
     from feature
     where feature.msrun_msrun_id = 'value'
    ) subquery
    ON subquery.msrun_msrun_id = spectrum.msrun_msrun_id
    WHERE
    spectrum.scan_start_time BETWEEN subquery.rtMin AND subquery.rtMax
) subquery
ON subquery.spectrum_id = MSMS_precursor.spectrum_spectrum_id
WHERE
MSMS_precursor.ion_mz BETWEEN subquery.mzMin AND subquery.mzMax
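Using subqueries lets you reduce the number of comparisons between the tables: you can quickly filter out the unwanted features, then the unrelated spectra, before searching for suitable precursors.
I don't use SQLite, but the principle should still apply.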
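The nested form can be checked against the flat three-way join on a small data set. The following is a minimal sketch, assuming Python 3 and the standard sqlite3 module; the tables are trimmed to the columns the query touches, the rows are made up, and the aliases run_features and candidate_spectra are illustrative names rather than the answer's. The run id is bound with ? instead of the 'value' placeholder.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript("""
CREATE TABLE feature (feature_table_id INT PRIMARY KEY, mzMin DOUBLE, mzMax DOUBLE,
                      rtMin DOUBLE, rtMax DOUBLE, msrun_msrun_id INT);
CREATE TABLE spectrum (spectrum_id INT PRIMARY KEY, scan_start_time DOUBLE,
                       msrun_msrun_id INT);
CREATE TABLE MSMS_precursor (precursor_id INT PRIMARY KEY, ion_mz DOUBLE,
                             spectrum_spectrum_id INT);
INSERT INTO feature VALUES (1, 400.0, 401.0, 10.0, 20.0, 1);
INSERT INTO feature VALUES (2, 500.0, 501.0, 10.0, 20.0, 2);
INSERT INTO spectrum VALUES (1, 15.0, 1);
INSERT INTO spectrum VALUES (2, 50.0, 1);
INSERT INTO spectrum VALUES (3, 15.0, 2);
INSERT INTO MSMS_precursor VALUES (1, 400.5, 1);
INSERT INTO MSMS_precursor VALUES (2, 450.0, 1);
INSERT INTO MSMS_precursor VALUES (3, 400.5, 2);
INSERT INTO MSMS_precursor VALUES (4, 500.5, 3);
""")

# The flat join from the question.
FLAT = """
SELECT precursor_id, feature_table_id
FROM MSMS_precursor
INNER JOIN spectrum ON spectrum.spectrum_id = MSMS_precursor.spectrum_spectrum_id
INNER JOIN feature ON feature.msrun_msrun_id = spectrum.msrun_msrun_id
WHERE spectrum.scan_start_time BETWEEN feature.rtMin AND feature.rtMax
  AND MSMS_precursor.ion_mz BETWEEN feature.mzMin AND feature.mzMax
  AND feature.msrun_msrun_id = ?
ORDER BY precursor_id, feature_table_id
"""

# The nested form: features of one run, then their spectra, then precursors.
NESTED = """
SELECT precursor_id, feature_table_id
FROM MSMS_precursor
INNER JOIN (
    SELECT mzMin, mzMax, spectrum_id, feature_table_id
    FROM spectrum
    INNER JOIN (
        SELECT feature_table_id, mzMin, mzMax, rtMin, rtMax, msrun_msrun_id
        FROM feature
        WHERE feature.msrun_msrun_id = ?
    ) AS run_features
    ON run_features.msrun_msrun_id = spectrum.msrun_msrun_id
    WHERE spectrum.scan_start_time BETWEEN run_features.rtMin AND run_features.rtMax
) AS candidate_spectra
ON candidate_spectra.spectrum_id = MSMS_precursor.spectrum_spectrum_id
WHERE MSMS_precursor.ion_mz BETWEEN candidate_spectra.mzMin AND candidate_spectra.mzMax
ORDER BY precursor_id, feature_table_id
"""

flat_rows = connection.execute(FLAT, (1,)).fetchall()
nested_rows = connection.execute(NESTED, (1,)).fetchall()
print(flat_rows, nested_rows)
```

Both forms return the same rows. The query plan the asker reports for this form is identical to the plan for the original query, which suggests SQLite treated the two forms the same way in that case.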
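Update: fixed errors in the SQL
Notes:
You don't have to worry about the ANDs; you will only get:
Update, May 18:
It's the indexes! You have indexes on the search fields, but not on the fields participating in the joins. Foreign key indexes really boost performance: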
CREATE INDEX `fk_msrun_msrun_id_feature` ON `feature` (`msrun_msrun_id` ASC);
CREATE INDEX `fk_spectrum_spectrum_id_feature` ON `feature` (`msrun_msrun_id` ASC);
CREATE INDEX `fk_spectrum_spectrum_id_MSMS_precursor` ON `MSMS_precursor` (`spectrum_spectrum_id` ASC);
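Whether such an index is actually picked up can be checked with EXPLAIN QUERY PLAN before and after creating it. The following is a minimal sketch, assuming Python 3 and the standard sqlite3 module, with tables trimmed to the relevant columns. Note that the second statement above indexes the same `feature` column as the first; the index on `spectrum` (`msrun_msrun_id`) below, and its name, are an assumption about what was intended. The exact plan wording varies between SQLite versions.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript("""
CREATE TABLE feature (feature_table_id INT PRIMARY KEY NOT NULL,
                      msrun_msrun_id INT NOT NULL);
CREATE TABLE spectrum (spectrum_id INT PRIMARY KEY NOT NULL,
                       msrun_msrun_id INT NOT NULL);
""")

QUERY = "SELECT feature_table_id FROM feature WHERE msrun_msrun_id = ?"


def query_plan(sql, params):
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(str(row[-1]) for row in rows)


plan_before = query_plan(QUERY, (1,))

connection.execute(
    "CREATE INDEX fk_msrun_msrun_id_feature ON feature (msrun_msrun_id ASC)"
)
# Assumption: an index on the spectrum side of the join was intended as well.
connection.execute(
    "CREATE INDEX fk_msrun_msrun_id_spectrum ON spectrum (msrun_msrun_id ASC)"
)

plan_after = query_plan(QUERY, (1,))
print("before:", plan_before)
print("after: ", plan_after)
```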
Source: "python - Why is the same SQLite query 30 times slower when fetching only twice as many results?" on Stack Overflow: https://stackoverflow.com/questions/10531898/