Python和MYSQL性能: Writing a massive number of SQL query results to file-6ren

Python和MYSQL性能: Writing a massive number of SQL query results to file

转载作者：行者123 更新时间：2023-11-29 12:47:31

25

4

我有一个文件，其中每一行都包含一个值字典，我抓取并使用它来使用每个键作为查询来查询 mysql 数据库。每个查询的结果都会放置在一个字典中，一旦生成了查询字典的所有值，该行就会被写出。

IN > foo bar someotherinfo {1: 'query_val', 2: 'query_val', 3: 'query_val'
OUT > foo bar someotherinfo 1_result 2_result 3_result

整个过程似乎有点慢，因为我每个文件执行大约 200,000 个 mysql 查询，每个样本大约有 10 个文件，总共大约 30 个样本，所以我希望加快整个过程。

我只是想知道 fileIO 是否会造成瓶颈。我不是在返回的每个结果字典后写入 line_info (foo,bar,somblah) ，而是在将这些结果批量写入文件之前将它们分块到内存中会更好吗？

或者这只是一个必须等待的情况......？

Example Input line and output line
INPUT
 XM_006557349.1  1       -       exon    XM_006557349.1_exon_2   10316   10534   {1: 10509:10534', 2: '10488:10508', 3: '10467:10487', 4: '10446:10466', 5: '10425:10445', 6: '10404:10424', 7: '10383:10403', 8: '10362:10382', 9: '10341:10361', 10: '10316:10340'}
OUTPUT
 XM_006557349.1  1       -       exon    XM_006557349.1_exon_2   10316   105340.7083  0.2945  0.2     0.2931  0.125   0.1154  0.2095  0.5833  0.0569  0.0508


CODE
def array_2_meth(sample,bin_type,type,cur_meth):                                        
        bins_in = open('bin_dicts/'+bin_type,'r')
        meth_out = open('meth_data/'+bin_type+'_'+sample+'_plus_'+type+'_meth.tsv','w')                                                                                           
        for line in bins_in.readlines():
                meth_dict = {}                                                                                                                                                                       
                # build array of data from each line
                array = line.strip('\n').split('\t')                                                                                                                             
                mrna_id = array[0]
                assembly = array[1] 
                strand = array[2]
                bin_dict = ast.literal_eval(array[7]) 
                for bin in bin_dict:
                        coords = bin_dict[bin].split(':')
                        start = int(coords[0]) -1
                        end = int(coords[1]) +1
                        cur_meth.execute('select sum(mc)/sum(h) from allc_'+str(sample)+'_'+str(assembly) + ' where strand = \'' +str(strand) +'\' and class = \''+str(type)+'\' and position between '+str(start)+' and ' +str(end) + ' and h >= 5')
                        for row in cur_meth.fetchall():
                                if str(row[0]) == 'None':
                                         meth_dict[bin] = 'no_cov'
                                else:
                                         meth_dict[bin] = float(row[0])
                 meth_out.write('\t'.join(array[:7]))
                 for k in sorted(meth_dict.keys()):
                        meth_out.write('\t'+str(meth_dict[k]))
                 meth_out.write('\n')
        meth_out.close()

不确定添加此代码是否会有很大帮助，但它应该显示我处理此问题的方式。您可以就我在方法中犯的错误或如何优化的提示提供任何建议将不胜感激!

谢谢^_^

最佳答案

我认为 fileIO 不应该花费太长时间，主要瓶颈可能是您正在进行的查询量。但从您提供的示例中，我看不到这些开始和结束位置的模式，因此我不知道如何减少您正在进行的查询量。

根据你的测试结果，我有一个可能令人惊奇或愚蠢的想法。(而且我不知道关于Python的shxt，所以忽略语法哈哈)

似乎每个查询都只会返回一个值？也许你可以尝试类似的事情

SQL = ''
for bin in bin_dict:
    coords = bin_dict[bin].split(':')
    start = int(coords[0]) -1
    end = int(coords[1]) +1
    SQL += 'select sum(mc)/sum(h) from allc_'+str(sample)+'_'+str(assembly) + ' where strand = \'' +str(strand) +'\' and    class = \''+str(type)+'\' and position between '+str(start)+' and ' +str(end) + ' and h >= 5'
    SQL += 'UNION ALL'
    //somehow remove the last UNION ALL at end of loop

cur_meth.execute(str(SQL))
for row in cur_meth.fetchall():
    //loop through the 10 row array and write to file

核心思想是使用 UNION ALL 将所有查询连接为 1，因此您只需要执行 1 个事务，而不是示例中显示的 10 个事务。您还可以将 10 个写入文件操作减少为 1 个。可能的缺点是 UNION ALL 可能会很慢，但据我所知，它不应该比 10 个单独的查询花费更多的处理时间正如您在我的示例中保留 SQL 格式一样。

第二个明显的方法是多线程。如果您没有使用机器的所有处理能力，您可能会尝试同时启动多个脚本/程序，因为您所做的只是查询数据并且不修改任何内容。这会导致单个脚本稍微慢一些，但总体上更快，因为它应该减少查询之间的等待时间。

关于Python和MYSQL性能: Writing a massive number of SQL query results to file，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/25236614/

25

4

0

文章推荐： performance - 为什么我的 postgreSQL 索引没有被使用？

文章推荐： ios - OpenGL格式纹理

文章推荐： ios - 无法将用户从 iOS 客户端注册到 Ejabberd 服务器 - XMPP

sql-server - 为什么 UPDATE .WRITE() 仅在我使用 [column].WRITE() 时才有效，而不是 [table].[column].WRITE()？
我正在执行 UPDATE .WRITE() 语句，并发现它显然只有在您像这样定义它时才有效: string sql = "UPDATE [dbo].[Table] SET [Column].WRITE
c - 使用一系列 write() 而不是单个 write()
我在 Unix 系统上用 C 编程。我知道: write(fd,"ABCD",4); 比这样做更好: write(fd, "A", 1); write(fd, "B", 1); write(fd, "
GoLang, "hash.Write", "write()"函数是从哪里来的？
func hash(s string) uint32 { h := fnv.New32a() h.Write([]byte(s)) return h.Sum32() } 对于这
javascript - Response.write 与 Document.write
在经典的 asp 页面中，有人告诉我您可以使用 vbscript 或 jscript。而 jscript 就是 javascript。所以我不确定 Response.Write、Response.W
javascript - stdin.write 抛出错误 : write EPIPE
当 openssl 子进程尝试 write() 到本地目录时，我收到此错误。在调用 write() 之前连接已关闭。它没有与 ssl 连接，因为我什至无法从 nodejs 文档启动示例代码。我错过了
java - writing 和 writing with flush 有什么区别？
最近我在试验netty。我遇到了以下问题: ctx.channel().write(new TextWebSocketFrame("hello")) 没有在客户端返回 hello，但是 ctx.cha
python - write 和 tempfile.write 的区别
请解释以下内容: def feed(data): import os print "DATA LEN: %s" % len(data) f = open("copy", "w") f.
.net - debug.write 和 Trace.write 有什么区别？
有什么区别debug.write 和 Trace.write ?每个应该什么时候使用？最佳答案在典型的发布构建配置中，Debug class 被禁用并且什么都不做。 Trace但是，仍然可以在发行
c# - Stream.Write 多次或连接字符串和 Stream.Write 一次更好吗？
我只是想知道，就性能而言，哪个更好(我在 FileStream 中使用 StreamWriter): 多次调用 Stream.Write(): StreamWriter sw = new Stream
c# - HttpResponse.Write 与 StringWriter.Write 有何不同？
我发现自己写给 stringwriter，然后在函数末尾执行 resp.Write(sw.ToString())。这是不必要的吗？如果我多次使用 HttpResponse.Write，即使我的页面是
javascript - win.document.write ('content' );无法读取未定义的属性 'write'
我正在尝试通过 JavaScript 文件从 electron 打开一个新窗口，它可以工作，并打开了新窗口，但我无法将 HTML/文本写入新文件。我收到那个错误: Cannot read proper
c++ - Qt QIODevice::write/QTcpSocket::write 和写入的字节
我们对 QIODevice::write 的一般行为和具体的 QTcpSocket 实现感到非常困惑。有一个 similar question已经，但答案并不令人满意。主要的混淆源于分别提到的 byt
fortran - Fortran 中 write(*,*) 和 write(6,*) 的区别
我知道这听起来像是一个愚蠢的问题: write(*,*) 和 write(6,*) ?我在我研究所的 super 计算机上运行一个复杂的代码，它通过一个不同于 6 的单元号输出一个数据文件，显然编译的
rust - 对我的可写类型使用 fmt::Write 或 io::Write trait？
我有一个结构体，它可以通过一系列复杂的方法调用转换为文本，其中包含大量 write!调用。此文本可以写入文件或调试日志。我正在决定是否使用 fmt::Write 或 io::Write .我不能真正使
c - 当我尝试使用 write (man 2 write) 函数时，为什么这段代码会卡住？
已关闭。这个问题是 not reproducible or was caused by typos 。目前不接受答案。这个问题是由拼写错误或无法再重现的问题引起的。虽然类似的问题可能是 on-top
`read()` 后面可以直接跟 `write()` 吗？ `write()` 后面可以直接跟 `read()` 吗？
In the C standard library, an output can't be followed by an input and vice versa. 对于Linux API，可以在re
javascript - 如何 document.write 然后在延迟 document.write 之后再次
我希望能够为一件事做 document.write。然后延迟半秒，然后再记录。写一些。你知道这是否可能吗？而且，如果是这样，怎么办？到目前为止，我已经尝试过了，但没有奏效: document.writ
javascript - 为什么从 onclick 属性调用的函数 write() 解析为 document.write()？
为什么通过 onclick 属性调用的 write() 函数解析为 document.write() 并替换文档？有什么办法可以阻止这种情况发生吗？ Write Function Alternat
Python 3 : write method vs. os.write 返回的字节数
我想创建一个包含多个“页面”的文本文件，并将每个页面的字节偏移量记录在一个单独的文件中。为此，我将字符串打印到主输出文件并使用 bytes_written += file.write(str) 计算字
c# - 一次构建一个大字符串并将其传递给 response.write 或为每个片段调用 response.write 是否更有效
关闭。这个问题需要更多focused .它目前不接受答案。想改进这个问题吗？更新问题，使其只关注一个问题 editing this post . 关闭 8 年前。 Improve this qu

首页

博学

6Ren·AI

商城

Python和MYSQL性能: Writing a massive number of SQL query results to file