
python - Memory usage when reading a large file line by line into Python 2.7


Hi all,

I'm working on a genomics project that involves some large files (10-50 GB) which I want to read into Python 2.7 for processing. I don't need to read an entire file into memory; I simply want to read each file line by line, do a small task, and move on.

I found similar SO questions and tried to implement a couple of their solutions:

Efficient reading of 800 GB XML file in Python 2.7

How to read large file, line by line in python

When I run the following scripts on a 17 GB file:

Script 1 (itertools):

#!/usr/bin/env python2

import sys
import string
import os
import itertools

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    with open("BigFile") as f:
        for line in iter(f):
            posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))

Script 2 (fileinput):

#!/usr/bin/env python2

import sys
import string
import os
import fileinput

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    for line in fileinput.input(['BigFile']):
        posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))

Script 3 (lines):

#!/usr/bin/env python2

import sys
import string
import os

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    with open("BigFile") as f:
        for line in f:
            posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))

Script 4 (yield):

#!/usr/bin/env python2

import sys
import string
import os

def readInChunks(fileObj, chunkSize=30):
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        yield data

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    f = open('BigFile')
    for chunk in readInChunks(f):
        posList.append(chunk.strip())
    f.close()
    sys.stdout.write(str(sys.getsizeof(posList)))

For the 17 GB file, the final list in Python is about 5 GB [according to sys.getsizeof()], but according to 'top' each script uses more than 43 GB of memory.

My question is: why is the memory usage so much higher than either the input file or the final list? If the final list is only 5 GB, and the 17 GB file is being read line by line, why does each script end up using ~43 GB of memory? Is there a better way to read large files without a memory leak (if that is what this is)?

Many thanks.

EDIT:

Output of "/usr/bin/time -v python script3.py":

Command being timed: "python script3.py"
User time (seconds): 159.65
System time (seconds): 21.74
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:01.96
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 181246448
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 10182731
Voluntary context switches: 315
Involuntary context switches: 16722
Swaps: 0
File system inputs: 33831512
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Output from top:

15816   user    20  0   727m    609m    2032    R   76.8    0.5 0:02.31 python
15816 user 20 0 1541m 1.4g 2032 R 99.6 1.1 0:05.31 python
15816 user 20 0 2362m 2.2g 2032 R 99.6 1.7 0:08.31 python
15816 user 20 0 3194m 3.0g 2032 R 99.6 2.4 0:11.31 python
15816 user 20 0 4014m 3.8g 2032 R 99.6 3 0:14.31 python
15816 user 20 0 4795m 4.6g 2032 R 99.6 3.6 0:17.31 python
15816 user 20 0 5653m 5.3g 2032 R 99.6 4.2 0:20.31 python
15816 user 20 0 6457m 6.1g 2032 R 99.3 4.9 0:23.30 python
15816 user 20 0 7260m 6.9g 2032 R 99.6 5.5 0:26.30 python
15816 user 20 0 8085m 7.7g 2032 R 99.9 6.1 0:29.31 python
15816 user 20 0 8809m 8.5g 2032 R 99.6 6.7 0:32.31 python
15816 user 20 0 9645m 9.3g 2032 R 99.3 7.4 0:35.30 python
15816 user 20 0 10.3g 10g 2032 R 99.6 8 0:38.30 python
15816 user 20 0 11.1g 10g 2032 R 100 8.6 0:41.31 python
15816 user 20 0 11.8g 11g 2032 R 99.9 9.2 0:44.32 python
15816 user 20 0 12.7g 12g 2032 R 99.3 9.9 0:47.31 python
15816 user 20 0 13.4g 13g 2032 R 99.6 10.5 0:50.31 python
15816 user 20 0 14.3g 14g 2032 R 99.9 11.1 0:53.32 python
15816 user 20 0 15.0g 14g 2032 R 99.3 11.7 0:56.31 python
15816 user 20 0 15.9g 15g 2032 R 99.9 12.4 0:59.32 python
15816 user 20 0 16.6g 16g 2032 R 99.6 13 1:02.32 python
15816 user 20 0 17.3g 17g 2032 R 99.6 13.6 1:05.32 python
15816 user 20 0 18.2g 17g 2032 R 99.9 14.2 1:08.33 python
15816 user 20 0 18.9g 18g 2032 R 99.6 14.9 1:11.33 python
15816 user 20 0 19.9g 19g 2032 R 100 15.5 1:14.34 python
15816 user 20 0 20.6g 20g 2032 R 99.3 16.1 1:17.33 python
15816 user 20 0 21.3g 21g 2032 R 99.6 16.7 1:20.33 python
15816 user 20 0 22.3g 21g 2032 R 99.9 17.4 1:23.34 python
15816 user 20 0 23.0g 22g 2032 R 99.6 18 1:26.34 python
15816 user 20 0 23.7g 23g 2032 R 99.6 18.6 1:29.34 python
15816 user 20 0 24.4g 24g 2032 R 99.6 19.2 1:32.34 python
15816 user 20 0 25.4g 25g 2032 R 99.3 19.9 1:35.33 python
15816 user 20 0 26.1g 25g 2032 R 99.9 20.5 1:38.34 python
15816 user 20 0 26.8g 26g 2032 R 99.9 21.1 1:41.35 python
15816 user 20 0 27.4g 27g 2032 R 99.6 21.7 1:44.35 python
15816 user 20 0 28.5g 28g 2032 R 99.6 22.3 1:47.35 python
15816 user 20 0 29.2g 28g 2032 R 99.9 22.9 1:50.36 python
15816 user 20 0 29.9g 29g 2032 R 99.6 23.5 1:53.36 python
15816 user 20 0 30.5g 30g 2032 R 99.6 24.1 1:56.36 python
15816 user 20 0 31.6g 31g 2032 R 99.6 24.7 1:59.36 python
15816 user 20 0 32.3g 31g 2032 R 100 25.3 2:02.37 python
15816 user 20 0 33.0g 32g 2032 R 99.6 25.9 2:05.37 python
15816 user 20 0 33.7g 33g 2032 R 99.6 26.5 2:08.37 python
15816 user 20 0 34.3g 34g 2032 R 99.6 27.1 2:11.37 python
15816 user 20 0 35.5g 34g 2032 R 99.6 27.7 2:14.37 python
15816 user 20 0 36.2g 35g 2032 R 99.6 28.4 2:17.37 python
15816 user 20 0 36.9g 36g 2032 R 100 29 2:20.38 python
15816 user 20 0 37.5g 37g 2032 R 99.6 29.6 2:23.38 python
15816 user 20 0 38.2g 38g 2032 R 99.6 30.2 2:26.38 python
15816 user 20 0 38.9g 38g 2032 R 99.6 30.8 2:29.38 python
15816 user 20 0 40.1g 39g 2032 R 100 31.4 2:32.39 python
15816 user 20 0 40.8g 40g 2032 R 99.6 32 2:35.39 python
15816 user 20 0 41.5g 41g 2032 R 99.6 32.6 2:38.39 python
15816 user 20 0 42.2g 41g 2032 R 99.9 33.2 2:41.40 python
15816 user 20 0 42.8g 42g 2032 R 99.6 33.8 2:44.40 python
15816 user 20 0 43.4g 43g 2032 R 99.6 34.3 2:47.40 python
15816 user 20 0 43.4g 43g 2032 R 100 34.3 2:50.41 python
15816 user 20 0 38.6g 38g 2032 R 100 30.5 2:53.43 python
15816 user 20 0 24.9g 24g 2032 R 99.7 19.6 2:56.43 python
15816 user 20 0 12.0g 11g 2032 R 100 9.4 2:59.44 python

EDIT 2:

To clarify further, here is an expanded description of the problem. What I'm doing is reading in a list of positions (Contig1/1, Contig1/2, etc.) from a FASTA file and turning it into a dictionary filled with N's via:

keys = posList
values = ['N'] * len(posList)
speciesDict = dict(zip(keys, values))
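
A smaller variant I could use for this step (assuming every position simply starts out as 'N') is dict.fromkeys, which skips the intermediate keys/values lists:

# Same speciesDict, built without materializing ['N'] * len(posList) first
speciesDict = dict.fromkeys(posList, 'N')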

Then, for each of several species, I read that species' pileup file line by line (the same problem will exist there) and get the final base calls via:

with open (path+'/'+os.path.basename(path)+'.pileups',"r") as filein:
    for line in iter(filein):
        splitline=line.split()
        if len(splitline)>4:
            node,pos,ref,num,bases,qual=line.split()
            loc=node+'/'+pos
            cleanBases=getCleanList(ref,bases)
            finalBase=getFinalBase_Pruned(cleanBases,minread,thresh)
            speciesDict[loc] = finalBase

Because the pileup files for the different species differ in length and order, I'm building the list as a "common garden" way of storing the data for each species. If no data is available for a given site in a particular species, that site gets an 'N' call; otherwise a base is assigned to the site in the dictionary.

The end result is a file for each species that is ordered and complete, which I can then use for downstream analyses.

Because reading line by line uses so much memory, reading two large files overwhelms my resources, even though the final data structures are much smaller than the memory I would expect to need (the size of the growing list plus one line of data being added at a time).
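
A minimal sketch of one direction I'm considering, assuming the positions file is only needed to seed the dictionary: fill speciesDict directly while streaming the file instead of materializing posList first. Each position string still has to be held in memory as a dictionary key, so this only removes the list layer, not the strings themselves.

# Seed the dictionary straight from the positions file,
# without keeping a separate posList around.
speciesDict = {}
with open("BigFile") as f:
    for line in f:
        speciesDict[line.strip()] = 'N'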

Best Answer

sys.getsizeof(posList) is not telling you what I think you think it is: it reports the size of the list object that holds the lines, and that does not include the size of the lines themselves. Here is some output from reading an approximately 3.5 GB file into a list on my system:

In [2]: lines = []

In [3]: with open('bigfile') as inf:
   ...:     for line in inf:
   ...:         lines.append(line)
   ...:
In [4]: len(lines)
Out[4]: 68318734

In [5]: sys.getsizeof(lines)
Out[5]: 603811872

In [6]: sum(len(l) for l in lines)
Out[6]: 3473926127

In [7]: sum(sys.getsizeof(l) for l in lines)
Out[7]: 6001719285

That is over 6 billion bytes for the string objects alone; add the ~0.6 GB list object from In [5] and the interpreter's own overhead, and in top my interpreter was using about 7.5 GB at that point.

Strings have quite a bit of overhead: 37 bytes apiece, by the look of it:

In [2]: sys.getsizeof('0'*10)
Out[2]: 47

In [3]: sys.getsizeof('0'*100)
Out[3]: 137

In [4]: sys.getsizeof('0'*1000)
Out[4]: 1037

So if your lines are relatively short, a large share of the memory usage will be that overhead.
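
To make the accounting above reusable, here is a minimal sketch (written for Python 2.7, mirroring the session above) that estimates the real footprint of such a list by counting both the list object and every string it holds:

import sys

def estimate_list_footprint(items):
    # Size of the list object itself (essentially an array of pointers)...
    container = sys.getsizeof(items)
    # ...plus the size of every string object it references, which is
    # where most of the memory actually goes when there are many short lines.
    contents = sum(sys.getsizeof(s) for s in items)
    return container + contents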

Regarding python - Memory usage when reading a large file line by line into Python 2.7, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48385340/
