
Python: how to speed up this file loading

Reposted. Author: 行者123. Updated: 2023-11-28 23:00:18

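I am looking for a way to speed up loading a file.

The data contains about 1 million lines, tab-separated ("\t"), UTF-8 encoded, and parsing the whole file with the code below takes about 9 seconds. However, I would like it to take almost one second!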

import codecs
import sys

def load(filename):
    features = []
    with codecs.open(filename, 'rb', 'utf-8') as f:
        previous = ()  # empty tuple, so the first comparison is tuple vs. tuple
        for n, s in enumerate(f):
            splitted = tuple(s.rstrip().split("\t"))
            if len(splitted) != 2:
                sys.exit("wrong format!")
            if previous >= splitted:
                sys.exit("unordered feature")
            previous = splitted
            features.append(splitted)
    return features

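I wonder whether storing the data in some binary format would speed things up, or whether I could benefit from NumPy or another library to load the data faster.

Maybe you can also point me to another speed bottleneck?

Edit: So I tried some of your ideas, thanks! By the way, I really do need the (string, string) tuples in one huge list... Here are the results; I gained 50% of the time :) Now I am going to look at NumPy binary data, since I noticed that another huge file loads really fast...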
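On the binary-format idea, one option that needs no extra library is to parse the text file once and cache the resulting list with `pickle`. The sketch below is Python 3 and only illustrative: `parse_tsv`, `load_cached`, and the `.pickle` suffix for the cache file are hypothetical names, not something from the question, and whether unpickling beats re-parsing on a given machine has to be measured.

```python
import os
import pickle


def parse_tsv(filename):
    """Parse a UTF-8, tab-separated file into a list of tuples."""
    with open(filename, encoding="utf-8") as f:
        return [tuple(line.rstrip().split("\t")) for line in f]


def load_cached(filename, cache_path=None):
    """Return the parsed rows, reusing a pickle cache when it is up to date.

    The default cache location (source name + ".pickle") is an assumption
    made for this sketch.
    """
    if cache_path is None:
        cache_path = filename + ".pickle"
    # Reuse the cache only if it is at least as recent as the source file.
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(filename)):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    rows = parse_tsv(filename)
    with open(cache_path, "wb") as f:
        pickle.dump(rows, f, protocol=pickle.HIGHEST_PROTOCOL)
    return rows
```

The first call pays the full parsing cost and writes the cache; later calls read the cache as long as the source file has not been modified since.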

import codecs
import datetime

# myfile (the path to the data file) is assumed to be defined elsewhere.

def load0(filename):
    with codecs.open(filename, 'rb', 'utf-8') as f:
        return f.readlines()

def load1(filename):
    with codecs.open(filename, 'rb', 'utf-8') as f:
        return [tuple(x.rstrip().split("\t")) for x in f.readlines()]

def load3(filename):
    features = []
    with codecs.open(filename, 'rb', 'utf-8') as f:
        for n, s in enumerate(f):
            splitted = tuple(s.rstrip().split("\t"))
            features.append(splitted)
    return features

def load4(filename):
    with codecs.open(filename, 'rb', 'utf-8') as f:
        for s in f:
            yield tuple(s.rstrip().split("\t"))

a = datetime.datetime.now()
r0 = load0(myfile)
b = datetime.datetime.now()
print("f.readlines(): %s" % (b-a))

a = datetime.datetime.now()
r1 = load1(myfile)
b = datetime.datetime.now()
print("""[tuple(x.rstrip().split("\\t")) for x in f.readlines()]: %s""" % (b-a))

a = datetime.datetime.now()
r3 = load3(myfile)
b = datetime.datetime.now()
print("""load3: %s""" % (b-a))
if r1 == r3: print("OK: speeded and similars!")

a = datetime.datetime.now()
r4 = [x for x in load4(myfile)]
b = datetime.datetime.now()
print("""load4: %s""" % (b-a))
if r4 == r3: print("OK: speeded and similars!")

Results:

f.readlines(): 0:00:00.208000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:02.310000
load3: 0:00:07.883000
OK: speeded and similars!
load4: 0:00:07.943000
OK: speeded and similars!
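In these numbers, `f.readlines()` alone takes about 0.2 s, and the versions that iterate over the `codecs` stream line by line are the slowest. That pattern is consistent with per-line iteration through `codecs.open` being a large part of the cost. A possible alternative, sketched here for Python 3, is `io.open` (the same function as the built-in `open`), whose decoding and line splitting are implemented in C. The name `load_fast` is hypothetical, and raising `ValueError` instead of calling `sys.exit` is a choice made for this sketch.

```python
import io


def load_fast(filename):
    """Load a UTF-8 TSV file into a list of 2-tuples, checking the order."""
    features = []
    previous = ()  # compares lower than any non-empty tuple
    with io.open(filename, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            row = tuple(line.rstrip().split("\t"))
            if len(row) != 2:
                raise ValueError("wrong format at line %d" % line_number)
            if previous >= row:
                raise ValueError("unordered feature at line %d" % line_number)
            previous = row
            features.append(row)
    return features
```

This keeps the format and ordering checks from the original `load`, so any speed difference would come from the file layer rather than from dropping validation.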

Very strangely, I noticed that on two consecutive runs I can get almost double the time (but not every time):

>>> ================================ RESTART ================================
>>>
f.readlines(): 0:00:00.220000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:02.479000
load3: 0:00:08.288000
OK: speeded and similars!
>>> ================================ RESTART ================================
>>>
f.readlines(): 0:00:00.279000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:04.983000
load3: 0:00:10.404000
OK: speeded and similars!
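Single measurements like these can swing with other activity on the machine, the operating system's file cache, and garbage collection, so one run is a weak basis for comparing loaders. A small harness that repeats each call and keeps the fastest time gives steadier numbers. This is a Python 3 sketch using the standard `timeit` module; `best_time` is a hypothetical helper name.

```python
import timeit


def best_time(func, *args, repeat=5):
    """Return the fastest of `repeat` single-call timings, in seconds.

    timeit turns off garbage collection while timing by default, which
    removes one source of run-to-run variation.
    """
    timings = timeit.repeat(lambda: func(*args), number=1, repeat=repeat)
    return min(timings)
```

For example, `best_time(load1, myfile)` and `best_time(load3, myfile)` could then be compared directly.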

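Latest edit: well, I tried modifying the code to use numpy.load... and the result is very strange to me... starting from a "normal" file with my 1022860 strings and 10 KB, after doing numpy.save(numpy.array(load1(myfile))) I ended up at 895 MB! Then, reloading it with numpy.load(), I get this kind of timing on consecutive runs: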

  >>> ================================ RESTART ================================
loading: 0:00:11.422000 done.
>>> ================================ RESTART ================================
loading: 0:00:00.759000 done.

Maybe numpy does some memory trick to avoid reloading in the future?
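One likely contributor to the 895 MB figure: when `numpy.array` is given Python strings, it builds a fixed-width Unicode array, in which every element is stored at the width of the longest string, at 4 bytes per character. The helper below is a plain-Python estimate of that data size under this assumption; `fixed_width_unicode_bytes` is a hypothetical name and the function does not call NumPy. As for the faster second run, the operating system's file cache is a plausible explanation, though nothing in the question confirms it.

```python
def fixed_width_unicode_bytes(rows):
    """Estimate the data size of a fixed-width Unicode array built from rows.

    Assumes all rows have the same number of columns and that every element
    occupies (length of the longest string) * 4 bytes.
    """
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    longest = max((len(s) for row in rows for s in row), default=0)
    return n_rows * n_cols * longest * 4


# Two rows, two columns, longest string has 4 characters: 2 * 2 * 4 * 4
print(fixed_width_unicode_bytes([("a", "bbbb"), ("cc", "d")]))  # 64
```

With about one million rows of two columns, a longest string of roughly 110 characters would already give a figure near 895 MB. That length is a guess, not something stated in the question.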

Best answer

Try this version. Since you mentioned that the checks are not important, I removed them.

import codecs

def load(filename):
    with codecs.open(filename, 'rb', 'utf-8') as f:
        for s in f:
            yield tuple(s.rstrip().split("\t"))

results = [x for x in load('somebigfile.txt')]
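As a variation on the same generator idea, the standard `csv` module can do the tab splitting in C. This is a Python 3 sketch, not part of the answer: `load_csv` is a hypothetical name, and unlike the version above it does not strip trailing whitespace from each line, so it fits only if the fields carry none. Whether it is faster for a given file needs to be measured.

```python
import csv


def load_csv(filename):
    """Yield each tab-separated line of a UTF-8 file as a tuple of strings."""
    # newline="" lets the csv module handle line endings itself.
    with open(filename, encoding="utf-8", newline="") as f:
        # QUOTE_NONE treats quote characters as ordinary data.
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            yield tuple(row)
```

Usage mirrors the answer: `results = list(load_csv('somebigfile.txt'))`.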

Regarding "python : how to speed up this file loading", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/12250781/
