
python - How do you read a large CSV file in evenly sized chunks in Python?

Reposted · Author: IT老高 · Updated: 2023-10-28 21:02:20

Basically, I have the following process:

import csv
reader = csv.reader(open('huge_file.csv', 'rb'))

for line in reader:
    process_line(line)

See this related question. I would like to hand the lines to process_line in batches of 100 rows, to implement batch sharding.

The problem with implementing the related answer is that the csv reader object is not subscriptable and does not support len():

>>> import csv
>>> reader = csv.reader(open('dataimport/tests/financial_sample.csv', 'rb'))
>>> len(reader)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type '_csv.reader' has no len()
>>> reader[10:]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable
>>> reader[10]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable

How can I solve this?

Best Answer

Just wrap your reader in a list to make it subscriptable. Obviously this will break on really large files (see the alternatives in the updates below):

>>> reader = csv.reader(open('big.csv', 'rb'))
>>> lines = list(reader)
>>> print(lines[:100])
...

Further reading: How do you split a list into evenly sized chunks in Python?
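For reference, the usual answer to that linked question steps through the list in fixed-size strides and slices out each chunk. A minimal sketch of that pattern (the `chunks` helper is illustrative, not part of the original answer):

def chunks(lst, n):
    # yield successive n-sized slices of lst; the last slice may be shorter
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

print(list(chunks(list(range(10)), 3)))
# => [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]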


Update 1 (list version): Another possible approach is to process each chunk as it arrives while iterating over the rows:

#!/usr/bin/env python

import csv
reader = csv.reader(open('4956984.csv', 'rb'))

chunk, chunksize = [], 100

def process_chunk(chunk):
    print(len(chunk))
    # do something useful ...

for i, line in enumerate(reader):
    if i % chunksize == 0 and i > 0:
        process_chunk(chunk)
        del chunk[:]  # or: chunk = []
    chunk.append(line)

# process the remainder
process_chunk(chunk)
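As an aside (not part of the original answer), the manual counter bookkeeping can also be replaced with itertools.islice, which pulls a fixed-size batch off the reader on each call. A sketch in the same Python 2 style, reusing the `process_chunk` defined above:

from itertools import islice
import csv

reader = csv.reader(open('4956984.csv', 'rb'))

while True:
    chunk = list(islice(reader, 100))  # take up to 100 rows from the reader
    if not chunk:                      # reader exhausted
        break
    process_chunk(chunk)               # process_chunk as defined above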

Update 2 (generator version): I haven't benchmarked it, but maybe you can improve performance by using a chunk generator:

#!/usr/bin/env python

import csv
reader = csv.reader(open('4956984.csv', 'rb'))

def gen_chunks(reader, chunksize=100):
    """
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices.
    """
    chunk = []
    for i, line in enumerate(reader):
        if i % chunksize == 0 and i > 0:
            yield chunk
            del chunk[:]  # or: chunk = []
        chunk.append(line)
    yield chunk

for chunk in gen_chunks(reader):
    print(chunk)  # process chunk

# test gen_chunks on some dummy sequence:
for chunk in gen_chunks(range(10), chunksize=3):
    print(chunk)  # process chunk

# => yields
# [0, 1, 2]
# [3, 4, 5]
# [6, 7, 8]
# [9]

There is a small gotcha, as @totalhack points out:

Be aware that this yields the same object over and over with different contents. This works fine if you plan on doing everything you need to with the chunk between each iteration.
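If the chunks need to outlive the iteration (say, collecting them all into a list), the fix is the one the inline comment already hints at: rebind `chunk` to a fresh list instead of clearing it in place, so each yield returns a distinct object. A sketch of that variant:

def gen_chunks(reader, chunksize=100):
    """Like above, but yields a distinct list per chunk."""
    chunk = []
    for i, line in enumerate(reader):
        if i % chunksize == 0 and i > 0:
            yield chunk
            chunk = []  # rebind to a fresh list instead of mutating in place
        chunk.append(line)
    yield chunk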

Regarding "python - How do you read a large CSV file in evenly sized chunks in Python?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/4956984/
