
python - Split a +10GB .csv file into equal parts without reading it into memory


I have three files of over 10GB each that I need to split into six smaller files. I would normally use something like R to load the files and partition them into smaller chunks, but their size prevents them from being read into R, even with 20GB of RAM.

I'm not sure how to proceed, and any tips would be greatly appreciated.

Best Answer

In Python, using generators/iterators, you don't need to load all the data into memory.

Just read the file line by line.
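
As a minimal illustration of that idea (a sketch, not part of the original answer): a Python file object is itself a lazy iterator, so this loop holds only one line in memory at a time, even for a 10GB file:

with open('source.csv', 'r') as f:
    for line in f:
        pass  # process one line at a time; memory use stays constant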

The csv library provides reader and writer classes that can do this job.

To split the file, you could write something like this:

import csv

# your input file (10GB); newline='' is recommended when using the csv module
in_csvfile = open('source.csv', "r", newline='')

# reader that reads the file for you line-by-line
reader = csv.DictReader(in_csvfile)

# number of rows written to the current output file
num = 0

# number of the current output file
output_file_num = 1

# your output file
out_csvfile = open('out_{}.csv'.format(output_file_num), "w", newline='')

# the writer is constructed inside the read loop,
# because we need the csv headers to already be available
# to construct the writer object
writer = None

for row in reader:
    num += 1

    # Here you have your data line in the row variable

    # If the writer doesn't exist yet, create one
    if writer is None:
        writer = csv.DictWriter(
            out_csvfile,
            fieldnames=row.keys(),
            delimiter=",", quotechar='"', escapechar='"',
            lineterminator='\n', quoting=csv.QUOTE_NONNUMERIC
        )
        # write the header row into each new output file
        writer.writeheader()

    # Write the row through the writer (into out_csvfile, remember?)
    writer.writerow(row)

    # Once 10000 rows have been written, close the current output file
    # and create a new one
    if num >= 10000:
        output_file_num += 1
        out_csvfile.close()
        writer = None

        # create a new file
        out_csvfile = open('out_{}.csv'.format(output_file_num), "w", newline='')

        # reset the counter
        num = 0

# Close the files
in_csvfile.close()
out_csvfile.close()

I haven't tested it; I wrote it off the top of my head, so there may be bugs :)
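
Note that the snippet above rotates output every 10,000 rows rather than producing the six equal parts the question asks for. Below is a minimal two-pass sketch of one way to get roughly equal parts; the part count, the 'part_{}.csv' names, and the counting pass are illustrative assumptions, not the original answer:

import csv
import math

NUM_PARTS = 6  # assumption: six output files, as in the question

# First pass: count the data rows, still line-by-line.
# Note: this counts physical lines, so fields containing
# embedded newlines would skew the count slightly.
with open('source.csv', 'r', newline='') as f:
    total_rows = sum(1 for _ in f) - 1  # minus the header line

rows_per_file = math.ceil(total_rows / NUM_PARTS)

# Second pass: split, writing the header into every part.
with open('source.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    for part in range(1, NUM_PARTS + 1):
        with open('part_{}.csv'.format(part), 'w', newline='') as out:
            writer = csv.writer(out)
            writer.writerow(header)
            for _ in range(rows_per_file):
                try:
                    writer.writerow(next(reader))
                except StopIteration:
                    break

The counting pass costs one extra scan of the file, but like the answer's approach it never holds more than one line in memory.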

Regarding "python - Split a +10GB .csv file into equal parts without reading it into memory", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/53028454/
