
python - Splitting a csv file based on a particular column using Python

Reposted. Author: 太空狗. Updated: 2023-10-30 02:38:30

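I'm a Python beginner who has written a few basic scripts. My latest challenge is to take a very large csv file (10 GB+) and split it into a number of smaller files based on the value of a particular variable in each row.

For example, the file might look like this: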

Category,Title,Sales
"Books","Harry Potter",1441556
"Books","Lord of the Rings",14251154
"Series", "Breaking Bad",6246234
"Books","The Alchemist",12562166
"Movie","Inception",1573437

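I want to split the file into separate files: Books.csv, Series.csv, Movie.csv

In reality there will be hundreds of categories, and they will not be sorted. In this case they are in the first column, but in the future they may not be.

I have found a few solutions online, but none in Python. There is a really simple AWK command that can do this in one line, but I don't have access to AWK at work.

I wrote the following code, which works, but I think it is probably very inefficient. Can anyone suggest how to speed it up?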

import csv

#Creates empty set - this will be used to store the values that have already been used
filelist = set()

#Opens the large csv file in "read" mode
with open('//directory/largefile', 'r') as csvfile:

    #Read the first row of the large file and store the whole row as a string (headerstring)
    read_rows = csv.reader(csvfile)
    headerrow = next(read_rows)
    headerstring=','.join(headerrow)

    for row in read_rows:

        #Store the whole row as a string (rowstring)
        rowstring=','.join(row)

        #Defines filename as the first entry in the row - This could be made dynamic so that the user inputs a column name to use
        filename = (row[0])

        #This basically makes sure it is not looking at the header row.
        if filename != "Category":

            #If the filename is not in the filelist set, add it to the list and create new csv file with header row.
            if filename not in filelist:
                filelist.add(filename)
                with open('//directory/subfiles/' +str(filename)+'.csv','a') as f:
                    f.write(headerstring)
                    f.write("\n")
                    f.write(rowstring)
                    f.write("\n")
                    f.close()
            #If the filename is in the filelist set, append the current row to the existing csv file.
            else:
                with open('//directory/subfiles/' +str(filename)+'.csv','a') as f:
                    f.write(rowstring)
                    f.write("\n")
                    f.close()

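Thanks!

Best answer

A memory-efficient approach that also avoids repeatedly reopening files to append to them (as long as you are not going to end up with a huge number of open file handles) is to use a dict that maps each category to a file object. If the file for a category is not open yet, create it and write the header; then always write the row to the corresponding file, for example: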

import csv

# newline='' is what the csv module expects for files it reads or writes
with open('somefile.csv', newline='') as fin:
    csvin = csv.DictReader(fin)
    # Category -> open file lookup
    outputs = {}
    for row in csvin:
        cat = row['Category']
        # Open a new file and write the header
        if cat not in outputs:
            fout = open('{}.csv'.format(cat), 'w', newline='')
            dw = csv.DictWriter(fout, fieldnames=csvin.fieldnames)
            dw.writeheader()
            outputs[cat] = fout, dw
        # Always write the row
        outputs[cat][1].writerow(row)
    # Close all the files
    for fout, _ in outputs.values():
        fout.close()
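
The caveat about open file handles matters when there are hundreds of categories, since operating systems limit how many files a process may hold open. The following is only a sketch of one way to work around that, not part of the original answer: it keeps at most max_open files open, closes the least recently used one when the limit is reached, and reopens it in append mode if that category appears again. The function name split_csv_by_column and the parameters column, out_dir and max_open are hypothetical names chosen for illustration, and the sketch assumes every category value is usable as a file name.

```python
import csv
import os
from collections import OrderedDict


def split_csv_by_column(src_path, column, out_dir, max_open=100):
    """Write one CSV per distinct value of `column` into `out_dir`.

    Hypothetical helper: at most `max_open` output files stay open at once.
    Returns the set of category values that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    open_files = OrderedDict()  # category -> (file object, DictWriter)
    seen = set()                # categories whose header is already on disk

    with open(src_path, newline='') as fin:
        reader = csv.DictReader(fin)
        try:
            for row in reader:
                cat = row[column]
                if cat in open_files:
                    # Mark this category as most recently used
                    open_files.move_to_end(cat)
                else:
                    # Evict the least recently used file if at the limit
                    if len(open_files) >= max_open:
                        old_cat, (old_file, old_writer) = open_files.popitem(last=False)
                        old_file.close()
                    # Assumption: the category value is a valid file name
                    path = os.path.join(out_dir, '{}.csv'.format(cat))
                    mode = 'a' if cat in seen else 'w'
                    fout = open(path, mode, newline='')
                    writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
                    if cat not in seen:
                        writer.writeheader()
                        seen.add(cat)
                    open_files[cat] = (fout, writer)
                # Always write the row
                open_files[cat][1].writerow(row)
        finally:
            # Close whatever is still open, even if an error occurred
            for fout, _ in open_files.values():
                fout.close()
    return seen
```

A call such as split_csv_by_column('somefile.csv', 'Category', 'subfiles') would split on the Category column; passing a different column name covers the case where the category is not in the first column. If the data is mostly unsorted, a small max_open causes frequent reopening, so the limit is a trade-off between handle usage and speed.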

Regarding python - Splitting a csv file based on a particular column using Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46847803/
