
python - Need to merge 2 large csv files line by line in python

Reposted. Author: 太空宇宙. Updated: 2023-11-04 03:20:53

I am working with 2 large dataset files for a project. I managed to clean the files line by line. However, when I tried to apply the same logic to merge the 2 files based on a common column, it failed. The problem is that the second loop runs to completion first, and only then does the top loop run (I don't know why this happens). I tried using numpy:

import numpy as np

buys = np.genfromtxt('buys_dtsep.dat', delimiter=",", dtype='str')
clicks = np.genfromtxt('clicks_dtsep.dat', delimiter=",", dtype='str')
f = open('combined.dat', 'w')
for s in clicks:
    for s2 in buys:
        # process data

But loading a file with 33 million entries into an array is not feasible, because of memory limits and the time it takes to load the data into arrays before processing. I tried to process the files line by line instead, to avoid running out of memory:

import csv

buys = open('buys_dtsep.dat')
clicks = open('clicks_dtsep.dat')
f = open('combined.dat', 'w')

csv_buys = csv.reader(buys)
csv_clicks = csv.reader(clicks)

for s in csv_clicks:
    print 'file 1 row x'  # to check when it loops
    for s2 in csv_buys:
        print s2[0]  # check looped data
        # do merge op

The printed output should be:

file 1 row 0
file 2 row 0
...
file 2 row x
file 1 row 1
and so on

The output I actually get is:

file 2 row 0
file 2 row 1
...
file 2 row x
file 1 row 0
...
file 1 row z

If the loop problem above can be solved, the files can then be merged line by line.
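For what it's worth, the failure mode here is that csv.reader is a one-shot iterator over its file: after the first pass of the outer loop, the inner reader is exhausted and yields nothing on later passes. A minimal sketch of one fix, rewinding the inner file with seek(0) on every outer iteration (using hypothetical in-memory StringIO stand-ins instead of the real .dat files):

```python
import csv
from io import StringIO

# Tiny in-memory stand-ins for the real clicks/buys files (made-up data)
clicks_file = StringIO("c1,a\nc2,b\n")
buys_file = StringIO("b1,x\nb2,y\n")

order = []
for s in csv.reader(clicks_file):
    order.append('click ' + s[0])
    buys_file.seek(0)                  # rewind the inner file each pass...
    for s2 in csv.reader(buys_file):   # ...and rebuild its reader
        order.append('buy ' + s2[0])

# order now interleaves: each click row followed by every buy row
```

Note that re-scanning the whole inner file once per outer row is still O(n*m), so for 33 million entries the grouping approach in the answer below scales far better.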

Update: sample data

Buys file sample:

420374,2014-04-06,18:44:58.314,214537888,12462,1
420374,2014-04-06,18:44:58.325,214537850,10471,1
281626,2014-04-06,09:40:13.032,214535653,1883,1
420368,2014-04-04,06:13:28.848,214530572,6073,1
420368,2014-04-04,06:13:28.858,214835025,2617,1
140806,2014-04-07,09:22:28.132,214668193,523,1
140806,2014-04-07,09:22:28.176,214587399,1046,1

Clicks file sample:

420374,2014-04-06,18:44:58,214537888,0
420374,2014-04-06,18:41:50,214537888,0
420374,2014-04-06,18:42:33,214537850,0
420374,2014-04-06,18:42:38,214537850,0
420374,2014-04-06,18:43:02,214537888,0
420374,2014-04-06,18:43:10,214537888,0
420369,2014-04-07,19:39:43,214839373,0
420369,2014-04-07,19:39:56,214684513,0

Best Answer

The following approach should hopefully help. It is designed to be faster and to reduce your memory requirements:

import csv
from heapq import merge
from itertools import groupby, ifilter

def get_click_entries(key):
    with open('clicks.csv', 'rb') as f_clicks:
        for entry in ifilter(lambda x: int(x[0]) == key, csv.reader(f_clicks)):
            entry.insert(4, '')  # add empty missing column
            yield entry

# First create a set holding all column 0 click entries

with open('clicks.csv', 'rb') as f_clicks:
    csv_clicks = csv.reader(f_clicks)
    click_keys = {int(cols[0]) for cols in csv_clicks}

with open('buys.csv', 'rb') as f_buys, \
     open('clicks.csv', 'rb') as f_clicks, \
     open('merged.csv', 'wb') as f_merged:

    csv_buys = csv.reader(f_buys)
    csv_clicks = csv.reader(f_clicks)
    csv_merged = csv.writer(f_merged)

    for k, g in groupby(csv_buys, key=lambda x: int(x[0])):
        if k in click_keys:
            buys = sorted(g, key=lambda x: (x[1], x[2]))
            clicks = sorted(get_click_entries(k), key=lambda x: (x[1], x[2]))
            # merge the two sorted lists based on the timestamp columns
            csv_merged.writerows(merge(buys, clicks))
            click_keys.remove(k)
        csv_merged.writerows(g)  # writes buys with no matching clicks

    # Write any remaining click entries

    for k in click_keys:
        csv_merged.writerows(get_click_entries(k))

For your two sample files, this produces the following output:

140806,2014-04-07,09:22:28.132,214668193,523,1
140806,2014-04-07,09:22:28.176,214587399,1046,1
281626,2014-04-06,09:40:13.032,214535653,1883,1
420368,2014-04-04,06:13:28.848,214530572,6073,1
420368,2014-04-04,06:13:28.858,214835025,2617,1
420374,2014-04-06,18:41:50,214537888,,0
420374,2014-04-06,18:42:33,214537850,,0
420374,2014-04-06,18:42:38,214537850,,0
420374,2014-04-06,18:43:02,214537888,,0
420374,2014-04-06,18:43:10,214537888,,0
420374,2014-04-06,18:44:58,214537888,,0
420374,2014-04-06,18:44:58.314,214537888,12462,1
420374,2014-04-06,18:44:58.325,214537850,10471,1
420369,2014-04-07,19:39:43,214839373,,0
420369,2014-04-07,19:39:56,214684513,,0

It works by first building a set of all column 0 click entries, which means the whole clicks file never needs re-reading for a key that is known to be absent. It then reads each group of matching column 0 entries from buys along with the corresponding list of column 0 entries from clicks. These are each sorted by timestamp and merged together in order. Each key is then removed from the set so its entries are not re-read.
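The merging step relies on a property worth spelling out: each row is a list whose first three columns are session id, date, and time, and Python compares lists element by element, so heapq.merge interleaves the two already-sorted lists in timestamp order without materializing the result. A small sketch using rows adapted from the sample data above:

```python
from heapq import merge

# One buy row and two click rows for session 420374, each list already
# sorted by (date, time); the clicks carry an empty padding column.
buys = [['420374', '2014-04-06', '18:44:58.314', '214537888', '12462', '1']]
clicks = [['420374', '2014-04-06', '18:41:50', '214537888', '', '0'],
          ['420374', '2014-04-06', '18:44:58', '214537888', '', '0']]

# Lists compare element-wise, so rows come out ordered by id, date, time;
# note '18:44:58' sorts before '18:44:58.314' as a plain string prefix.
merged = list(merge(buys, clicks))
```

This also shows why the clicks in the merged output above land just before the buy with the same second: string comparison puts the shorter timestamp first.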

About "python - Need to merge 2 large csv files line by line in python": we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34783461/
