
Python/Pandas writing a file line by line :: memory usage


I have a large DataFrame loaded into pandas memory (~9 GB). I am trying to write out a text file that follows a given format (Vowpal Wabbit), but I am puzzled by the memory usage and performance. Although the data is large (48 million rows), the initial load into pandas goes reasonably well. Writing the file out takes at least 6+ hours, absolutely crushes my laptop, and eats up nearly all of my RAM (32 GB). Naively, I assumed the operation only touches one row at a time, so RAM usage would stay very small. Is there a more efficient way to handle this data?

# "w" (text mode): writing a str to a file opened with "wb" raises
# a TypeError in Python 3
with open("C:\\Users\\Desktop\\DATA\\train_mobile2.vw", "w") as outfile:
    for index, row in train.iterrows():
        # VW label: -1 for a non-click, 1 for a click
        if row['click'] == 0:
            vwline = "-1 "
        else:
            vwline = "1 "
        vwline += "|a C1_" + str(row['C1']) + \
                  " |b banpos_" + str(row['banner_pos']) + \
                  " |c siteid_" + str(row['site_id']) + \
                  " sitedom_" + str(row['site_domain']) + \
                  " sitecat_" + str(row['site_category']) + \
                  " |d appid_" + str(row['app_id']) + \
                  " app_domain_" + str(row['app_domain']) + \
                  " app_cat_" + str(row['app_category']) + \
                  " |e d_id_" + str(row['device_id']) + \
                  " d_ip_" + str(row['device_ip']) + \
                  " d_os_" + str(row['device_os']) + \
                  " d_make_" + str(row['device_make']) + \
                  " d_mod_" + str(row['device_model']) + \
                  " d_type_" + str(row['device_type']) + \
                  " d_conn_" + str(row['device_conn_type']) + \
                  " d_geo_" + str(row['device_geo_country']) + \
                  " |f num_a:" + str(row['C17']) + \
                  " numb:" + str(row['C18']) + \
                  " numc:" + str(row['C19']) + \
                  " numd:" + str(row['C20']) + \
                  " nume:" + str(row['C22']) + \
                  " numf:" + str(row['C24']) + \
                  " |g c21_" + str(row['C21']) + \
                  " C23_" + str(row['C23']) + \
                  " |h hh_" + str(row['hh']) + \
                  " |i doe_" + str(row['doe'])
        outfile.write(vwline + "\n")

In response to a user's suggestion, I wrote the following code, but when I run the last line I get an error: "unsupported operand type(s) for +: 'numpy.ndarray' and 'str'"

lines_T = np.where(train['click'] == 0, "-1 ", "1 ") + \
    "|a C1_" + train['C1'].astype('str') + \
    " |b banpos_" + train['banner_pos'].astype('str') + \
    ....
    " |h hh_" + train['hh'].astype('str') + \
    " |i doe_" + train['doe'].astype('str')  # ERROR HERE

lines_T.to_csv("C:\\Users\\Desktop\\DATA\\KAGGLE\\mobile\\train_mobile.vw",
               mode='a', header=False, index=False)

Best answer

Not sure about the memory usage, but this should definitely be faster:

# Wrapping the np.where result in a pandas Series lets every "+" below
# concatenate element-wise; a bare ndarray is what triggers the
# "unsupported operand type(s)" error from the question
lines = pd.Series(np.where(train['click'] == 0, "-1 ", "1 "), index=train.index) + \
    "|a C1_" + train['C1'].astype('str') + \
    " |b banpos_" + train['banner_pos'].astype('str') + \
    ...

then save the lines:

lines.to_csv(outfile, index=False, header=False)
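
To make the fix concrete: the asker's follow-up error occurs because np.where returns a plain NumPy array, which does not support + with a Python string; wrapping it in a pandas Series makes every concatenation element-wise. Below is a minimal, self-contained sketch on a toy DataFrame (column names taken from the question; the values are invented for illustration):

import numpy as np
import pandas as pd

# Toy stand-in for the real 48M-row frame; only a few of the
# question's columns are used here
train = pd.DataFrame({
    'click': [0, 1, 0],
    'C1': [1005, 1002, 1005],
    'banner_pos': [0, 1, 0],
})

# A bare ndarray from np.where cannot be "+"-ed with a str;
# a Series can, element by element
labels = pd.Series(np.where(train['click'] == 0, "-1 ", "1 "),
                   index=train.index)

lines = labels + "|a C1_" + train['C1'].astype(str) + \
    " |b banpos_" + train['banner_pos'].astype(str)

print(lines.tolist())
# ['-1 |a C1_1005 |b banpos_0',
#  '1 |a C1_1002 |b banpos_1',
#  '-1 |a C1_1005 |b banpos_0']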

If memory becomes an issue, you can also do this in batches (say, a few million records at a time), as sketched below.
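
A sketch of that batched variant, assuming a hypothetical build_lines() helper that holds the full vectorized expression from above and a hypothetical output path:

# Hypothetical helper: builds the VW lines for one slice of the frame
def build_lines(chunk):
    labels = pd.Series(np.where(chunk['click'] == 0, "-1 ", "1 "),
                       index=chunk.index)
    return labels + "|a C1_" + chunk['C1'].astype(str)  # ... plus the remaining namespaces

chunk_size = 2_000_000  # a few million rows per batch, per the suggestion
with open("train_mobile.vw", "w") as outfile:
    for start in range(0, len(train), chunk_size):
        chunk = train.iloc[start:start + chunk_size]
        # Appending each batch keeps only chunk_size lines in memory at once
        build_lines(chunk).to_csv(outfile, index=False, header=False)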

A similar question on Python/Pandas writing a file line by line :: memory usage can be found on Stack Overflow: https://stackoverflow.com/questions/26867266/
