
python - Deleting indices from a huge pandas DataFrame that won't fit in memory


I have a file with 50 million records and a list of indices that need to be dropped from it. If I read the whole file into a pandas DataFrame, I may run into memory problems (my memory is limited). Say I do this:

df = pd.read_csv('input_file')
df = df.drop(df.index[example_ix_list])
df.to_csv('input_file', index=False)

I may get a memory error:

  File "/home/ec2-user/CloudMatcher/cloudmatcher/core/execution/user_interaction.py", line 768, in process
new_unlabel_df = unlabel_df.drop(unlabel_df.index[list_ix])
File "/home/ec2-user/anaconda2/envs/cloudmatch/lib/python2.7/site-packages/pandas/core/generic.py", line 2162, in drop
dropped = self.reindex(**{axis_name: new_axis})
File "/home/ec2-user/anaconda2/envs/cloudmatch/lib/python2.7/site-packages/pandas/core/frame.py", line 2733, in reindex
**kwargs)
File "/home/ec2-user/anaconda2/envs/cloudmatch/lib/python2.7/site-packages/pandas/core/generic.py", line 2515, in reindex
fill_value, copy).__finalize__(self)
File "/home/ec2-user/anaconda2/envs/cloudmatch/lib/python2.7/site-packages/pandas/core/frame.py", line 2679, in _reindex_axes
fill_value, limit, tolerance)
File "/home/ec2-user/anaconda2/envs/cloudmatch/lib/python2.7/site-packages/pandas/core/frame.py", line 2690, in _reindex_index
allow_dups=False)
File "/home/ec2-user/anaconda2/envs/cloudmatch/lib/python2.7/site-packages/pandas/core/generic.py", line 2627, in _reindex_with_indexers
copy=copy)
File "/home/ec2-user/anaconda2/envs/cloudmatch/lib/python2.7/site-packages/pandas/core/internals.py", line 3897, in reindex_indexer
for blk in self.blocks]
File "/home/ec2-user/anaconda2/envs/cloudmatch/lib/python2.7/site-packages/pandas/core/internals.py", line 1046, in take_nd
allow_fill=True, fill_value=fill_value)
File "/home/ec2-user/anaconda2/envs/cloudmatch/lib/python2.7/site-packages/pandas/core/algorithms.py", line 1467, in take_nd
out = np.empty(out_shape, dtype=dtype)
MemoryError

Question: Can I read the file in chunks with a pandas DataFrame and drop the indices in my list as I go? If so, how? Or is there a better approach I'm missing?

Many thanks.

Best Answer

Try this:

pd.read_csv('input_file', skiprows=example_ix_list).to_csv('input_file', index=False)
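One caveat, not spelled out in the original answer: skiprows counts raw file lines starting at 0, and line 0 is the header, so if example_ix_list holds default DataFrame index positions the values likely need to be shifted by one. A minimal sketch under that assumption:

# Assumption: example_ix_list contains default RangeIndex positions from the DataFrame.
# skiprows is 0-indexed over file lines and line 0 is the header, so shift by +1
# to keep the header and skip the intended data rows.
rows_to_skip = [i + 1 for i in example_ix_list]
pd.read_csv('input_file', skiprows=rows_to_skip).to_csv('input_file', index=False)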

If that still raises a MemoryError, you can use the chunksize parameter. The default index keeps counting across chunks, so the positions in example_ix_list still line up with the full file:

example_ix_list = pd.Index(example_ix_list)

for i, df in enumerate(pd.read_csv('input_file', chunksize=10**5)):
    # keep only the rows whose (global) index is not in the list and
    # append each filtered chunk; write the header only for the first chunk
    df.loc[df.index.difference(example_ix_list)] \
      .to_csv('new_file_name', index=False, header=(i == 0), mode='a')
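If even chunked pandas reads are too heavy, the same filtering can be done without pandas by streaming the file line by line, which keeps memory usage roughly constant. This is an alternative not in the original answer; the sketch assumes the indices refer to 0-based data rows (header excluded) and that no field contains embedded newlines:

skip = set(i + 1 for i in example_ix_list)  # +1 because file line 0 is the header

with open('input_file') as src, open('new_file_name', 'w') as dst:
    for line_no, line in enumerate(src):
        if line_no not in skip:  # copy every line not marked for removal
            dst.write(line)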

Regarding python - deleting indices from a huge pandas DataFrame that won't fit in memory, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49215252/
