
python - pyarrow memory leak?


To parse larger files, I need to write a large number of Parquet files in a loop. However, the memory consumed by this task seems to grow with every iteration, whereas I would expect it to stay constant (since nothing should be kept appended in memory). This makes it hard to scale.

I've added a minimal reproducible example that creates 10,000 Parquet files and appends to them in a loop.

import resource
import random
import string
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd


def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

schema = pa.schema([
    pa.field('test', pa.string()),
])

resource.setrlimit(resource.RLIMIT_NOFILE, (1000000, 1000000))
number_files = 10000
number_rows_increment = 1000
number_iterations = 100

writers = [pq.ParquetWriter('test_' + id_generator() + '.parquet', schema) for i in range(number_files)]

for i in range(number_iterations):
    for writer in writers:
        table_to_write = pa.Table.from_pandas(
            pd.DataFrame({'test': [id_generator() for i in range(number_rows_increment)]}),
            preserve_index=False,
            schema=schema,
            nthreads=1)
        table_to_write = table_to_write.replace_schema_metadata(None)
        writer.write_table(table_to_write)
    print(i)

for writer in writers:
    writer.close()
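
For reference (my addition, not part of the original question): the script only uses the resource module to raise the open-file limit. To confirm that memory really grows with each iteration, one can also log the process's peak RSS, for example by replacing the bare print(i) with something like the sketch below. This assumes Linux, where ru_maxrss is reported in kilobytes (macOS reports bytes).

import resource

def peak_rss_mb():
    # Peak resident set size of the current process so far.
    # On Linux, ru_maxrss is in kilobytes; divide by 1024 to get megabytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# Example usage inside the outer loop, instead of the bare print(i):
# print(i, f'peak RSS: {peak_rss_mb():.1f} MB')
print(f'peak RSS: {peak_rss_mb():.1f} MB')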

Does anyone know what is causing this leak and how to prevent it?

Best Answer

We aren't sure what is going wrong, but some other users have reported as-yet-undiagnosed memory leaks. I added your example to one of the tracking JIRA issues: https://issues.apache.org/jira/browse/ARROW-3324
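
Since the cause was still undiagnosed at the time, one way to narrow it down (my own suggestion, not part of the accepted answer) is to take pandas out of the loop and build the Arrow table directly from Python lists: if memory still grows at the same rate, the pandas conversion can be ruled out and the ParquetWriter path becomes the more likely suspect. A minimal single-writer sketch:

import random
import string
import pyarrow as pa
import pyarrow.parquet as pq

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

schema = pa.schema([pa.field('test', pa.string())])

def make_table(n_rows):
    # Build the batch directly as an Arrow array, bypassing pandas entirely.
    values = pa.array([id_generator() for _ in range(n_rows)], type=pa.string())
    return pa.Table.from_arrays([values], names=['test'])

writer = pq.ParquetWriter('test_direct.parquet', schema)
for i in range(100):
    writer.write_table(make_table(1000))
    print(i)
writer.close()

The same idea carries over to the multi-writer setup from the question by calling make_table inside the nested loop.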

Regarding "python - pyarrow memory leak?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53016802/
