python - How to make the contents of a large text file unique

Reposted. Author: 行者123. Updated: 2023-11-28 20:43:39

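I have a text file with 34,686,770 lines. Every line is between 50 and 250 characters long. Some lines appear more than once, and I want each line to appear only once.

I can't load all of these lines into a list to deduplicate them. For example, given this input: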

Only has limited access to OBDII data stream unless you pay more money to upgrade the software.
I thought the author should have used more dialogue. It reads like a history book.
I thought the author should have used more dialogue. It reads like a history book.

I need to produce a file that contains only the unique lines:

Only has limited access to OBDII data stream unless you pay more money to upgrade the software.
I thought the author should have used more dialogue. It reads like a history book.

How can I do this?

Best answer

Without storing all of the text in memory:

with open('text.txt') as text:
    with open('unique.txt', 'w') as output:
        seen = set()
        for line in text:
            line_hash = hash(line)
            if line_hash not in seen:
                output.write(line)
                seen.add(line_hash)

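Instead of the lines themselves, we store only a hash of each line, which is much smaller. A hash collision is still possible, of course: if two different lines hash to the same value, this code treats the second one as a duplicate and skips a unique line that should have been written.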
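
If silently dropping a line on a collision is a concern, one possible variation is to store a wider digest from the standard `hashlib` module instead of the value returned by the built-in `hash()`. The sketch below is an assumption of mine, not part of the original answer: the function name `write_unique_lines` and the choice of a 16-byte BLAKE2b digest are illustrative. It reads and writes in binary mode so no text decoding is involved. Collisions remain theoretically possible, only far less likely, and the set still grows with the number of distinct lines.

```python
import hashlib


def write_unique_lines(src_path, dst_path):
    """Copy src_path to dst_path, keeping only the first occurrence of each line.

    Hypothetical helper: only a 16-byte digest per distinct line is kept in memory.
    Returns the number of lines written.
    """
    seen = set()  # digests of the lines written so far
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        for line in src:
            # 128-bit BLAKE2b digest of the raw line bytes (newline included)
            digest = hashlib.blake2b(line, digest_size=16).digest()
            if digest not in seen:
                dst.write(line)
                seen.add(digest)
    return len(seen)
```

Calling `write_unique_lines('text.txt', 'unique.txt')` would process the same files as the answer above. As in the original code, the order of first occurrences is preserved, and a final line without a trailing newline is treated as different from the same text followed by a newline.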

A similar question about "python - How to make the contents of a large text file unique" can be found on Stack Overflow: https://stackoverflow.com/questions/28543279/
