
python - MemoryError with a large .txt file in Python when counting and ranking words

Reprinted. Author: 行者123. Updated: 2023-12-01 09:17:26

I'm trying to create a ranked word-list CSV file from a 500 MB text file containing Finnish text. The script works on small files, but it fails on the 500 MB one.

I'm a beginner with Python, so forgive me if it's rather sloppy. From looking around, I think I may have to process the file line by line:

with open(...) as f:
    for line in f:
        # Do something with 'line'

I'd appreciate any pointers, cheers! My code is below:

#load text
filename = 'finnish_text.txt'
file = open(filename, 'r')
text = file.read()
file.close()

#lowercase and split words by white space
lowercase = text.lower()
words = lowercase.split()

# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]

# ranked word count specify return amount here
from collections import Counter
counter = Counter(stripped)
most_occur = counter.most_common(100)

# export csv file
import csv
with open('word_rank.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for x in most_occur:
        writer.writerow(x)
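For reference, the counting pipeline above can be sanity-checked end to end on a tiny in-memory string (the sample text here is made up for illustration):

```python
import string
from collections import Counter

text = "Hello, hello world! World, world."

# Same steps as the script: lowercase, split on whitespace, strip punctuation
table = str.maketrans('', '', string.punctuation)
words = [w.translate(table) for w in text.lower().split()]

counts = Counter(words)
print(counts.most_common(2))  # [('world', 3), ('hello', 2)]
```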

Edit: I ended up using the second solution given by @Bharel (what a legend) in his comments. I had to change a couple of lines due to encoding issues:

with open(filename, 'r', encoding='Latin-1', errors='replace') as file:

with open('word_rank.csv', 'w', newline='', errors='replace') as csvfile:
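The `errors='replace'` arguments keep a stray bad byte from aborting the run: undecodable bytes are swapped for the Unicode replacement character instead of raising. A quick illustration on raw bytes (the sample bytes are made up here):

```python
data = b'caf\xe9'  # "café" encoded as Latin-1; \xe9 is not valid UTF-8 on its own

# Latin-1 maps every possible byte value to a character, so it never raises
print(data.decode('latin-1'))

# UTF-8 with errors='replace' swaps undecodable bytes for U+FFFD instead of raising
print(data.decode('utf-8', errors='replace'))
```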

Best answer

Switch everything over to generators and it should work:

#load text
filename = 'finnish_text.txt'
# Auto-close when done
with open(filename, 'r') as file:

    #lowercase and split words by white space
    word_iterables = (line.lower().split() for line in file)

    # remove punctuation from each word
    import string
    table = str.maketrans('', '', string.punctuation)

    stripped = (w.translate(table) for it in word_iterables for w in it)

    # ranked word count specify return amount here
    from collections import Counter
    counter = Counter(stripped)

most_occur = counter.most_common(100)

# export csv file
import csv
with open('word_rank.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for x in most_occur:
        writer.writerow(x)

By using generators (parentheses instead of square brackets), all the words are processed lazily instead of being loaded into memory all at once.
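The difference is easy to observe directly. The exact byte counts below are CPython implementation details, so treat them as indicative only:

```python
import sys

# A list comprehension materializes every element up front...
nums_list = [n * 2 for n in range(100_000)]
# ...while a generator expression produces them one at a time, on demand.
nums_gen = (n * 2 for n in range(100_000))

print(sys.getsizeof(nums_list))  # hundreds of kilobytes for the list object
print(sys.getsizeof(nums_gen))   # a small, fixed size regardless of the range
print(sum(nums_gen))             # items are computed only as they are consumed
```

This is why feeding a generator into `Counter` keeps memory flat: only one word exists at a time, plus the counts themselves.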


If you want the most efficient way, I wrote one as a challenge to myself:

import itertools
import operator

#load text
filename = 'finnish_text.txt'
# Auto-close when done
with open(filename, 'r') as file:

    # Lowercase the lines
    lower_lines = map(str.lower, file)

    # Split the words in each line - will return [[word, word], [word, word]]
    word_iterables = map(str.split, lower_lines)

    # Combine the iterables:
    # i.e. [[word, word], [word, word]] -> [word, word, word, word]
    words = itertools.chain.from_iterable(word_iterables)

    import string
    table = str.maketrans('', '', string.punctuation)

    # remove punctuation from each word
    stripped = map(operator.methodcaller("translate", table), words)

    # ranked word count specify return amount here
    from collections import Counter
    counter = Counter(stripped)

most_occur = counter.most_common(100)

# export csv file
import csv
with open('word_rank.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for x in most_occur:
        writer.writerow(x)

It takes full advantage of iterator machinery implemented in C (map and itertools).
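Two of the building blocks used there may be unfamiliar; a minimal illustration of each:

```python
import itertools
import operator

# methodcaller("upper") builds a callable equivalent to lambda s: s.upper()
shout = operator.methodcaller('upper')
print(shout('hi'))  # 'HI'

# chain.from_iterable flattens one level of nesting, lazily
nested = [['a', 'b'], ['c'], ['d', 'e']]
print(list(itertools.chain.from_iterable(nested)))  # ['a', 'b', 'c', 'd', 'e']
```

In the answer, `methodcaller("translate", table)` plays the role of `lambda w: w.translate(table)`, and `chain.from_iterable` flattens the per-line word lists into one stream of words.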

Regarding python - MemoryError with a large .txt file in Python when counting and ranking words, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51114812/
