python - Parallel computing for a large text file


I'm trying to find and correct some misspellings in a very large text file. Essentially, I run this code:

import re

ocr = open("text.txt")
text = ocr.readlines()
clean_text = []
for line in text:
    last = re.sub("^(\\|)([0-9])(\\s)([A-Z][a-z]+[a-z])\\,", "1\\2\t\\3\\4,", line)
    clean_text.append(last)
new_text = open("new_text.txt", "w", newline="\n")
for line in clean_text:
    new_text.write(line)
new_text.close()

In practice I apply the "re.sub" function more than 1,500 times, and "text.txt" has 100,000 lines. Can I split the text into several parts and use a different core for each part?

Best Answer

This applies the text-processing function (here, the re.sub call from the question) to NUM_CORES equally sized chunks of the input text file, then writes the results out, preserving the order of the original input file.

from multiprocessing import Pool, cpu_count
import numpy as np
import re

NUM_CORES = cpu_count()

def process_text(input_textlines):
    clean_text = []
    for line in input_textlines:
        cleaned = re.sub("^(\\|)([0-9])(\\s)([A-Z][a-z]+[a-z])\\,", "1\\2\t\\3\\4,", line)
        clean_text.append(cleaned)
    return "".join(clean_text)

# the guard is required on platforms that start workers with "spawn"
# (e.g. Windows), otherwise each worker re-executes the module code
if __name__ == "__main__":
    # read in data and split it into NUM_CORES roughly equal-sized chunks
    with open('data/text.txt', 'r') as f:
        lines = f.readlines()

    text_chunks = np.array_split(lines, NUM_CORES)

    # process each chunk in parallel
    with Pool(NUM_CORES) as pool:
        results = pool.map(process_text, text_chunks)

    # write out results, preserving the original line order
    with open("new_text.txt", "w", newline="\n") as f:
        for text_chunk in results:
            f.write(text_chunk)
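
Since the question mentions more than 1,500 re.sub calls, it may also help to compile each pattern once at module level, so every worker reuses the compiled objects instead of re-parsing the pattern strings on each of the 100,000 lines. Below is a minimal sketch of process_text under that assumption; the RULES list is hypothetical and stands in for the asker's actual substitution rules:

import re

# Hypothetical (pattern, replacement) pairs standing in for the ~1,500
# substitutions from the question; re.compile runs once per pattern.
RULES = [
    (re.compile("^(\\|)([0-9])(\\s)([A-Z][a-z]+[a-z])\\,"), "1\\2\t\\3\\4,"),
    # ... more (pattern, replacement) pairs ...
]

def process_text(input_textlines):
    clean_text = []
    for line in input_textlines:
        # apply every rule to the line, in order
        for pattern, replacement in RULES:
            line = pattern.sub(replacement, line)
        clean_text.append(line)
    return "".join(clean_text)

For files too large to hold in memory, the same order-preserving behavior is available from Pool.imap over a lazily generated sequence of chunks, at the cost of a bit more bookkeeping.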

Regarding python - parallel computing for a large text file, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59185357/
