
python - Best way to replace a list of tokens in a text file

Reposted. Author: 行者123. Updated: 2023-12-03 18:33:53

I have a text file (no punctuation) that is roughly 100MB - 1GB in size. Here are some sample lines:

please check in here
i have a full hd movie
see you again bye bye
press ctrl c to copy text to clipboard
i need your help
...

I also have a list of replacement tokens, like this:
check in -> check_in
full hd -> full_hd
bye bye -> bye_bye
ctrl c -> ctrl_c
...

After the replacements are applied to the text file, the output I want looks like this:
please check_in here
i have a full_hd movie
see you again bye_bye
press ctrl_c to copy text to clipboard
i need your help
...

My current approach

import re

replace_tokens = {'ctrl c': 'ctrl_c', ...} # a python dictionary
for line in open('text_file'):
    for token in replace_tokens:
        # One full regex pass over the line per token
        line = re.sub(r'\b{}\b'.format(token), replace_tokens[token], line)
    # Save line to file

This solution works, but it is very slow when there are many replacement tokens and the text file is large. Is there a better solution?

Best answer

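You can at least eliminate the inner loop by joining all tokens into a single compiled regex, so that each line is scanned once instead of once per token: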

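import re

tokens={"check in":"check_in", "full hd":"full_hd",
        "bye bye":"bye_bye","ctrl c":"ctrl_c"}

# One alternation pattern covering every token, each bounded by \b
regex=re.compile("|".join([r"\b{}\b".format(t) for t in tokens]))

with open(your_file) as f:  # your_file: path to the input text file
    for line in f:
        # Look up the replacement for whichever token matched
        line=regex.sub(lambda m: tokens[m.group(0)], line.rstrip())
        print(line)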

Prints:
please check_in here
i have a full_hd movie
see you again bye_bye
press ctrl_c to copy text to clipboard
i need your help
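
The same idea can be sketched as a reusable helper that also covers the "save line to file" step from the question. This is a minimal sketch under assumptions, not part of the original answer: the names `build_replacer` and `replace_file` are hypothetical, UTF-8 encoding is assumed, tokens are escaped with `re.escape` in case a key ever contains regex metacharacters, and longer tokens are tried first so that a longer key wins when two keys start at the same position.

```python
import re


def build_replacer(tokens):
    """Return a function that applies all replacements in one regex pass."""
    # Longest keys first: alternation picks the first alternative that matches.
    keys = sorted(tokens, key=len, reverse=True)
    # Escape each key so it is matched literally, then bound the group with \b.
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keys)) + r")\b")
    return lambda text: pattern.sub(lambda m: tokens[m.group(0)], text)


def replace_file(src_path, dst_path, tokens):
    """Stream src_path line by line and write the replaced text to dst_path."""
    replace = build_replacer(tokens)
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            # The trailing newline is untouched by the pattern, so write as-is.
            dst.write(replace(line))
```

Reading and writing line by line keeps memory use flat for a file in the 100MB - 1GB range. For a very large token list, other approaches (for example a trie-based matcher) may be worth measuring, but that goes beyond what the answer above claims.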

Regarding "python - Best way to replace a list of tokens in a text file", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62441317/
