Python beginner: a faster way to find and replace in a large file?

Reposted. Author: 太空狗. Updated: 2023-10-29 21:18:55

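I have a file of about 100 million lines in which I want to replace text with alternate text stored in a tab-delimited file. My code works, but it takes about an hour to process the first 70K lines. As I try to improve my Python skills step by step, I am wondering whether there is a faster way to do this. Thanks! The input file looks like this: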

CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:WBGene00045518
CHROMOSOME_IV ncRNA ncRNA 5723085 5723105 . - . Parent=Gene:WBGene00045518

The file with the replacement values looks like this:

WBGene00045518 21ur-5153

Here is my code:

infile1 = open('f1.txt', 'r')
infile2 = open('f2.txt', 'r')
outfile = open('out.txt', 'w')

import re
from datetime import datetime
startTime = datetime.now()

udict = {}
for line in infile1:
    line = line.strip()
    linelist = line.split('\t')
    udict1 = {linelist[0]:linelist[1]}
    udict.update(udict1)

mult10K = []
for x in range(100):
    mult10K.append(x * 10000)
linecounter = 0
for line in infile2:
    for key, value in udict.items():
        matches = line.count(key)
        if matches > 0:
            print key, value
            line = line.replace(key, value)
            outfile.write(line + '\n')
        else:
            outfile.write(line + '\n')
    linecounter += 1
    if linecounter in mult10K:
        print linecounter
        print (datetime.now()-startTime)
infile1.close()
infile2.close()
outfile.close()

Best answer

You should split each line into "words" and look up only those words in your dictionary:

>>> re.findall(r"\w+", "CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:WBGene00045518 CHROMOSOME_IV ncRNA ncRNA 5723085 5723105 . - . Parent=Gene:WBGene00045518")
['CHROMOSOME_IV', 'ncRNA', 'gene', '5723085', '5723105', 'ID', 'Gene', 'WBGene00045518', 'CHROMOSOME_IV', 'ncRNA', 'ncRNA', '5723085', '5723105', 'Parent', 'Gene', 'WBGene00045518']

This eliminates the loop over the whole dictionary that you currently run for every line.

Here is the complete code:

import re

with open("f1.txt", "r") as infile1:
    udict = dict(line.strip().split("\t", 1) for line in infile1)

with open("f2.txt", "r") as infile2, open("out.txt", "w") as outfile:
    for line in infile2:
        for word in re.findall(r"\w+", line):
            if word in udict:
                line = line.replace(word, udict[word])
        outfile.write(line)
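
A possible variation on the code above, not part of the original answer: `str.replace` substitutes every occurrence of the word as a substring, so an ID that also appears inside a longer token on the same line would be rewritten there too. The sketch below uses `re.sub` with a callback instead, so each whole `\w+` token is looked up once and replaced only if it is a key. The function name `replace_words` and the sample data are made up for illustration.

```python
import re

WORD = re.compile(r"\w+")

def replace_words(line, mapping):
    """Replace each whole word token that is a key in `mapping`."""
    # The callback returns the replacement for known tokens and the
    # token itself otherwise, so everything else in the line is untouched.
    return WORD.sub(lambda m: mapping.get(m.group(), m.group()), line)

# Hypothetical sample data modelled on the question
udict = {"WBGene00045518": "21ur-5153"}
line = "CHROMOSOME_IV\tncRNA\tgene\t5723085\t5723105\t.\t-\t.\tID=Gene:WBGene00045518\n"
print(replace_words(line, udict), end="")
```

In the file-processing loop this would stand in for the inner `for word in ...` loop, i.e. `outfile.write(replace_words(line, udict))`.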

Edit: another approach is to build a single giant regular expression from your dictionary:

import re

with open("f1.txt", "r") as infile1:
    udict = dict(line.strip().split("\t", 1) for line in infile1)
# One alternation of all keys, each escaped so it matches literally
regex = re.compile("|".join(map(re.escape, udict)))
with open("f2.txt", "r") as infile2, open("out.txt", "w") as outfile:
    for line in infile2:
        outfile.write(regex.sub(lambda m: udict[m.group()], line))
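
One thing to watch with the single-regex approach: Python tries the alternatives of `a|b` from left to right and takes the first one that matches, so a key that is a prefix of another key can shadow the longer one. A minimal sketch of one way around this, assuming a non-empty mapping, is to sort the keys longest first before joining them. The helper name `build_replacer` and the sample keys are hypothetical.

```python
import re

def build_replacer(mapping):
    """Return a function that replaces every key of `mapping` found in a line.

    Assumes `mapping` is non-empty; an empty pattern would match everywhere.
    """
    # Longest keys first, so a key that is a prefix of another cannot win
    keys = sorted(mapping, key=len, reverse=True)
    regex = re.compile("|".join(map(re.escape, keys)))
    return lambda line: regex.sub(lambda m: mapping[m.group()], line)

# Hypothetical data: the first key is a prefix of the second
udict = {"WBGene0004": "short", "WBGene00045518": "21ur-5153"}
replace = build_replacer(udict)
print(replace("Parent=Gene:WBGene00045518"))
```

Whether this is faster than the word-lookup version depends on the number of keys and on the data, so it is worth timing both on a sample of the real file.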

Regarding "Python beginner: a faster way to find and replace in a large file?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/10249900/
