
Fastest n-needles-to-n-haystacks string replacement in Python

Reposted. Author: 塔克拉玛干. Updated: 2023-11-03 03:59:19

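So here is what I need to do.

Short version:

For every string in list A, replace each occurrence of a substring from list B with the underscored version of that substring.

I have a class called Folder() that holds the data.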

class Folder():
    # (question, answer) pairs
    dataset = [('question sentence', 'multiple word answer'), ...]  # n times

    list_of_answers = ['answer', 'multiple_word_answer', ...]  # n times





def insert_answers(folder):

    temp_dataset = []
    for q, a in folder.dataset:
        for answer in folder.list_of_answers:
            # Only answers with more than one word need underscores
            if len(answer.split()) > 1:
                # Search for the part before any '(' with surrounding whitespace stripped
                answer_split = answer.split('(')[0].strip()
                # Replace it with the full answer, spaces turned into underscores
                answer_ = answer.replace(' ', '_')
                q = q.replace(answer_split, answer_)
        temp_dataset.append([q, a])

    folder.dataset = temp_dataset

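As you can see, this is very slow, because I have about 435,000 question sentences and several thousand answers in list_of_answers.

I need the q,a pairs to stay together.

I was planning to multiprocess this across roughly 144 cores to speed it up, but I would rather find a faster algorithm.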

Example input:

questions=['pablo picasso painted guernica and random occurence of andy warhol so the question makes sense','andy warhol was born on ...']
list_of_answers=['pablo picasso','andy warhol (something)']

Output:

questions=['pablo_picasso painted guernica and random occurence of andy_warhol_(something) so the question makes sense','andy_warhol_(something) was born on ...']

Best answer

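Here is a straightforward implementation using regular expressions. It solves your example test case, but I'm not sure how efficient it will be on your large real data. It also doesn't handle overlapping matches (yet), but you haven't specified how those should be handled.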
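To make the overlapping-match caveat concrete: Python's `re` alternation takes the first alternative that matches at a given position, not the longest one, so the order of the answers in the pattern matters. The sketch below assumes a "longest answer wins" policy, which the question does not actually specify. The helper names `build_finder` and `underscore_all` and the overlapping sample answers are made up for illustration.

```python
import re

def build_finder(answers):
    """Compile one alternation with the longest answers listed first."""
    # Longest first, so 'andy warhol museum' is tried before 'andy warhol'
    ordered = sorted(set(answers), key=len, reverse=True)
    # re.escape keeps characters such as '(' or '.' from being read as regex syntax
    return re.compile(r'\b(?:' + '|'.join(re.escape(a) for a in ordered) + r')\b')

def underscore_all(finder, text):
    """Replace the spaces inside every matched answer with underscores."""
    return finder.sub(lambda m: m.group().replace(' ', '_'), text)

# Hypothetical overlapping answers (not taken from the question)
answers = ['andy warhol', 'andy warhol museum']
finder = build_finder(answers)
result = underscore_all(finder, 'the andy warhol museum is in pittsburgh')
print(result)
```

With the answers left in their original order, the shorter 'andy warhol' would match first and 'museum' would stay separate.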

Test case:

questions=['pablo picasso painted guernica and random occurence of andy warhol so the question makes sense','andy warhol was born on ...']
list_of_answers=['pablo picasso','andy warhol']
desired = ['pablo_picasso painted guernica and random occurence of andy_warhol so the question makes sense','andy_warhol was born on ...']

Solution:

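import re

# One alternation of all answers between word boundaries; re.escape guards regex metacharacters
finder = r'\b(' + '|'.join(map(re.escape, list_of_answers)) + r')\b'

def underscorer(match):
    # Replace the spaces inside the matched answer with underscores
    return match.group().replace(' ', '_')

output = [re.sub(finder, underscorer, question) for question in questions]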

Test:

>>> output == desired
True
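The test case above drops the "(something)" suffix from the question's example. A possible extension, sketched under assumptions, maps the text before any '(' to the full underscored answer, as the question's own loop does, and keeps each (q, a) pair together. `build_replacer` is a hypothetical helper name. Unlike the question's plain `str.replace`, this version matches only on word boundaries, and it assumes every search string starts and ends with a word character. Python's `re` tries the alternatives one after another, so the speed with thousands of answers would need to be measured on the real data.

```python
import re

def build_replacer(list_of_answers):
    """Return a function that underscores all multi-word answers in one pass."""
    mapping = {}  # text to search for -> underscored replacement
    for answer in list_of_answers:
        if len(answer.split()) > 1:
            # Search for the part before any '(' with whitespace stripped
            needle = answer.split('(')[0].strip()
            if needle:
                mapping.setdefault(needle, answer.replace(' ', '_'))
    if not mapping:
        return lambda text: text  # nothing to replace
    # Longest needles first, so longer answers win over their prefixes
    needles = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, needles)) + r')\b')
    return lambda text: pattern.sub(lambda m: mapping[m.group()], text)

questions = ['pablo picasso painted guernica and random occurence of andy warhol so the question makes sense',
             'andy warhol was born on ...']
list_of_answers = ['pablo picasso', 'andy warhol (something)']

replace = build_replacer(list_of_answers)
output = [replace(q) for q in questions]

# Applied to (question, answer) pairs, the pairing is preserved
dataset = [(questions[0], 'pablo picasso'), (questions[1], 'andy warhol (something)')]
new_dataset = [(replace(q), a) for q, a in dataset]
```

The pattern is compiled once and each question is scanned once, rather than calling `str.replace` once per answer for every question.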

A similar question about the fastest n-needles-to-n-haystacks string replacement in Python can be found on Stack Overflow: https://stackoverflow.com/questions/30127355/
