
python - How can I find the best fit subsequences of a large string?

Reposted. Author: 太空狗. Updated: 2023-10-29 21:45:16

Suppose I have a large string and an array of substrings that, when concatenated, equal the large string (with small differences).

For example (note the small differences between the strings):

large_str = ("hello, this is a long string, that may be made up of multiple "
             "substrings that approximately match the original string")

sub_strs = ["hello, ths is a lng strin", ", that ay be mad up of multiple",
            "subsrings tat aproimately ", "match the orginal strng"]

What is the best way to align the strings so as to produce a new set of substrings taken from the original large_str? For example:

["hello, this is a long string", ", that may be made up of multiple",
"substrings that approximately ", "match the original string"]

Additional information

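The use case is locating page breaks in an original text, based on the existing page breaks in text extracted from a PDF document. The text extracted from the PDF has been through OCR and contains small errors compared with the original, but the original text has no page breaks. The goal is to paginate the original text accurately while avoiding the OCR errors in the PDF text.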

Best answer

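  1. Concatenate the substrings
  2. Align the concatenation with the original string
  3. Track which positions in the original string align with the boundaries between the substrings
  4. Split the original string at the positions aligned with those boundaries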
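
Step 2 carries most of the weight, so it may help to see what an alignment looks like on its own. The following is a minimal sketch, assuming difflib's SequenceMatcher is used for the alignment; the short strings `original` and `ocr_text` are illustrative fragments cut from the question's example, and the variable names are not part of the answer.

```python
from difflib import SequenceMatcher

# Illustrative fragments shortened from the question's example strings.
original = "this is a long string"
ocr_text = "ths is a lng strin"

matcher = SequenceMatcher(None, original, ocr_text, autojunk=False)
opcodes = matcher.get_opcodes()

# Each opcode says how original[i1:i2] relates to ocr_text[j1:j2]:
# 'equal', 'replace', 'delete' (only in original) or 'insert' (only in ocr_text).
for tag, i1, i2, j1, j2 in opcodes:
    print(f"{tag:8} original[{i1}:{i2}]={original[i1:i2]!r} "
          f"ocr_text[{j1}:{j2}]={ocr_text[j1:j2]!r}")

# The opcodes tile both strings from start to end, which is what makes it
# possible to carry an offset in one string over to the other.
rebuilt_original = "".join(original[i1:i2] for _, i1, i2, _, _ in opcodes)
rebuilt_ocr = "".join(ocr_text[j1:j2] for _, _, _, j1, j2 in opcodes)
```

Because every character of both strings falls inside exactly one opcode, a boundary offset in the concatenated substrings can always be located in some opcode and translated to an offset in the original.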

An implementation using Python's difflib:

from difflib import SequenceMatcher
from itertools import accumulate

large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"

sub_strs = [
    "hello, ths is a lng strin",
    ", that ay be mad up of multiple",
    "subsrings tat aproimately ",
    "match the orginal strng"]

# Cumulative end offset of each substring within the concatenation
sub_str_boundaries = list(accumulate(len(s) for s in sub_strs))

sequence_matcher = SequenceMatcher(None, large_str, ''.join(sub_strs), autojunk=False)

match_index = 0  # index of the substring currently being rebuilt
matches = [''] * len(sub_strs)

for tag, i1, i2, j1, j2 in sequence_matcher.get_opcodes():
    if tag in ('delete', 'insert', 'replace'):
        # Changed text from large_str stays with the current substring
        matches[match_index] += large_str[i1:i2]
        # Advance through the concatenation, crossing boundaries as needed
        while j1 < j2:
            submatch_len = min(sub_str_boundaries[match_index], j2) - j1
            while submatch_len == 0:
                match_index += 1
                submatch_len = min(sub_str_boundaries[match_index], j2) - j1
            j1 += submatch_len
    else:
        # Equal text: copy it piece by piece, switching substrings at each boundary
        while j1 < j2:
            submatch_len = min(sub_str_boundaries[match_index], j2) - j1
            while submatch_len == 0:
                match_index += 1
                submatch_len = min(sub_str_boundaries[match_index], j2) - j1
            matches[match_index] += large_str[i1:i1+submatch_len]
            j1 += submatch_len
            i1 += submatch_len

print(matches)

Output:

['hello, this is a long string', 
', that may be made up of multiple ',
'substrings that approximately ',
'match the original string']
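
Steps 3 and 4 can also be written by first computing cut offsets in the original string and then slicing it. The following is a minimal sketch under that reading, not the answer's own code: the helper name `split_like` is hypothetical, and its tie-breaking rule (text that was changed or deleted at a boundary stays with the earlier piece) is an assumption chosen to mirror the loop above.

```python
from difflib import SequenceMatcher
from itertools import accumulate


def split_like(original, pieces):
    """Split `original` at the offsets aligned with the boundaries between `pieces`.

    Hypothetical helper for illustration; assumes the pieces are non-empty.
    """
    joined = "".join(pieces)
    # Offsets in `joined` where one piece ends and the next begins.
    boundaries = list(accumulate(len(p) for p in pieces))[:-1]
    opcodes = SequenceMatcher(None, original, joined, autojunk=False).get_opcodes()

    cuts = []
    for b in boundaries:
        cut = len(original)  # fallback if b lies at the very end of `joined`
        for tag, i1, i2, j1, j2 in opcodes:
            if j1 <= b < j2:  # this opcode covers the character joined[b]
                # Inside an equal run the offset carries over directly;
                # otherwise the whole changed span stays with the earlier piece.
                cut = i1 + (b - j1) if tag == "equal" else i2
                break
        cuts.append(cut)

    starts = [0] + cuts
    ends = cuts + [len(original)]
    return [original[s:e] for s, e in zip(starts, ends)]


large_str = ("hello, this is a long string, that may be made up of multiple "
             "substrings that approximately match the original string")
sub_strs = ["hello, ths is a lng strin", ", that ay be mad up of multiple",
            "subsrings tat aproimately ", "match the orginal strng"]

pages = split_like(large_str, sub_strs)
print(pages)
```

Keeping the cut offsets separate from the slicing is convenient for the page-break use case, since the offsets themselves are the page-break positions in the original text. Whatever alignment difflib picks, the returned pieces always concatenate back to the original string, because they are slices between consecutive cuts.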

Regarding python - How can I find the best fit subsequences of a large string?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45990195/
