
Python Regex - Extracting text between (multiple) expressions in a text file

Reposted. Author: 行者123. Updated: 2023-12-01 08:49:58

I am a Python beginner and would be very grateful if you could help me with a text extraction problem.

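I want to extract all the text between two expressions in a text file (the beginning and the end of a letter). Both the beginning and the end of the letter have several possible expressions (defined in the lists "letter_begin" and "letter_end", e.g. "Dear", "to our", etc.). I want to analyze a batch of files; an example of one such text file is shown below. From it I want to extract all the text from "Dear" to "Douglas". If "letter_end" has no match, i.e. no letter_end expression is found, the output should start at letter_begin and run to the very end of the text file being analyzed.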

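Edit: the end of the "recorded text" must come after the "letter_end" match and before the first line of 20 or more characters (here, the line "Random text here as well", with len = 24).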

"""Some random text here
 
Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards
Douglas

Random text here as well"""

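This is my code so far, but it cannot flexibly capture the text between the expressions (anything can come before "letter_begin" and after "letter_end": lines, text, numbers, symbols, etc.).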

import re

letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter
closings = "|".join(letter_end)
regex = r"(?:" + openings + r")\s+.*?" + r"(?:" + closings + r"),\n\S+"


# filename is the path to the text file to analyze
with open(filename, 'r', encoding="utf-8") as infile:
    text = infile.read()
    text = str(text)
    output = re.findall(regex, text, re.MULTILINE|re.DOTALL|re.IGNORECASE) # record all text between Regex (Beginning and End Expressions)
    print(output)

Thank you very much for any help!

Best answer

You can use

regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)

This pattern produces a regex like

(?:dear|to our|estimated)[\s\S]*?(?:sincerely|yours|best regards).*(?:\n.*){0,2}

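See the regex demo. Note that you should not use re.DOTALL with this pattern, and the re.MULTILINE option is also redundant because the pattern contains no ^ or $ anchors.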
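To illustrate the note about re.DOTALL, here is a small sketch that runs the same pattern with and without the flag. The shortened sample text is an assumption made for brevity; the pattern and the expression lists are the ones from this answer.

```python
import re

# Shortened sample text (an assumption for illustration, not the full letter)
text = """Some random text here

Dear Shareholders We
are pleased to provide you with this report.
Best regards
Douglas

Random text here as well"""

openings = "|".join(["dear", "to our", "estimated"])
closings = "|".join(["sincerely", "yours", "best regards"])
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)

# Without re.DOTALL, "." stops at every newline, so the match ends
# at most two lines after the closing expression.
without_dotall = re.findall(regex, text, re.IGNORECASE)

# With re.DOTALL, the ".*" after the closing also matches newlines
# and greedily runs to the end of the whole text.
with_dotall = re.findall(regex, text, re.IGNORECASE | re.DOTALL)

print(without_dotall[0].endswith("Douglas\n"))
print(with_dotall[0].endswith("Random text here as well"))
```

Both checks print True: with the flag set, the trailing random text is pulled into the match, which is why the flag should be left out.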

Details

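  • (?:dear|to our|estimated) - any of the three values
  • [\s\S]*? - any 0 or more characters, as few as possible
  • (?:sincerely|yours|best regards) - any of the three values
  • .* - any 0 or more characters other than line breaks
  • (?:\n.*){0,2} - zero, one or two repetitions of a newline followed by any 0 or more characters other than line breaks.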
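The fixed {0,2} repetition takes at most two lines after the closing. The edit in the question asks instead to stop before the first line of 20 or more characters. One possible variation, which is not part of the original answer, swaps the repetition for a negative lookahead that only consumes a following line while it is shorter than 20 characters. The name regex_short_lines and the shortened sample text are assumptions for illustration.

```python
import re

# Shortened sample text (an assumption for illustration)
text = """Some random text here

Dear Shareholders We
are pleased to provide you with this report.
Best regards
Douglas

Random text here as well"""

letter_begin = ["dear", "to our", "estimated"]
letter_end = ["sincerely", "yours", "best regards"]
openings = "|".join(letter_begin)
closings = "|".join(letter_end)

# (?:\n(?!.{20}).*)* keeps consuming the following lines, but only while
# the next line has fewer than 20 characters; the first longer line stops it.
regex_short_lines = r"(?:{})[\s\S]*?(?:{}).*(?:\n(?!.{{20}}).*)*".format(
    openings, closings
)

matches = re.findall(regex_short_lines, text, re.IGNORECASE)
print(matches)
```

On this sample the match ends after "Douglas" and the empty line that follows it, because "Random text here as well" is 24 characters long and fails the lookahead.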

Python demo code:

import re
text="""Some random text here

Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards
Douglas

Random text here as well"""
letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter
closings = "|".join(letter_end)
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)
print(regex)
print(re.findall(regex, text, re.IGNORECASE))

Output:

['Dear Shareholders We\nare pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.\nBest regards \nDouglas\n']
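The question also asks that, when no letter_end expression is found, the output should run from the opening to the very end of the text. The pattern above returns no match in that case. A minimal sketch of one way to cover it is to add \Z (end of string) as an alternative to the closing part. The function name extract_letter, the compiled LETTER_RE, the use of re.escape and the two sample strings are assumptions for illustration, not part of the original answer.

```python
import re

letter_begin = ["dear", "to our", "estimated"]
letter_end = ["sincerely", "yours", "best regards"]
# re.escape guards against regex metacharacters in the list entries
openings = "|".join(re.escape(s) for s in letter_begin)
closings = "|".join(re.escape(s) for s in letter_end)

# Either stop up to two lines after the first closing expression,
# or, if no closing exists, stop at the very end of the text (\Z).
LETTER_RE = re.compile(
    r"(?:{})[\s\S]*?(?:(?:{}).*(?:\n.*){{0,2}}|\Z)".format(openings, closings),
    re.IGNORECASE,
)

def extract_letter(text):
    """Return the first letter-like span in text, or None if no opening is found."""
    match = LETTER_RE.search(text)
    return match.group(0) if match else None

# Hypothetical sample inputs
with_closing = "Intro\nDear All\nbody line\nSincerely\nAnn\n\nA long trailing line of random text"
without_closing = "Intro\nDear All\nbody line that never gets a sign-off"

print(repr(extract_letter(with_closing)))
print(repr(extract_letter(without_closing)))
```

Because the lazy [\s\S]*? tries the closing alternative at every position before \Z can succeed, a text that does contain a closing still stops there; only a text without one falls through to the end. To process a batch of files, the same function could be called on the contents of each file in turn.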

This post is based on a similar question found on Stack Overflow, "Python Regex - Extracting text between (multiple) expressions in a text file": https://stackoverflow.com/questions/53169493/
