
python - Regular expression to extract names starting with Mr.|Mrs|The|DR

Reposted. Author: 行者123. Updated: 2023-12-03 16:56:36

I am trying to write a regular expression that identifies names starting with MR|MS|THE|DR.
For example:

      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
1 VIKRAM NATH,HONOURABLE MR. JUSTICE 1 1 0 3 5
J.B.PARDIWALA
HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
2 VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M. 0 1 0 0 1
PANCHOLI
HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
3 VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH 107 4 10 6 127
J. SHASTRI
So the output should be
[THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH, MR. JUSTICE J.B.PARDIWALA]
[THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH, MR. JUSTICE VIPUL M. PANCHOLI]
and so on
But I get
THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH 
MR. JUSTICE 1 1 0 3 5
J.B.PARDIWALA
I tried \s*HONOURABLE\s+(?=THE|MR|MS|DR)([^/\[\]\n]*)
HONOURABLE can be repeated any number of times.
Any help would be appreciated.
Thanks in advance!

Best answer

Bounty answer
You can use

import re
text = """ HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
1 VIKRAM NATH,HONOURABLE MR. JUSTICE 1 1 0 3 5
J.B.PARDIWALA
HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
2 VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M. 0 1 0 0 1
PANCHOLI
HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
3 VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH 107 4 10 6 127
J. SHASTRI"""
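# Strip digits, spaces and tabs from the start and end of every line
text = re.sub(r'^[\d \t]+|[\d \t]+$', '', text, flags=re.M)
#print(text)
# Each match starts at a line beginning with HONOURABLE and runs up to the next such line
m = re.findall(r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE\b).*)*)', text, re.M)
for x in m:
    print(x.replace('\n',' '))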
Output:
THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE J.B.PARDIWALA
THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M. PANCHOLI
THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH J. SHASTRI
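Details:
  • re.sub(r'^[\d \t]+|[\d \t]+$', '', text, flags=re.M) removes all spaces, tabs and digits from the start and end of each line in the text.
  • r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE\b).*)*)' is the regex that matches the following in the "trimmed" text:
    • ^ - start of a line
    • HONOURABLE - the word HONOURABLE
    • \s+ - one or more whitespace characters
    • (.*(?:\n(?!HONOURABLE\b).*)*) - capturing group 1:
      • .* - the rest of the line
      • (?:\n(?!HONOURABLE\b).*)* - zero or more lines that do not start with HONOURABLE as a whole word.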
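The question asks for each record as a list of separate names, while the code above returns each record as one string. A minimal sketch of one way to get there, assuming names inside a record are always separated by ",HONOURABLE": run the same two steps, then split every record at that separator. The function name extract_benches and the shortened sample are hypothetical, not part of the original answer.

```python
import re

raw = """ HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
1 VIKRAM NATH,HONOURABLE MR. JUSTICE 1 1 0 3 5
J.B.PARDIWALA
HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
2 VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M. 0 1 0 0 1
PANCHOLI"""


def extract_benches(text):
    """Hypothetical helper: return one list of names per record."""
    # Step 1: strip digits, spaces and tabs from both ends of every line.
    cleaned = re.sub(r'^[\d \t]+|[\d \t]+$', '', text, flags=re.M)
    # Step 2: a record starts at a line beginning with HONOURABLE and
    # continues over the following lines that do not start with it.
    records = re.findall(
        r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE\b).*)*)', cleaned, re.M
    )
    # Step 3 (assumption): names within a record are separated by ",HONOURABLE".
    return [
        re.split(r'\s*,\s*HONOURABLE\s+', record.replace('\n', ' '))
        for record in records
    ]


benches = extract_benches(raw)
for names in benches:
    print(names)
```

If a name can itself contain a comma followed by HONOURABLE, the split in step 3 would need a stricter separator.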


  • Original answer
    You can use
    \bHONOURABLE\s+((?:THE|MR|MS|DR)[^,]*)
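    If you do not want newlines in the resulting list items, you can replace them afterwards with .replace('\n', ' '). If you want to limit the right-hand boundary of the match at [, ] or /, add them to the negated character class: change [^,] to [^][/,].
    Details:
  • \bHONOURABLE - the whole word HONOURABLE
  • \s+ - one or more whitespace characters
  • ((?:THE|MR|MS|DR)[^,]*) - capturing group 1: THE, MR, MS or DR followed by zero or more characters other than a comma.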
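To make the effect of the extended character class concrete, here is a small sketch comparing [^,] with [^][/,]. The sample string is invented for illustration and is not taken from the question's data.

```python
import re

# Hypothetical input: one name is followed by a bracketed note,
# and two names are separated by a slash instead of a comma.
sample = ("HONOURABLE MR. JUSTICE A. KUMAR [PRESIDING],"
          "HONOURABLE DR. B. RAO/HONOURABLE MS. C. DAS")

# Stops only at a comma, so brackets and slashes end up inside the match.
plain = re.findall(r"\bHONOURABLE\s+((?:THE|MR|MS|DR)\b[^,]*)", sample)

# Also stops at ], [ and /. A leading ] inside a class is a literal ].
bounded = [
    name.strip()
    for name in re.findall(r"\bHONOURABLE\s+((?:THE|MR|MS|DR)\b[^][/,]*)", sample)
]

print(plain)    # ['MR. JUSTICE A. KUMAR [PRESIDING]', 'DR. B. RAO/HONOURABLE MS. C. DAS']
print(bounded)  # ['MR. JUSTICE A. KUMAR', 'DR. B. RAO', 'MS. C. DAS']
```

The strip() call is there because the match can end with the space that precedes the bracket.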

  • Python demo:
    import re
    # \b after the alternation stops THE/MR/MS/DR from matching the start of a longer word
    rx = r"\bHONOURABLE\s+((?:THE|MR|MS|DR)\b[^,]*)"
    text = "HONOURABLE THE CHIEF JUSTICE MR. JUSTICE\nVIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH\nJ. SHASTRI, HONOURABLE MS. ADITI GUPTA"
    m = re.findall(rx, text)
    # Replace newlines with a space so that wrapped names are not glued together
    print([x.replace('\n', ' ') for x in m])
    Output:
    ['THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH', 'MR. JUSTICE ASHUTOSH J. SHASTRI', 'MS. ADITI GUPTA']

    Regarding python - Regular expression to extract names starting with Mr.|Mrs|The|DR, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/66046399/
