
python - Regex to match date ranges that contain month names

Reposted. Author: 太空宇宙. Updated: 2023-11-04 01:50:41

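I need to match a string to determine whether it is a valid date range. The string can contain the month as text and the year as digits, in no particular order (there is no fixed format such as MM-YYYY-DD).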

A valid string could be:

February 2016 - March 2019

September 2015 to August 2019

April 2015 to present

September 2018 - present

Invalid strings:

George Mason University august 2019

Stratusburg university February 2018

Some text and month followed by year

I have already looked at questions such as:

a) Constructing Regular Expressions to match numeric ranges

b) Regex to match month name followed by year

and many others, but most of the input strings in those questions seem to follow a fixed month-and-year pattern, which mine do not.

I tried this regex in Python:

import re

pat = r"(\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?(\d{1,2}(st|nd|rd|th)?)?(([,.\-\/])\D?)?((19[7-9]\d|20\d{2})|\d{2})*"

st = "University of Pennsylvania February 2018"

re.search(pat, st)

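But I need to tell the valid strings from the invalid ones in my examples, and I want to keep the invalid strings out of my final output.

For the input "University of Pennsylvania February 2018", the expected output should be False.

For "February 2018 - present", the output must be True.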
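The mismatch can be reproduced with a short check. This is only a sketch of the problem: the names `samples` and `results` are made up for illustration. Because the month name is the only mandatory part of the pattern, `re.search` finds a match in any string that contains a month, whether or not it is a range.

```python
import re

# Pattern copied from the attempt above.
pat = r"(\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?(\d{1,2}(st|nd|rd|th)?)?(([,.\-\/])\D?)?((19[7-9]\d|20\d{2})|\d{2})*"

# One string that should be rejected and one that should be accepted.
samples = [
    "University of Pennsylvania February 2018",
    "February 2018 - present",
]

# re.search only needs the month name to succeed, so both strings match.
results = {s: re.search(pat, s) is not None for s in samples}
for s, matched in results.items():
    print(matched, '<-', s)
```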

Best answer

This regex validates date ranges of the form MONTH YEAR (MONTH YEAR | PRESENT), with any separator between the two parts:

import re
# the first line deliberately holds two valid ranges, just to make the test harder
text = """
February 2016 - March 2019 February 2017 - March 2019
September 2015 to August 2019
April 2015 to present
September 2018 - present
George Mason University august 2019
Stratusburg university February 2018
Some text and month followed by year
"""
# Writing the regex on a single line would make it very ugly, so build it from parts
MONTHS_RE = ['Jan(?:uary)?', 'Feb(?:ruary)?', 'Mar(?:ch)?', 'Apr(?:il)?', 'May', 'Jun(?:e)?', 'Jul(?:y)?',
             'Aug(?:ust)?', 'Sep(?:tember)?', 'Oct(?:ober)?', '(?:Nov|Dec)(?:ember)?']
# Match a month name and capture it: (Jan(?:uary)?|Feb(?:ruary)?|...|(?:Nov|Dec)(?:ember)?)
RE_MONTH = '({})'.format('|'.join(MONTHS_RE))
# Match a month followed by a year of 2 to 4 digits; used twice in the final regex
RE_DATE = r'{RE_MONTH}(?:[\s]+)(\d{{2,4}})'.format(RE_MONTH=RE_MONTH)
# Final regex: a date, any separator, then either a second date or "present"
RE_VALID_RANGE = re.compile('{RE_DATE}.+?(?:{RE_DATE}|(present))'.format(RE_DATE=RE_DATE), flags=re.IGNORECASE)


# Use this if you want to collect both the valid and the invalid lines
valid_ranges = []
invalid_ranges = []
for line in text.split('\n'):
    if line:
        groups = re.findall(RE_VALID_RANGE, line)
        if groups:
            # groups holds every valid range on the line (1 or 2 here),
            # and each entry has 5 elements because the regex has 5 capturing groups.
            # Either M2 and Y2 are filled and present is empty, or the reverse,
            # because of the alternation (?:{RE_DATE}|(present)).
            M1, Y1, M2, Y2, present = groups[0]  # loop over groups to check every range on the line
            valid_ranges.append(line)
        else:
            invalid_ranges.append(line)

print('VALID: ', valid_ranges)
print('INVALID:', invalid_ranges)


# finditer yields only the valid ranges; a line with two ranges yields two matches.
# If you need to classify whole lines, this is not what you want.
valid_ranges = []
for match in re.finditer(RE_VALID_RANGE, text):
    # Unpack the groups if you want to check the range values
    M1, Y1, M2, Y2, present = match.groups()
    valid_ranges.append(match.group(0))  # the matched text of the range
print('VALID USING <finditer>: ', valid_ranges)

Output:

VALID:  ['February 2016 - March 2019 February 2017 - March 2019', 'September 2015 to August 2019', 'April 2015 to present', 'September 2018 - present']
INVALID: ['George Mason University august 2019', 'Stratusburg university February 2018', 'Some text and month followed by year']
VALID USING <finditer>: ['February 2016 - March 2019', 'February 2017 - March 2019', 'September 2015 to August 2019', 'April 2015 to present', 'September 2018 - present']

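I hate writing a long regex in a single str variable. I prefer to break it up so that I can still understand what it does when I read my code six months later. Note how finditer splits the first line into two valid range strings.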
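To get the plain True/False result asked for in the question, one option is to anchor the pattern to the whole string with `re.fullmatch` instead of searching inside it. The sketch below is not part of the original answer: the helper name `is_date_range` and the separator set (a hyphen or the word "to") are assumptions, and it does not capture any groups.

```python
import re

# Month alternatives, each with an optional long-form suffix.
MONTHS = [
    'Jan(?:uary)?', 'Feb(?:ruary)?', 'Mar(?:ch)?', 'Apr(?:il)?', 'May',
    'Jun(?:e)?', 'Jul(?:y)?', 'Aug(?:ust)?', 'Sep(?:tember)?',
    'Oct(?:ober)?', '(?:Nov|Dec)(?:ember)?',
]
MONTH = '(?:{})'.format('|'.join(MONTHS))
# A month name, whitespace, then a year of 2 to 4 digits.
DATE = r'{}\s+\d{{2,4}}'.format(MONTH)
# Assumed separators: "-" or "to". The end is a second date or "present".
RANGE = re.compile(
    r'\s*{d}\s*(?:-|to)\s*(?:{d}|present)\s*'.format(d=DATE),
    flags=re.IGNORECASE,
)


def is_date_range(s):
    """Return True only if the whole string is a single date range."""
    return RANGE.fullmatch(s) is not None


print(is_date_range('February 2018 - present'))
print(is_date_range('University of Pennsylvania February 2018'))
```

With `fullmatch`, any leading text such as a university name makes the string invalid, and so does a line that holds two ranges. If either case should count as valid, the search-based loop above is the better fit.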

If you only want to extract the ranges, you can use this:

valid_ranges = re.findall(RE_VALID_RANGE, text)

But this returns the groups [(M1, Y1, M2, Y2, present), ...] instead of the text:

[('February', '2016', 'March', '2019', ''), ('February', '2017', 'March', '2019', ''), ('September', '2015', 'August', '2019', ''), ('April', '2015', '', '', 'present'), ('September', '2018', '', '', 'present')]
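The group tuples are useful if the range values should be checked further, for instance to confirm that the start does not come after the end. The following is a minimal sketch, not part of the original answer: the names `MONTH_NUMBERS`, `month_index` and `is_chronological` are made up for illustration, and it assumes four-digit years.

```python
# Map the first three letters of a month name to its number (1-12).
MONTH_NUMBERS = {
    name: number
    for number, name in enumerate(
        ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
         'jul', 'aug', 'sep', 'oct', 'nov', 'dec'],
        start=1,
    )
}


def month_index(name):
    """Return the month number for a short or long month name."""
    return MONTH_NUMBERS[name[:3].lower()]


def is_chronological(m1, y1, m2, y2, present):
    """Return True if the start of the range is not after its end."""
    if present:
        # An open-ended range ("... to present") has no end date to compare.
        return True
    # Assumes four-digit years; two-digit years would need a century rule.
    return (int(y1), month_index(m1)) <= (int(y2), month_index(m2))


# Tuples in the shape returned by re.findall above.
groups = [
    ('February', '2016', 'March', '2019', ''),
    ('April', '2015', '', '', 'present'),
]
checks = [is_chronological(*g) for g in groups]
print(checks)
```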

This page on python - regex to match date ranges that contain month names is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/58108251/
