python - Failed to extract certain date formats using Dateutil in Python

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:46:08

I went through multiple links before posting this question, so please read it carefully. These are the two answers that solved 90% of my problem:

parse multiple dates using dateutil

How to parse multiple dates from a block of text in Python (or another language)

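Problem: I need to parse multiple dates in multiple formats in Python.

Solution from the links above: I can do this, but there are still certain formats I cannot handle.

The formats that still fail to parse are:

  1. text = 'I want to visit from May 16-May 18'

  2. text = 'I want to visit from May 16-18'

  3. text = 'I want to visit from May 6 May 18'

I also tried regular expressions, but since the dates can come in any format I ruled that option out, because the code was getting very complicated. So please suggest modifications to the code provided at the links so that it can also handle the 3 formats above.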

Best answer

This kind of problem will always need tweaking for new edge cases, but the following approach is fairly robust:

from itertools import groupby, zip_longest
from datetime import datetime
import calendar
import string
import re


def get_date_part(x):
    """Return x if it is a month name, its digits if it is a number, else False."""
    if x.lower() in month_list:
        return x

    # Leading digits with an optional ordinal suffix, e.g. '16', '2nd', '27th'
    day = re.match(r'(\d+)(\b|st|nd|rd|th)', x, re.I)

    if day:
        return day.group(1)

    return False


def month_full(month):
    """Normalize a full or abbreviated month name to its abbreviation."""
    try:
        return datetime.strptime(month, '%B').strftime('%b')
    except ValueError:
        return datetime.strptime(month, '%b').strftime('%b')


tests = [
    'I want to visit from May 16-May 18',
    'I want to visit from May 16-18',
    'I want to visit from May 6 May 18',
    'May 6,7,8,9,10',
    '8 May to 10 June',
    'July 10/20/30',
    'from June 1, july 5 to aug 5 please',
    '2nd March to the 3rd January',
    '15 march, 10 feb, 5 jan',
    '1 nov 2017',
    '27th Oct 2010 until 1st jan',
    '27th Oct 2010 until 1st jan 2012'
]

# Year assumed when the text does not mention one
cur_year = 2017

# Full and abbreviated month names, lower-cased (empty entries dropped)
month_list = [m.lower() for m in list(calendar.month_name) + list(calendar.month_abbr) if len(m)]
# Translation table that turns every punctuation character into a space
remove_punc = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

for date in tests:
    date_parts = [get_date_part(part) for part in date.translate(remove_punc).split() if get_date_part(part)]

    days = []
    months = []
    years = []

    # Sorting first puts the months ahead of the numbers, so groupby yields at most two groups
    for k, g in groupby(sorted(date_parts, key=lambda x: x.isdigit()), lambda y: not y.isdigit()):
        values = list(g)

        if k:
            months = [month_full(v) for v in values]
        else:
            for v in values:
                if 1900 <= int(v) <= 2100:
                    years.append(int(v))
                else:
                    days.append(v)

    if days and months:
        if years:
            # Missing months, days or years are padded with the first year found
            dates_raw = [datetime.strptime('{} {} {}'.format(m, d, y), '%b %d %Y') for m, d, y in zip_longest(months, days, years, fillvalue=years[0])]
        else:
            # Missing months or days are padded with the first month found
            dates_raw = [datetime.strptime('{} {} {}'.format(m, d, cur_year), '%b %d %Y') for m, d in zip_longest(months, days, fillvalue=months[0])]
            years = [cur_year]

        # Fix for jumps in year
        dates = []
        start_date = datetime(years[0], 1, 1)
        next_year = years[0] + 1

        for d in dates_raw:
            if d < start_date:
                d = d.replace(year=next_year)
                next_year += 1
            start_date = d
            dates.append(d)

        print("{} -> {}".format(date, ', '.join(d.strftime("%d/%m/%Y") for d in dates)))

This converts the test strings as follows:

I want to visit from May 16-May 18 -> 16/05/2017, 18/05/2017
I want to visit from May 16-18 -> 16/05/2017, 18/05/2017
I want to visit from May 6 May 18 -> 06/05/2017, 18/05/2017
May 6,7,8,9,10 -> 06/05/2017, 07/05/2017, 08/05/2017, 09/05/2017, 10/05/2017
8 May to 10 June -> 08/05/2017, 10/06/2017
July 10/20/30 -> 10/07/2017, 20/07/2017, 30/07/2017
from June 1, july 5 to aug 5 please -> 01/06/2017, 05/07/2017, 05/08/2017
2nd March to the 3rd January -> 02/03/2017, 03/01/2018
15 march, 10 feb, 5 jan -> 15/03/2017, 10/02/2018, 05/01/2019
1 nov 2017 -> 01/11/2017
27th Oct 2010 until 1st jan -> 27/10/2010, 01/01/2011
27th Oct 2010 until 1st jan 2012 -> 27/10/2010, 01/01/2012
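
The year jumps in this output (for example, `15 march, 10 feb, 5 jan` ending in 2019) come from the "Fix for jumps in year" loop. As a minimal sketch, here is that step isolated into a function for Python 3. The name `roll_years` and its signature are hypothetical and not part of the original answer.

```python
from datetime import datetime


def roll_years(dates_raw, first_year):
    """Push any date that precedes its predecessor into a later year.

    Hypothetical helper mirroring the year-jump loop in the answer.
    """
    dates = []
    start_date = datetime(first_year, 1, 1)
    next_year = first_year + 1

    for d in dates_raw:
        if d < start_date:
            # Out of order: assume the writer meant a later year.
            d = d.replace(year=next_year)
            next_year += 1
        start_date = d
        dates.append(d)

    return dates


raw = [datetime(2017, 3, 15), datetime(2017, 2, 10), datetime(2017, 1, 5)]
rolled = roll_years(raw, 2017)
print([d.strftime("%d/%m/%Y") for d in rolled])
# -> ['15/03/2017', '10/02/2018', '05/01/2019']
```

Note that `next_year` advances on every jump, so each out-of-order date lands one year later than the previous jump rather than one year after its own parsed year.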

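It works as follows:

  1. First create a list of valid month names, i.e. both the full and the abbreviated names.

  2. Build a translation table that makes it quick and easy to replace any punctuation in the text with spaces.

  3. Split the text and extract only the date parts, using a function with a regular expression to spot days or months.

  4. Sort the list by whether each part is a digit, which groups the months at the front and the numbers at the end.

  5. Pair the months with the days, reusing the first month (or year) when one list is shorter. Normalize each month name to a single form (the code uses the abbreviation, e.g. August to Aug) and convert each pair into a datetime object.

  6. If a date is earlier than the previous one, push it into a following year.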
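
Steps 1 to 4 can be tried on their own. The following is a minimal Python 3 sketch under the same assumptions as the answer; the helper name `split_parts` and the constants `MONTHS` and `REMOVE_PUNC` are hypothetical names introduced only for illustration.

```python
import calendar
import re
import string
from itertools import groupby

# Step 1: full and abbreviated month names, lower-cased ('' entries dropped).
MONTHS = {m.lower() for m in list(calendar.month_name) + list(calendar.month_abbr) if m}
# Step 2: translation table mapping every punctuation character to a space.
REMOVE_PUNC = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def split_parts(text):
    """Return (months, numbers) found in text; hypothetical helper."""
    parts = []
    # Step 3: keep only month names and numbers (with optional ordinal suffix).
    for word in text.translate(REMOVE_PUNC).split():
        if word.lower() in MONTHS:
            parts.append(word)
        else:
            match = re.match(r'(\d+)(\b|st|nd|rd|th)', word, re.I)
            if match:
                parts.append(match.group(1))

    # Step 4: a stable sort puts non-digits (months) first, so groupby
    # sees at most two runs: months, then numbers.
    parts.sort(key=str.isdigit)
    months, numbers = [], []
    for is_month, group in groupby(parts, key=lambda p: not p.isdigit()):
        (months if is_month else numbers).extend(group)
    return months, numbers


print(split_parts('from June 1, july 5 to aug 5 please'))
# -> (['June', 'july', 'aug'], ['1', '5', '5'])
```

The sort matters because `groupby` only merges consecutive items; without it, interleaved input such as `June 1, july 5` would produce several small groups instead of one group of months and one of numbers.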

Regarding python - Failed to extract certain date formats using Dateutil in Python, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/46220123/
