python - Extracting dates in different formats with regular expressions and sorting them - pandas

Reposted. Author: 太空宇宙. Updated: 2023-11-04 06:48:15

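I am new to text mining, and I need to extract dates from a *.txt file and sort them. The dates are embedded in sentences (one per line), and they may appear in any of the following formats: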

04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010

If the day is missing, assume the 1st; if the month is missing, assume January.
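As an illustration of these two defaults, here is a minimal sketch for the purely numeric shapes only (`2009`, `6/2008`, `4/3/2009`). The helper name `fill_missing_parts` is hypothetical and not part of the original question, and the sketch assumes four-digit years.

```python
from datetime import date

def fill_missing_parts(text):
    """Hypothetical helper: apply the defaults to 'yyyy', 'm/yyyy' or 'm/d/yyyy'."""
    parts = [int(p) for p in text.split('/')]
    if len(parts) == 1:
        # Year only: missing month -> January, missing day -> 1st
        year, month, day = parts[0], 1, 1
    elif len(parts) == 2:
        # Month and year: missing day -> 1st
        month, year = parts
        day = 1
    else:
        # Month, day and year: nothing is missing
        month, day, year = parts
    return date(year, month, day)

print(fill_missing_parts('2009'))
print(fill_missing_parts('6/2008'))
print(fill_missing_parts('4/3/2009'))
```

Two-digit years such as `4/3/09` would need an extra rule for choosing the century, which this sketch leaves out.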

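My idea is to extract all the dates and convert them to mm/dd/yyyy format, but I am unsure how to find and replace the patterns. This is what I have done so far: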

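import re

import pandas as pd

# Read the file line by line
doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)

df2 = pd.DataFrame(df, columns=['text'])

def myfunc(x):
    # Normalise one matched numeric date string to mm/dd/yyyy
    if len(x) == 4:
        # Year only: assume January 1st
        x = '01/01/' + x
    else:
        if not re.search('/', x):
            # Use '/' as the only separator
            x = re.sub('[-]', '/', x)
        terms = re.split('/', x)
        if len(terms) == 2:
            # Month and year only: assume the 1st
            if len(terms[-1]) == 2:
                # Two-digit year: assumed to be 19xx
                x = terms[0].zfill(2) + '/01/19' + terms[-1]
            else:
                x = terms[0].zfill(2) + '/01/' + terms[-1]
        elif len(terms[-1]) == 2:
            # Full date with a two-digit year: assumed to be 19xx
            x = terms[0].zfill(2) + '/' + terms[1].zfill(2) + '/19' + terms[-1]
    return x

# A callable replacement requires regex=True
df2['text'] = df2.text.str.replace(r'(((?:\d+[/-])?\d+[/-]\d+)|\d{4})', lambda x: myfunc(x.groups('Date')[0]), regex=True)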

I have only handled the numeric date formats. I am somewhat confused about how to handle the alphanumeric dates.
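One possible way to approach the alphanumeric formats, sketched under assumptions: split an already extracted date string into word and number tokens, look the month up by its first three letters, and ignore ordinal suffixes such as `th`. The names `MONTHS` and `parse_worded_date` are hypothetical, and the sketch assumes the year comes last and has four digits.

```python
import re
from datetime import date

# Map three-letter month prefixes to month numbers
MONTHS = {name: i for i, name in enumerate(
    ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
     'jul', 'aug', 'sep', 'oct', 'nov', 'dec'], start=1)}

def parse_worded_date(text):
    """Hypothetical helper: parse e.g. 'Mar 20th, 2009', '20 March 2009', 'Feb 2009'."""
    # Split into alphabetic and numeric tokens; punctuation is dropped
    tokens = re.findall(r'[A-Za-z]+|\d+', text)
    month = None
    numbers = []
    for tok in tokens:
        if tok.isdigit():
            numbers.append(int(tok))
        elif tok[:3].lower() in MONTHS:
            month = MONTHS[tok[:3].lower()]
        # Any other alphabetic token (st, nd, rd, th) is ignored
    if month is None or not numbers:
        raise ValueError(f'no month name or year found in {text!r}')
    year = numbers[-1]                            # assumption: the year comes last
    day = numbers[0] if len(numbers) > 1 else 1   # missing day -> 1st
    return date(year, month, day)

print(parse_worded_date('Mar 20th, 2009'))
print(parse_worded_date('20 March, 2009'))
print(parse_worded_date('Feb 2009'))
```

Looking the month up in a dictionary avoids depending on the system locale, which `strptime` with `%b` or `%B` would do.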

I know this code is rough, but it is what I have so far.

Best answer

I think this is one of the Coursera text mining assignments. You can get a solution with regular expressions and str.extract. The input file is dates.txt.

import pandas as pd

doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)

def date_sorter():
    # Dates with the month written as a word (expand=False returns a Series instead of a DataFrame)
    one = df.str.extract(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})', expand=False)
    # Fully numeric dates
    two = df.str.extract(r'((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4}))', expand=False)
    # Dates with no day: month and year, or year only
    three = df.str.extract(r'((?:\d{1,2}(?:-|\/))?\d{4})', expand=False)
    # Fill the NaNs in one from two, then from three; fix two misspelled
    # month names in the text file; then convert to datetime.
    # On pandas 2.0 and later, to_datetime may need format='mixed' because the strings use several formats.
    dates = pd.to_datetime(one.fillna(two).fillna(three).replace('Decemeber', 'December', regex=True).replace('Janaury', 'January', regex=True))
    return pd.Series(dates.sort_values())

date_sorter()

Output:

9     1971-04-10
84    1971-05-18
2     1971-07-08
53    1971-07-11
28    1971-09-12
474   1972-01-01
153   1972-01-13
13    1972-01-26
129   1972-05-06
98    1972-05-13
111   1972-06-10
225   1972-06-15
31    1972-07-20
171   1972-10-04
191   1972-11-30
486   1973-01-01
335   1973-02-01
415   1973-02-01
36    1973-02-14
405   1973-03-01
323   1973-03-01
422   1973-04-01
375   1973-06-01
380   1973-07-01
345   1973-10-01
57    1973-12-01
481   1974-01-01
436   1974-02-01
104   1974-02-24
299   1974-03-01

If you only want to return the index (the original line numbers in sorted order), use return pd.Series(dates.sort_values().index)
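To show what the index variant returns, here is a small self-contained sketch on three made-up dates (the sample values are an assumption for illustration, not taken from dates.txt):

```python
import pandas as pd

# Three sample dates; their positions 0, 1, 2 play the role of line numbers
dates = pd.to_datetime(pd.Series(['1972-01-13', '1971-04-10', '1971-07-08']))

# Sorting keeps each value's original index label attached
ordered = dates.sort_values()

# Wrapping the index in a Series gives the line numbers in chronological order
order = pd.Series(ordered.index)
print(order.tolist())  # [1, 2, 0]
```

The result lists the original positions from the earliest date to the latest, rather than the dates themselves.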

Breakdown of the first regular expression

(?:...) is a non-capturing group. The outer (...) captures the whole date.

(                   # start of the capturing group
(?:\d{,2}\s)?       # optional leading day: up to two digits followed by a space; the trailing ? makes the whole group occur once or not at all
(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*   # a month abbreviation followed by any number (*) of lowercase letters, e.g. Jan, December
(?:-|\.|\s|,)       # exactly one separator: hyphen, period, whitespace or comma
\s?                 # optional whitespace (the ? applies only to the preceding \s)
\d{,2}[a-z]*        # up to two digits followed by any number (*) of letters, e.g. 1st, 13th, 22nd
(?:-|,|\s)?         # optional separator: hyphen, comma or whitespace
\s?                 # optional whitespace, at most one character
\d{2,4}             # the year: two to four digits
)                   # end of the capturing group

Hope this helps.

Regarding "python - Extracting dates in different formats with regular expressions and sorting them - pandas", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46064162/
