gpt4 book ai didi

python - pandas 字符串以列的值结尾,然后将字符串的开头转换为日期进行比较

转载 作者:行者123 更新时间:2023-12-01 09:05:17 25 4
gpt4 key购买 nike

我有以下df

cluster_id    amount    inv_id            inv_date
1 309.9 07121830990 2018-07-12
1 309.9 07121830990 2018-07-12
2 3130.0 20180501313000B 2018-05-01
2 3130.0 20180501313000B 2018-05-01
3 3330.50 201804253330.50 2018-04-25
3 3330.50 201804253330.50 2018-04-25
4 70.0 61518 2018-06-15
4 70.0 61518 2018-06-15
5 100.0 011318 2018-01-13
5 100.0 011318 2018-01-13
6 50.0 12202017 2017-12-20
6 50.0 12202017 2017-12-20
7 101.0 0000014482 2017-10-01
7 101.0 0000014482 2017-10-01

我想通过 groupby cluster_id 创建一个 bool 列 dummy_inv_id,并将 dummy_invoice_id 设置为 如果对于每个组,则为 True

1. inv_id (stripped non-numerics) ends with amount and the remaining part of inv_id can be coerced into a valid date which is +/- 180 days of the inv_date

2. inv_id (stripped non-numerics) can be coerced into a date which is +/- 180 days of the inv_date 

首先,我将从 inv_idgroupby cluster_id 中删除所有非数字字符

df['inv_id_stp'] = df.inv_id.str.replace(r'\D+', '')
grouped = df.groupby('cluster_id')

然后将amount * 100转为字符串,方便匹配

df['amount'] = df['amount']*100
df['amt_str'] = df['amount'].apply(str)

例如309.9'30990'3130.0'313000',这里我想知道如何检查 inv_id 这里以 amount 结尾,然后如何检查 inv_id 的剩余部分是否可以转换为 datetime 并在inv_date 的+/-180 天,或者inv_id 可以直接转换为日期。特别是有一些日期格式,即

071218 - 2018-07-12
20180501 - 2018-05-01
61518 - 2018-06-15
12202017 - 2017-12-20
0000014482 - cannot be converted to date

结果df看起来像,

cluster_id    amount    inv_id            inv_date      dummy_inv_id
1 309.9 07121830990 2018-07-12 True
1 309.9 07121830990 2018-07-12 True
2 3130.0 20180501313000B 2018-05-01 True
2 3130.0 20180501313000B 2018-05-01 True
3 3330.50 201804253330.50 2018-04-25 True
3 3330.50 201804253330.50 2018-04-25 True
4 70.0 61518 2018-06-15 True
4 70.0 61518 2018-06-15 True
5 100.0 011318 2018-01-13 True
5 100.0 011318 2018-01-13 True
6 50.0 12202017 2017-12-20 True
6 50.0 12202017 2017-12-20 True
7 101.0 0000014482 2017-10-01 False
7 101.0 0000014482 2017-10-01 False

最佳答案

想法是创建辅助字典,其中包含可能的日期时间格式以及用于切片和列表理解转换的字母数量 - errors='coerce' 为不匹配创建 NaT值:

from functools import reduce

#add zeros to length 6
s = df.inv_id.str.replace(r'\D+', '').str.zfill(6)

formats = {'%m%d%y':6,
'%y%m%d':6,
'%Y%m%d':8,
'%m%d%Y':8}
L = [pd.to_datetime(s.str[:v], format=k, errors='coerce') for k,v in formats.items()]

但某些格式应该转换不好,因此这些超出范围的日期时间会转换为 NaT:

L = [x.where(x.between('2000-01-01', pd.datetime.now())) for x in L]

并将所有非 NaT 值组合在一起 Series.combine_first :

s2 = reduce(lambda l,r: pd.Series.combine_first(l,r), L)
print (s2)
0 2018-07-12
1 2018-07-12
2 2018-05-01
3 2018-05-01
4 2018-04-25
5 2018-04-25
6 2018-06-15
7 2018-06-15
8 2018-01-13
9 2018-01-13
10 2017-12-20
11 2017-12-20
12 NaT
13 NaT
Name: inv_id, dtype: datetime64[ns]

最后一次检查+-180天:

df['new'] = s2.between(s2 - pd.Timedelta(180, unit='d'), s2 + pd.Timedelta(180, unit='d'))
<小时/>
print (df)
cluster_id amount inv_id inv_date new
0 1 309.9 07121830990 2018-07-12 True
1 1 309.9 07121830990 2018-07-12 True
2 2 3130.0 20180501313000B 2018-05-01 True
3 2 3130.0 20180501313000B 2018-05-01 True
4 3 3330.5 201804253330.50 2018-04-25 True
5 3 3330.5 201804253330.50 2018-04-25 True
6 4 70.0 61518 2018-06-15 True
7 4 70.0 61518 2018-06-15 True
8 5 100.0 011318 2018-01-13 True
9 5 100.0 011318 2018-01-13 True
10 6 50.0 12202017 2017-12-20 True
11 6 50.0 12202017 2017-12-20 True
12 7 101.0 0000014482 2017-10-01 False
13 7 101.0 0000014482 2017-10-01 False

编辑:

添加了从末尾删除子字符串的解决方案:

import re
from functools import reduce

df['amt_str'] = (df['amount']*100).round().astype(int).astype(str)
df['inv_str'] = df.inv_id.str.replace(r'\D+', '').str.zfill(6)

#https://stackoverflow.com/a/1038845/2901002
df['inv_str'] = df.apply(lambda x: re.sub('{}$'.format(x['amt_str']),'', x['inv_str']),axis=1)
print (df)
cluster_id amount inv_id inv_date amt_str inv_str
0 1 309.9 07121830990 2018-07-12 30990 071218
1 1 309.9 07121830990 2018-07-12 30990 071218
2 2 3130.0 20180501313000B 2018-05-01 313000 20180501
3 2 3130.0 20180501313000B 2018-05-01 313000 20180501
4 3 3330.5 201804253330.50 2018-04-25 333050 20180425
5 3 3330.5 201804253330.50 2018-04-25 333050 20180425
6 4 70.0 61518 2018-06-15 7000 061518
7 4 70.0 61518 2018-06-15 7000 061518
8 5 100.0 011318 2018-01-13 10000 011318
9 5 100.0 011318 2018-01-13 10000 011318
10 6 50.0 12202017 2017-12-20 5000 12202017
11 6 50.0 12202017 2017-12-20 5000 12202017
12 7 101.0 0000014482 2017-10-01 10100 0000014482
13 7 101.0 0000014482 2017-10-01 10100 0000014482
<小时/>
formats = {'%m%d%y':6, 
'%y%m%d':6,
'%Y%m%d':8,
'%m%d%Y':8}
L=[pd.to_datetime(df['inv_str'].str[:v],format=k, errors='coerce') for k,v in formats.items()]
L = [x.where(x.between('2000-01-01', pd.datetime.now())) for x in L]

s2 = reduce(lambda l,r: pd.Series.combine_first(l,r), L)
df['new'] = s2.between(s2 - pd.Timedelta(180, unit='d'), s2 + pd.Timedelta(180, unit='d'))
print (df)
cluster_id amount inv_id inv_date amt_str inv_str new
0 1 309.9 07121830990 2018-07-12 30990 071218 True
1 1 309.9 07121830990 2018-07-12 30990 071218 True
2 2 3130.0 20180501313000B 2018-05-01 313000 20180501 True
3 2 3130.0 20180501313000B 2018-05-01 313000 20180501 True
4 3 3330.5 201804253330.50 2018-04-25 333050 20180425 True
5 3 3330.5 201804253330.50 2018-04-25 333050 20180425 True
6 4 70.0 61518 2018-06-15 7000 061518 True
7 4 70.0 61518 2018-06-15 7000 061518 True
8 5 100.0 011318 2018-01-13 10000 011318 True
9 5 100.0 011318 2018-01-13 10000 011318 True
10 6 50.0 12202017 2017-12-20 5000 12202017 True
11 6 50.0 12202017 2017-12-20 5000 12202017 True
12 7 101.0 0000014482 2017-10-01 10100 0000014482 False
13 7 101.0 0000014482 2017-10-01 10100 0000014482 False

关于python - pandas 字符串以列的值结尾,然后将字符串的开头转换为日期进行比较,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/52113181/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com