python-3.x - Comparing DataFrame rows that contain strings

Reposted. Author: 行者123. Updated: 2023-12-04 10:56:17

Consider this DataFrame:

id  name   date_time            strings
1   'AAA'  2018-08-03 18:00:00  1125,1517,656,657
1   'AAA'  2018-08-03 18:45:00  128,131,646,535,157,159
1   'AAA'  2018-08-03 18:49:00  131
1   'BBB'  2018-08-03 19:41:00  0
1   'BBB'  2018-08-05 19:30:00  0
1   'AAA'  2018-08-04 11:00:00  131
1   'AAA'  2018-08-04 11:30:00  1000
1   'AAA'  2018-08-04 11:33:00  1000,5555

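First, I want to look at groups of rows that share the same id and name, and set match to True when a row has at least one string in common with the row before it in the group. (Some rows had no value in the strings column, so they have been filled with 0.) Desired output: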
id  name   date_time            strings                  match
1   'AAA'  2018-08-03 18:00:00  1125,128,1517,656,657    False
1   'AAA'  2018-08-03 18:45:00  128,131,646,535,157,159  True
1   'AAA'  2018-08-03 18:49:00  131                      True
1   'BBB'  2018-08-03 19:41:00  0                        False
1   'BBB'  2018-08-05 19:30:00  0                        False
1   'AAA'  2018-08-04 11:00:00  131                      True
1   'AAA'  2018-08-04 11:30:00  1000                     False
1   'AAA'  2018-08-04 11:33:00  1000,5555                True
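
The matching rule described above can be sketched in plain Python before reaching for pandas. This is only an illustration of the rule, not code from the original post: the helper name consecutive_matches and its placeholder parameter are made up here, and the rows are copied from the desired-output table with the quotes around the names dropped.

```python
def consecutive_matches(rows, placeholder="0"):
    """Return one bool per row: True if the row shares a token with the
    previous row of the same (id, name) group.

    `rows` is an iterable of (id, name, strings) tuples in row order.
    `placeholder` marks an empty strings value and never counts as a token.
    (Hypothetical helper, written for illustration only.)
    """
    last_tokens = {}  # (id, name) -> token set of the previous row in that group
    result = []
    for row_id, name, strings in rows:
        tokens = set(strings.split(",")) - {placeholder}
        previous = last_tokens.get((row_id, name), set())
        result.append(bool(tokens & previous))
        last_tokens[(row_id, name)] = tokens
    return result


rows = [
    (1, "AAA", "1125,128,1517,656,657"),
    (1, "AAA", "128,131,646,535,157,159"),
    (1, "AAA", "131"),
    (1, "BBB", "0"),
    (1, "BBB", "0"),
    (1, "AAA", "131"),
    (1, "AAA", "1000"),
    (1, "AAA", "1000,5555"),
]

matches = consecutive_matches(rows)
print(matches)  # [False, True, True, False, False, True, False, True]
```

The first row of each group has no predecessor, so it is always False, and rows holding only the 0 placeholder never match.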

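Then, group the rows by id and name and compute the time difference between consecutive rows whose match value is True. If the difference is less than 00:05:00, set flag to 1. Final output: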
id  name   date_time            strings                  diff      flag
1   'AAA'  2018-08-03 18:00:00  1125,128,1517,656,657    00:00:00  0
1   'AAA'  2018-08-03 18:45:00  128,131,646,535,157,159  00:00:00  0
1   'AAA'  2018-08-03 18:49:00  131                      00:04:00  1
1   'BBB'  2018-08-03 19:41:00  0                        00:00:00  0
1   'BBB'  2018-08-05 19:30:00  0                        00:00:00  0
1   'AAA'  2018-08-04 11:00:00  131                      16:11:00  0
1   'AAA'  2018-08-04 11:30:00  1000                     00:00:00  0
1   'AAA'  2018-08-04 11:33:00  1000,5555                00:33:00  0

For the first part, I tried this code, but it does not work correctly:
grouped = df.groupby(['id','name'])
z = []
for index,row in grouped:
    z.append(list(zip(row['strings'], row['strings'].shift())))
df['match'] = [bool(set(str(s1).split(','))& set(str(s2).split(','))) for i in range(len(z)) for s1,s2 in z[i]]

For the second part, I tried several different solutions, but none of them worked.

Any hints are appreciated.

Best answer

If you want to compare each row with the previous one, use:

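import pandas as pd

# One 0/1 indicator column per distinct token found in `strings`
dummies = df['strings'].str.get_dummies(',')
# Rows holding only the '0' placeholder never count as a match
c1 = df['strings'].ne('0')
# True where a token is present in this row and in the previous row of the same (id, name) group
c2 = (dummies.groupby([df['id'], df['name']]).shift().eq(dummies) & dummies.ge(1)).any(axis=1)
df['match'] = c1 & c2
# Time since the previous matching row of the same group; 0 for non-matching rows
# (assumes `date_time` already has a datetime dtype)
df['diff'] = (df.groupby(['id', 'name', 'match'])['date_time']
                .diff()
                .where(df['match'])
                .fillna(pd.Timedelta(hours=0)))
print(df)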

   id  name   date_time            strings                  match  diff
0  1   'AAA'  2018-08-03 18:00:00  1125,128,1517,656,657    False  00:00:00
1  1   'AAA'  2018-08-03 18:45:00  128,131,646,535,157,159  True   00:00:00
2  1   'AAA'  2018-08-03 18:49:00  131                      True   00:04:00
3  1   'BBB'  2018-08-03 19:41:00  0                        False  00:00:00
4  1   'BBB'  2018-08-05 19:30:00  0                        False  00:00:00
5  1   'AAA'  2018-08-04 11:00:00  131                      True   16:11:00
6  1   'AAA'  2018-08-04 11:30:00  1000                     False  00:00:00
7  1   'AAA'  2018-08-04 11:33:00  1000,5555                True   00:33:00
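
The output above stops at the diff column, while the question also asks for a flag. One possible way to derive it is sketched below. This is an assumption about what the asker wants rather than part of the answer: the flag is set only where a real difference to a previous matching row exists and is under five minutes, so the zeros that merely fill missing differences do not raise it. The frame is rebuilt from the table in the question (names without the quotes), and raw_diff is a name introduced here.

```python
import pandas as pd

# Sample data copied from the question's desired-output table
df = pd.DataFrame({
    "id": [1] * 8,
    "name": ["AAA", "AAA", "AAA", "BBB", "BBB", "AAA", "AAA", "AAA"],
    "date_time": pd.to_datetime([
        "2018-08-03 18:00:00", "2018-08-03 18:45:00", "2018-08-03 18:49:00",
        "2018-08-03 19:41:00", "2018-08-05 19:30:00", "2018-08-04 11:00:00",
        "2018-08-04 11:30:00", "2018-08-04 11:33:00",
    ]),
    "strings": ["1125,128,1517,656,657", "128,131,646,535,157,159", "131",
                "0", "0", "131", "1000", "1000,5555"],
})

# match: the row shares a token with the previous row of its (id, name) group
dummies = df["strings"].str.get_dummies(",")
prev = dummies.groupby([df["id"], df["name"]]).shift()
df["match"] = df["strings"].ne("0") & (prev.eq(dummies) & dummies.ge(1)).any(axis=1)

# raw_diff stays NaT where there is no previous matching row or the row does not match
raw_diff = df.groupby(["id", "name", "match"])["date_time"].diff().where(df["match"])
df["diff"] = raw_diff.fillna(pd.Timedelta(0))

# Comparisons against NaT are False, so only real differences under 5 minutes are flagged
df["flag"] = raw_diff.lt(pd.Timedelta(minutes=5)).astype(int)
print(df[["name", "date_time", "match", "diff", "flag"]])
```

Computing the flag from raw_diff rather than from the filled diff column is the design choice that keeps the first matching row of a group at 0, as in the final output the question asks for.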

If you want to compare each row with its adjacent rows (the previous and the next one):
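import pandas as pd

dummies = df['strings'].str.get_dummies(',')
c1 = df['strings'].ne('0')  # or df['strings'].ne(0) if the column holds integers
# Per (id, name) group, sum each token indicator over a centred window of 3 rows
rolled = (dummies.groupby([df['id'], df['name']])
                 .rolling(3, center=True, min_periods=1)
                 .sum())
# Keep only the original row labels (the last index level) so the result aligns with df
rolled.index = rolled.index.get_level_values(-1)
# A sum above 1 means the token occurs in at least two rows of the window
c2 = rolled.gt(1).any(axis=1)
df['match'] = c1 & c2
df['diff'] = (df.groupby(['id', 'name', 'match'])['date_time']
                .diff()
                .where(df['match'])
                .fillna(pd.Timedelta(hours=0)))
print(df)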

Output:
   id  name   date_time            strings                  match  diff
0  1   'AAA'  2018-08-03 18:00:00  1125,1517,656,657        False  00:00:00
1  1   'AAA'  2018-08-03 18:45:00  128,131,646,535,157,159  True   00:00:00
2  1   'AAA'  2018-08-03 18:49:00  131                      True   00:04:00
3  1   'BBB'  2018-08-03 19:41:00  0                        False  00:00:00
4  1   'BBB'  2018-08-05 19:30:00  0                        False  00:00:00
5  1   'AAA'  2018-08-04 11:00:00  131                      True   16:11:00
6  1   'AAA'  2018-08-04 11:30:00  1000                     True   00:30:00
7  1   'AAA'  2018-08-04 11:33:00  1000,5555                True   00:03:00
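
As an alternative to the rolling window, the adjacent-row comparison can also be written with two grouped shifts. The sketch below is a swapped-in technique, not the answer's method, and the names has_token, in_prev and in_next are introduced here. It differs in one respect: the rolling-sum test can also mark a row whose two neighbours share a token that the row itself lacks, whereas this version requires the row itself to hold the shared token. On the sample data both give the same match column.

```python
import pandas as pd

# Sample data copied from the output table above (names without the quotes)
df = pd.DataFrame({
    "id": [1] * 8,
    "name": ["AAA", "AAA", "AAA", "BBB", "BBB", "AAA", "AAA", "AAA"],
    "strings": ["1125,1517,656,657", "128,131,646,535,157,159", "131",
                "0", "0", "131", "1000", "1000,5555"],
})

dummies = df["strings"].str.get_dummies(",")
grouped = dummies.groupby([df["id"], df["name"]])

has_token = dummies.eq(1)          # token present in this row
in_prev = grouped.shift(1).eq(1)   # token present in the previous row of the group
in_next = grouped.shift(-1).eq(1)  # token present in the next row of the group

# The row must hold a token that also appears in at least one neighbour
shares_neighbour = (has_token & (in_prev | in_next)).any(axis=1)
df["match"] = df["strings"].ne("0") & shares_neighbour
print(df["match"].tolist())  # [False, True, True, False, False, True, True, True]
```

Because a grouped shift keeps the original row index, no index cleanup is needed before assigning the result back to df.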

For more on "python-3.x - Comparing DataFrame rows that contain strings", see the similar question on Stack Overflow: https://stackoverflow.com/questions/59162659/
