
python - How to assign a unique column name if the same set pattern is repeated again in the data frame using Pandas?

Reposted. Author: 行者123. Updated: 2023-12-02 02:39:20

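I am trying to create a new column, Group (Cluster), using the following logic.

Logic: the script checks the Vendor, Text, and Days columns. If consecutive records share the same Vendor and Text and their Days value is less than or equal to 2, they are grouped into one cluster.

My code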

import pandas as pd

# Parse the dates, then compute the gap in days from the previous row
data['Date'] = pd.to_datetime(data['Date'], infer_datetime_format=True)
data['Days'] = data['Date'].diff(1).dt.days.fillna(0)
data['Text'] = data['Text'].fillna('No Value')
data['Vendor'] = data['Vendor'].fillna('No Value')
# Within each (Text, Vendor) group, start a new block when the value jumps by more than 2
diff = lambda x: x.diff().fillna(0).gt(2).cumsum()
t = data.groupby(['Text', 'Vendor']).Days.transform(diff)
g = data.groupby(['Text', 'Vendor', t], sort=False).ngroup()
data = data.assign(Group=g.add(1).astype(str).radd('Cluster'))
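
The lambda in the code above relies on a common pandas idiom: compare each gap with a threshold, then take a cumulative sum of the resulting booleans so that every "too large" gap starts a new block. The following is a minimal sketch of that idiom in isolation. It assumes illustrative dates that are not taken from the question, and it applies the threshold directly to the date gaps rather than to a precomputed difference column.

```python
import pandas as pd

# Illustrative dates (an assumption for this sketch, not the question's data)
dates = pd.Series(pd.to_datetime(
    ["2012-01-01", "2012-01-02", "2012-01-04", "2012-02-09", "2012-02-10"]
))

# Gap in days from the previous row; the first row has no predecessor, so use 0
gap_days = dates.diff().dt.days.fillna(0)

# True where the gap exceeds 2 days; cumsum turns each True into a new block id
block = gap_days.gt(2).cumsum()

print(block.tolist())  # [0, 0, 0, 1, 1]
```

Rows 0 to 2 are at most 2 days apart and share block 0, while the 36-day jump before row 3 opens block 1.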

My current output
 Invoice    Date    Text    Vendor  Days    Group       
1234567 1/1/2012 Repairs A 0 Cluster1
1234568 2/1/2012 Repairs A 1 Cluster1
1234569 4/1/2012 Repairs A 2 Cluster1
1234570 6/1/2012 Water A 2 Cluster2
1234571 9/1/2012 Water A 3 Cluster2
1234572 9/1/2012 Car A 0 Cluster3
1234573 9/1/2012 Bus A 0 Cluster4
1234574 9/1/2012 Bike A 0 Cluster5
1234575 9/1/2012 Repairs A 0 Cluster6
1234576 10/1/2012 Repairs A 1 Cluster6
1234577 11/1/2012 Repairs A 1 Cluster6
1234578 12/1/2012 Water A 2 Cluster6
1234579 13/1/2012 Water A 1 Cluster2
1234580 14/1/2012 Water A 1 Cluster2

Expected output
 Invoice    Date        Text    Vendor  Days    Group
1234567 1/1/2012 Repairs A 0 Cluster1
1234568 2/1/2012 Repairs A 1 Cluster1
1234569 4/1/2012 Repairs A 2 Cluster1
1234570 6/1/2012 Water A 2 Cluster2
1234571 9/1/2012 Water A 3 Cluster2
1234572 9/1/2012 Car A 0 No Cluster
1234573 9/1/2012 Bus A 0 No Cluster
1234574 9/1/2012 Bike A 0 No Cluster
1234575 9/1/2012 Repairs A 0 Cluster3
1234576 10/1/2012 Repairs A 1 Cluster3
1234577 11/1/2012 Repairs A 1 Cluster3
1234578 12/1/2012 Water A 2 Cluster4
1234579 13/1/2012 Water A 1 Cluster4
1234580 14/1/2012 Water A 1 Cluster4

Test data
  Invoice     Date      Text   Vendor   Days    Group   Expected Group
1000001 1/1/2012 Repair A 0 Cluster1 Cluster1
1000003 2/1/2012 Repair A 1 Cluster1 Cluster1
1000005 4/1/2012 Repair A 2 Cluster1 Cluster1
1000007 6/1/2012 Water A 2 No Cluster No Cluster
1000008 9/2/2012 Repair A 34 Cluster2 No Cluster
1000010 9/2/2012 Garden A 0 Cluster3 Cluster2
1000011 10/2/2012 Garden A 1 Cluster3 Cluster2
1000012 15/2/2012 Car A 5 Cluster4 Cluster3
1000013 16/2/2012 Car A 1 Cluster4 Cluster3
1000015 17/2/2012 Car A 1 Cluster4 Cluster3
1234574 17/2/2012 Bike A 0 No Cluster No Cluster

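How can this be done in Python?

Best answer

The idea is to create a new Series, g1, that labels consecutive groups of identical values in the Text and Vendor columns, keep only the rows whose g1 label is duplicated (that is, runs longer than one row), and finally add back the non-matching rows with Series.reindex: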

import pandas as pd

# Parse day-first dates and sort so that diff() compares neighbouring invoices
data['Date'] = pd.to_datetime(data['Date'], infer_datetime_format=True, dayfirst=True)
data.sort_values(['Vendor', 'Date'], inplace=True)
data['Date_Difference'] = data['Date'].diff(1).dt.days.fillna(0)
data['Text'] = data['Text'].fillna('No Value')
data['Vendor'] = data['Vendor'].fillna('No Value')
diff = lambda x: x.diff().fillna(0).gt(2).cumsum()
t = data.groupby(['Text', 'Vendor']).Date_Difference.transform(diff)

# g1 labels consecutive runs of identical (Text, Vendor) pairs
g1 = data[['Text', 'Vendor']].ne(data[['Text', 'Vendor']].shift()).any(axis=1).cumsum()
# m is True for rows whose run contains more than one row
m = g1.duplicated(keep=False)

# Number the remaining (run, block) groups; rows outside m become 'No Cluster'
g = data[m].groupby([g1, t], sort=False).ngroup()
clust = g.add(1).astype(str).radd('Cluster').reindex(data.index, fill_value='No Cluster')

data = data.assign(Group=clust)
print(data)
Invoice Date Text Vendor Days Group Date_Difference
0 1000001 2012-01-01 Repair A 0 Cluster1 0.0
1 1000003 2012-01-02 Repair A 1 Cluster1 1.0
2 1000005 2012-01-04 Repair A 2 Cluster1 2.0
3 1000007 2012-01-06 Water A 2 No Cluster 2.0
4 1000008 2012-02-09 Repair A 34 No Cluster 34.0
5 1000010 2012-02-09 Garden A 0 Cluster2 0.0
6 1000011 2012-02-10 Garden A 1 Cluster2 1.0
7 1000012 2012-02-15 Car A 5 Cluster3 5.0
8 1000013 2012-02-16 Car A 1 Cluster3 1.0
9 1000015 2012-02-17 Car A 1 Cluster3 1.0
10 1234574 2012-02-17 Bike A 0 No Cluster 0.0
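
To see the run-labelling step on its own, here is a small self-contained sketch. It assumes a toy frame with only Text and Vendor columns, and the names run_id, in_run, and labels are illustrative rather than part of the answer. It leaves out the date threshold and shows only how shift, ne, and cumsum label consecutive runs, how duplicated(keep=False) drops single-row runs, and how reindex fills the dropped rows.

```python
import pandas as pd

# Toy data (an assumption for this sketch): "Repair" appears in two separate runs
df = pd.DataFrame({
    "Text": ["Repair", "Repair", "Water", "Repair", "Car", "Car"],
    "Vendor": ["A"] * 6,
})

keys = df[["Text", "Vendor"]]
# A row starts a new run when any key differs from the previous row
run_id = keys.ne(keys.shift()).any(axis=1).cumsum()

# Keep only rows whose run has at least two members
in_run = run_id.duplicated(keep=False)

# Number the surviving runs in order of appearance, then restore the full index
labels = (
    df[in_run]
    .groupby(run_id[in_run], sort=False)
    .ngroup()
    .add(1)
    .astype(str)
    .radd("Cluster")
    .reindex(df.index, fill_value="No Cluster")
)

print(labels.tolist())
# ['Cluster1', 'Cluster1', 'No Cluster', 'No Cluster', 'Cluster2', 'Cluster2']
```

Because run_id is built from row-to-row changes rather than from the values themselves, the second "Repair" row at index 3 is not merged with the first run, which is the behaviour the question asks for.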

For "python - How to assign a unique column name if the same set pattern is repeated again in the data frame using Pandas?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60753369/
