
Subset pandas dataframe to get a specific number of rows based on values in another dataframe




I have a pandas dataframe as follows:



df1

   site_id        date  hour  reach                                  maid
0    16002  2023-09-02    21    NaN  33f9fad6-20c5-426c-962f-bc2fbb82aecb
1    16002  2023-09-04    17    NaN  33f9fad6-20c5-426c-962f-bc2fbb82aecb
2    16002  2023-09-04    19    NaN  4a676aeb-6f6f-4622-934b-59b8f149aad7
3    16002  2023-09-04    17    NaN  35363191-c6aa-49fb-beb1-04a98898bed2
4    16002  2023-09-03    22    NaN  a44beb20-a90a-4135-be18-6dda71eeb7c2

I have created another dataframe based on the above dataframe that provides the count of records for each [site_id, date, hour] combination.



df2

      site_id        date  hour  count
1666    37226  2023-09-02     8   4586
1676    37226  2023-09-03    16   3586
639     36972  2023-09-03    21    235
640     36972  2023-09-03    22   5431
641     36972  2023-09-03    23    343
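
For reference, a counts dataframe like df2 can be produced from df1 with a groupby; a minimal sketch, assuming df1 is already loaded as shown above:

# Minimal sketch (assumption: df1 is loaded as shown above): count the
# records per (site_id, date, hour) combination.
df2 = (
    df1.groupby(["site_id", "date", "hour"])
       .size()
       .reset_index(name="count")
)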

I want to filter the first dataframe and get exactly the number of records given in the count column of the second dataframe. For example, I want to get the 4586 records from the first dataframe matching site_id 37226, date 2023-09-02, and hour 8.


I tried using a for loop over the second dataframe as follows:


# k3 corresponds to df2 (the counts) and dff to df1 (the full data).
for index, rows in k3.iterrows():
    sid = rows['site_id']
    dt = rows['date']
    hr = rows['hour']
    cnt = rows['count']
    # All rows of df1 matching this (site_id, date, hour) combination...
    kdf1 = dff[(dff['site_id'] == sid) & (dff['date'] == dt) & (dff['hour'] == hr)]
    # ...trimmed to the first cnt rows.
    kdf2 = kdf1[:cnt]

This works, but it is extremely slow. Is there a faster method to get the subset? I am also attaching links to both sample dataframes:


Link to df1 and df2



Answers:

You can merge the count from df2 into df1, and then use .groupby to trim each group down to its count:


cols = ["site_id", "date", "hour"]

# Attach the per-group count to every row, then keep only the first
# `count` rows of each (site_id, date, hour) group.
df1 = df1.merge(df2, on=cols, how="right")
df1 = df1.groupby(cols, group_keys=False).apply(lambda x: x[: x["count"].iloc[0]])
df1.pop("count")

print(df1.head())

Prints:



   site_id        date  hour  reach                                  maid
0    37221  2023-09-03    19    NaN  3e769e74-9129-49ba-838d-c36f3a9b3335
1    37221  2023-09-03    19    NaN  71e258d2-5155-4001-9b3c-02a1a1f9c9fb
2    37221  2023-09-03    19    NaN  92eaee88-b41c-4999-b1b8-6be183e5d2cf
3    37221  2023-09-03    19    NaN  c6eb504a-9259-410b-8391-7b06b3e92a41
4    37221  2023-09-03    19    NaN  c36400ff-0790-4844-b58b-2e4cdaafb4d9

Note: With your data, this method takes ~0.15 seconds versus ~11.2 seconds for your original version.
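
If you want to reproduce the comparison yourself, a rough timing sketch (my own wrapper around the approach above, not part of the answer; assumes df1 and df2 are loaded) might look like:

import time

# Hypothetical wrapper around the merge + groupby approach, purely for timing.
def merge_groupby_approach(df1, df2, cols=("site_id", "date", "hour")):
    cols = list(cols)
    out = df1.merge(df2, on=cols, how="right")
    out = out.groupby(cols, group_keys=False).apply(lambda x: x[: x["count"].iloc[0]])
    out.pop("count")
    return out

start = time.perf_counter()
merge_groupby_approach(df1, df2)
print(f"merge+groupby: {time.perf_counter() - start:.3f} s")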



Add a sequential counter per group using cumcount, then merge the dataframes and keep the rows where the counter is less than or equal to the required count:


c = ['site_id', 'date', 'hour']

# Number the rows within each (site_id, date, hour) group starting at 1,
# then keep only the rows whose counter does not exceed the required count.
df1['rnum'] = df1.groupby(c).cumcount().add(1)
result = df1.merge(df2, how='left', on=c).query('rnum <= count')



   site_id        date  hour  reach                                  maid  rnum  count
0    16002  2023-09-02    21    NaN  33f9fad6-20c5-426c-962f-bc2fbb82aecb     1     71
1    16002  2023-09-04    17    NaN  33f9fad6-20c5-426c-962f-bc2fbb82aecb     1     40
2    16002  2023-09-04    19    NaN  4a676aeb-6f6f-4622-934b-59b8f149aad7     1     50
3    16002  2023-09-04    17    NaN  35363191-c6aa-49fb-beb1-04a98898bed2     2     40
4    16002  2023-09-03    22    NaN  a44beb20-a90a-4135-be18-6dda71eeb7c2     1     61
5    16002  2023-09-02     8    NaN  37dfb047-c058-409b-aa3a-24e2e5c03f35     1     32
6    16002  2023-09-02    11    NaN  37dfb047-c058-409b-aa3a-24e2e5c03f35     1     54
7    16002  2023-09-03    10    NaN  f52924a8-c487-4355-9e67-aa8392c4635a     1     45
8    16002  2023-09-04    21    NaN  7c41c274-da15-4d8b-bcde-da0ee7bb7566     1     44
9    16002  2023-09-03    14    NaN  7c41c274-da15-4d8b-bcde-da0ee7bb7566     1     58
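
As a quick sanity check (my addition, not part of the answer), you can confirm that the filtered result contains exactly the requested number of rows per group, given that df2's counts were derived from df1:

# Sanity-check sketch (assumption: `result`, `df2`, and `c` as defined above,
# and df2's counts come from df1, so every group can be fully satisfied).
check = (
    result.groupby(c).size().rename("got").reset_index()
          .merge(df2, on=c, how="left")
)
assert (check["got"] == check["count"]).all()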

Comments:

Many thanks... this worked perfectly. Thank you.

Thank you for the answer. cumcount() is a new function to me, and I think I will find a lot of uses for it in my day-to-day work. Thank you again.
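
For anyone else seeing it for the first time, a tiny illustration of groupby().cumcount() on toy data (not from the question):

import pandas as pd

# groupby().cumcount() numbers the rows within each group, starting at 0;
# .add(1) shifts it to start at 1, as used in the answer above.
toy = pd.DataFrame({"g": ["a", "a", "b", "a", "b"]})
toy["rnum"] = toy.groupby("g").cumcount().add(1)
print(toy)
#    g  rnum
# 0  a     1
# 1  a     2
# 2  b     1
# 3  a     3
# 4  b     2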
