I have a pandas dataframe as follows:
df1
site_id date hour reach maid
0 16002 2023-09-02 21 NaN 33f9fad6-20c5-426c-962f-bc2fbb82aecb
1 16002 2023-09-04 17 NaN 33f9fad6-20c5-426c-962f-bc2fbb82aecb
2 16002 2023-09-04 19 NaN 4a676aeb-6f6f-4622-934b-59b8f149aad7
3 16002 2023-09-04 17 NaN 35363191-c6aa-49fb-beb1-04a98898bed2
4 16002 2023-09-03 22 NaN a44beb20-a90a-4135-be18-6dda71eeb7c2
I have created another dataframe based on the one above that gives the count of records for each [site_id, date, hour] combination.
df2
site_id date hour count
1666 37226 2023-09-02 8 4586
1676 37226 2023-09-03 16 3586
639 36972 2023-09-03 21 235
640 36972 2023-09-03 22 5431
641 36972 2023-09-03 23 343
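For readers who want to reproduce the setup, here is a minimal sketch of one way a count frame like df2 might be derived from df1. The toy rows and the placeholder maid values (m0 to m4) are assumptions for illustration, not the asker's data, and the asker's actual df2 may have been built differently.

```python
import pandas as pd

cols = ["site_id", "date", "hour"]

# Toy stand-in for df1 (placeholder maid values, not the real data).
df1 = pd.DataFrame({
    "site_id": [16002, 16002, 16002, 16002, 16002],
    "date": ["2023-09-02", "2023-09-04", "2023-09-04", "2023-09-04", "2023-09-03"],
    "hour": [21, 17, 19, 17, 22],
    "maid": ["m0", "m1", "m2", "m3", "m4"],
})

# Count the records in each (site_id, date, hour) combination.
df2 = df1.groupby(cols).size().reset_index(name="count")
print(df2)
```

Here `size()` counts rows per group and `reset_index(name="count")` turns the result back into a flat dataframe with a count column.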
I want to filter the first dataframe and get exactly the number of records given in the count column of the second dataframe. For example, I want to get the 4586 records from the first dataframe that match site_id 37226, date 2023-09-02 and hour 8.
I tried using a for loop over the second dataframe as follows:
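# k3 is presumably the counts dataframe (df2) and dff the records dataframe (df1)
for index, rows in k3.iterrows():
    sid = rows['site_id']
    dt = rows['date']
    hr = rows['hour']
    cnt = rows['count']
    # Select the rows for this combination, then keep the first cnt of them
    kdf1 = dff[(dff['site_id'] == sid) & (dff['date'] == dt) & (dff['hour'] == hr)]
    kdf2 = kdf1[:cnt]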
This works, but it is extremely slow. Is there a faster way to get the subset? I am also attaching links to both sample dataframes:
Link to df1 and df2
You can merge the count column from df2 into df1, and then use .groupby to trim each group down to its count:
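cols = ["site_id", "date", "hour"]
# Attach each combination's target count to the matching rows of df1
df1 = df1.merge(df2, on=cols, how="right")
# Keep only the first `count` rows of each group
df1 = df1.groupby(cols, group_keys=False).apply(lambda x: x[: x["count"].iloc[0]])
# Drop the helper column
df1.pop("count")
print(df1.head())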
Prints:
site_id date hour reach maid
0 37221 2023-09-03 19 NaN 3e769e74-9129-49ba-838d-c36f3a9b3335
1 37221 2023-09-03 19 NaN 71e258d2-5155-4001-9b3c-02a1a1f9c9fb
2 37221 2023-09-03 19 NaN 92eaee88-b41c-4999-b1b8-6be183e5d2cf
3 37221 2023-09-03 19 NaN c6eb504a-9259-410b-8391-7b06b3e92a41
4 37221 2023-09-03 19 NaN c36400ff-0790-4844-b58b-2e4cdaafb4d9
Note: With your data, this method takes ~0.15 seconds, versus ~11.2 seconds for your original version.
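To try a comparison like this yourself, a sketch along the following lines may help. Everything here is an assumption for illustration: the data is synthetic, the function names with_loop and vectorized are hypothetical, and the vectorized version uses a merge plus a cumcount mask instead of the groupby/apply call, so the sketch does not depend on how apply treats grouping columns in a given pandas version. Timings on your machine and data will differ from the figures quoted.

```python
import time

import numpy as np
import pandas as pd

cols = ["site_id", "date", "hour"]

# Synthetic stand-in for df1; maid is just a unique row id here.
rng = np.random.default_rng(0)
n = 20_000
df1 = pd.DataFrame({
    "site_id": rng.integers(1, 6, n),
    "date": rng.choice(["2023-09-02", "2023-09-03"], n),
    "hour": rng.integers(0, 24, n),
    "maid": np.arange(n),
})

# Synthetic df2: request at most 10 rows per combination.
df2 = df1.groupby(cols).size().reset_index(name="count")
df2["count"] = df2["count"].clip(upper=10)


def with_loop(df1, df2):
    """Row-by-row filtering, collecting each piece instead of overwriting it."""
    pieces = []
    for sid, dt, hr, cnt in zip(df2["site_id"], df2["date"], df2["hour"], df2["count"]):
        mask = (df1["site_id"] == sid) & (df1["date"] == dt) & (df1["hour"] == hr)
        pieces.append(df1[mask].iloc[:cnt])
    return pd.concat(pieces)


def vectorized(df1, df2):
    """Merge the counts in, then keep rows whose position in the group is below the count."""
    merged = df1.merge(df2, on=cols, how="left")
    position = merged.groupby(cols).cumcount()  # 0-based position within each group
    return merged[position < merged["count"]].drop(columns="count")


start = time.perf_counter()
loop_result = with_loop(df1, df2)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vec_result = vectorized(df1, df2)
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.3f}s, vectorized: {vec_seconds:.3f}s")
```

Both functions keep the first rows of each group in df1's original order, so they should select the same records; only the row order of the combined output differs.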
Add a sequential counter per group using cumcount, then merge the dataframes and keep the rows where the counter is less than or equal to the required count:
c = ['site_id', 'date', 'hour']
df1['rnum'] = df1.groupby(c).cumcount().add(1)
result = df1.merge(df2, how='left', on=c).query('rnum <= count')
site_id date hour reach maid rnum count
0 16002 2023-09-02 21 NaN 33f9fad6-20c5-426c-962f-bc2fbb82aecb 1 71
1 16002 2023-09-04 17 NaN 33f9fad6-20c5-426c-962f-bc2fbb82aecb 1 40
2 16002 2023-09-04 19 NaN 4a676aeb-6f6f-4622-934b-59b8f149aad7 1 50
3 16002 2023-09-04 17 NaN 35363191-c6aa-49fb-beb1-04a98898bed2 2 40
4 16002 2023-09-03 22 NaN a44beb20-a90a-4135-be18-6dda71eeb7c2 1 61
5 16002 2023-09-02 8 NaN 37dfb047-c058-409b-aa3a-24e2e5c03f35 1 32
6 16002 2023-09-02 11 NaN 37dfb047-c058-409b-aa3a-24e2e5c03f35 1 54
7 16002 2023-09-03 10 NaN f52924a8-c487-4355-9e67-aa8392c4635a 1 45
8 16002 2023-09-04 21 NaN 7c41c274-da15-4d8b-bcde-da0ee7bb7566 1 44
9 16002 2023-09-03 14 NaN 7c41c274-da15-4d8b-bcde-da0ee7bb7566 1 58
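The result above still carries the helper columns rnum and count. As a minimal sketch on toy data (the frames and maid values below are made up for illustration), the same steps can be chained and the helpers dropped at the end. Using assign here is a small variation that avoids adding the rnum column to df1 itself.

```python
import pandas as pd

c = ["site_id", "date", "hour"]

# Toy frames standing in for df1 and df2 (hypothetical values).
df1 = pd.DataFrame({
    "site_id": [1, 1, 1, 2, 2],
    "date": ["2023-09-02"] * 5,
    "hour": [8, 8, 8, 9, 9],
    "maid": ["a", "b", "c", "d", "e"],
})
df2 = pd.DataFrame({
    "site_id": [1, 2],
    "date": ["2023-09-02"] * 2,
    "hour": [8, 9],
    "count": [2, 1],
})

# Number the rows within each group starting at 1, without modifying df1.
numbered = df1.assign(rnum=df1.groupby(c).cumcount().add(1))

result = (
    numbered.merge(df2, how="left", on=c)  # bring in the target count per group
    .query("rnum <= count")                # keep the first `count` rows of each group
    .drop(columns=["rnum", "count"])       # remove the helper columns
)
print(result)
```

With how="left", rows of df1 whose combination is missing from df2 get a missing count and are filtered out by the query, since a comparison against a missing value is false.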
Many thanks, this worked perfectly. Thank you.
Thank you for the answer. cumcount() is new to me, and I think I will find a lot of use for it in my day-to-day work. Thank you again.