
python - Find the mean from groups and display all information

Reposted. Author: 行者123. Updated: 2023-12-01 06:26:37

I have this dataframe.

import pandas as pd

df1 = pd.DataFrame({'userId': [1,1,1,2,2,3,4,4],
'movieId': [500,600,700,1100,1200,600,600,1900],
'ratings': [3.5,4.5,2.0,5.0,4.0,4.5,5.0,3.5]})


df2 = pd.DataFrame({'userId':[1,1,2,3,4,5],
'movieId':[500,600,1100,800,900,600],
'tag':['Highly quotable','Boxing story','MMA','Tom Hardy','Fun','long movie']})


frames = [df1, df2]
result = pd.concat(frames, sort = False)
result

   userId  movieId  ratings              tag
0       1      500      3.5              NaN
1       1      600      4.5              NaN
2       1      700      2.0              NaN
3       2     1100      5.0              NaN
4       2     1200      4.0              NaN
5       3      600      4.5              NaN
6       4      600      5.0              NaN
7       4     1900      3.5              NaN
0       1      500      NaN  Highly quotable
1       1      600      NaN     Boxing story
2       2     1100      NaN              MMA
3       3      800      NaN        Tom Hardy
4       4      900      NaN              Fun
5       5      600      NaN       long movie

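I'm trying to group by movieId. What I want is to count the occurrences of each movie. If the count is 2 or more, the mean of ratings should be taken for that movie and all the information displayed. I have tried this, but it raises an error: KeyError: 'ratings'

Here is the code: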

group = result.groupby('movieId')['movieId'].count().reset_index(name="count")
agg = group['ratings'].mean().reset_index(name="mean")
agg
#right code here
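
A note on the error: after `reset_index(name="count")`, `group` contains only the columns `movieId` and `count`, so `group['ratings']` has nothing to look up and raises the KeyError. The sketch below is one possible way to get the count and mean while keeping every row, assuming that only the rating rows in `df1` should be counted. The names `with_stats`, `n_ratings`, `mean_rating` and `popular` are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'userId': [1, 1, 1, 2, 2, 3, 4, 4],
                    'movieId': [500, 600, 700, 1100, 1200, 600, 600, 1900],
                    'ratings': [3.5, 4.5, 2.0, 5.0, 4.0, 4.5, 5.0, 3.5]})

# transform() returns one value per original row, so the per-movie
# count and mean can sit next to the untouched userId/movieId/ratings columns.
with_stats = df1.assign(
    n_ratings=df1.groupby('movieId')['ratings'].transform('count'),
    mean_rating=df1.groupby('movieId')['ratings'].transform('mean'),
)

# Keep only movies that were rated at least twice (assumed threshold from the question).
popular = with_stats[with_stats['n_ratings'] >= 2]
print(popular)
```

Because `transform` preserves the original index, no merge or reset is needed to display the full rows.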

Best answer

I'd suggest something different. Instead of concat, I would use pd.merge.

Take a look at this:

import pandas as pd

df1 = pd.DataFrame({'userId': [1,1,1,2,2,3,4,4],
'movieId': [500,600,700,1100,1200,600,600,1900],
'ratings': [3.5,4.5,2.0,5.0,4.0,4.5,5.0,3.5]})


df2 = pd.DataFrame({'userId':[1,1,2,3,4,5],
'movieId':[500,600,1100,800,900,600],
'tag':['Highly quotable','Boxing story','MMA','Tom Hardy','Fun','long movie']})

# Merge df1 and df2 on movieId so ratings and tags share a row instead of being padded with NaN
result = df1.merge(df2[['movieId', 'tag']], on='movieId', how='left')

# Group by movieId and tag, computing two aggregations (count and mean) with agg
result.groupby(by=['movieId', 'tag'], as_index=False).agg({'ratings': ['count', 'mean']})

The output will be:

   movieId              tag ratings
                              count      mean
0      500  Highly quotable       1  3.500000
1      600     Boxing story       3  4.666667
2      600       long movie       3  4.666667
3     1100              MMA       1  5.000000
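
Two things in this output are worth checking: movie 600 appears twice, and movies 700, 1200 and 1900 are missing. The following is a small diagnostic sketch (the variable names `merged`, `untagged` and `tags_per_movie` are illustrative, not part of the answer) showing that the merge repeats each rating once per matching tag, and that movies without a tag end up with NaN in `tag`, which `groupby` skips by default.

```python
import pandas as pd

df1 = pd.DataFrame({'userId': [1, 1, 1, 2, 2, 3, 4, 4],
                    'movieId': [500, 600, 700, 1100, 1200, 600, 600, 1900],
                    'ratings': [3.5, 4.5, 2.0, 5.0, 4.0, 4.5, 5.0, 3.5]})

df2 = pd.DataFrame({'userId': [1, 1, 2, 3, 4, 5],
                    'movieId': [500, 600, 1100, 800, 900, 600],
                    'tag': ['Highly quotable', 'Boxing story', 'MMA',
                            'Tom Hardy', 'Fun', 'long movie']})

merged = df1.merge(df2[['movieId', 'tag']], on='movieId', how='left')

# Each rating row is repeated once per matching tag, so the frame grows.
print(len(df1), len(merged))

# Movies with no tag in df2 get NaN in 'tag'; groupby drops NaN keys by default.
untagged = merged.loc[merged['tag'].isna(), 'movieId'].tolist()
print(untagged)

# Distinct tags per movie in df2; any value above 1 produces repeated rating rows.
tags_per_movie = df2.groupby('movieId')['tag'].nunique()
print(tags_per_movie[tags_per_movie > 1])
```

The per-tag counts and means are still correct for each (movieId, tag) pair, since every tag sees each rating exactly once.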

Hope it works for you.

EDIT

As you asked in the comments, if you want to filter the dataframe, just run the following code:

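# Keep the aggregated frame (the groupby call above only displayed it)
result = result.groupby(by=['movieId', 'tag'], as_index=False).agg({'ratings': ['count', 'mean']})

# Removing multiindex columns (just to be easier for you)
result = result.droplevel(0, axis=1)
result.columns = ['movieId', 'tag', 'ratings_count', 'ratings_mean']

# Filtering
result = result[result['ratings_count'] >= 2]
result = result[result['ratings_mean'] >= 3]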

There are better ways to do this, but I assume you don't yet know how to work with a pandas MultiIndex, so I went with a simple solution.
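
One such alternative, offered as a sketch rather than the answer's own method, is named aggregation (available in pandas 0.25 and later). It produces flat column names directly, so no MultiIndex handling is needed. The names `merged`, `summary` and `filtered` are illustrative; the thresholds are the ones used in the filtering step above.

```python
import pandas as pd

df1 = pd.DataFrame({'userId': [1, 1, 1, 2, 2, 3, 4, 4],
                    'movieId': [500, 600, 700, 1100, 1200, 600, 600, 1900],
                    'ratings': [3.5, 4.5, 2.0, 5.0, 4.0, 4.5, 5.0, 3.5]})

df2 = pd.DataFrame({'userId': [1, 1, 2, 3, 4, 5],
                    'movieId': [500, 600, 1100, 800, 900, 600],
                    'tag': ['Highly quotable', 'Boxing story', 'MMA',
                            'Tom Hardy', 'Fun', 'long movie']})

merged = df1.merge(df2[['movieId', 'tag']], on='movieId', how='left')

# Named aggregation: output_column=(input_column, function)
summary = (
    merged.groupby(['movieId', 'tag'], as_index=False)
          .agg(ratings_count=('ratings', 'count'),
               ratings_mean=('ratings', 'mean'))
)

# Apply both filters in a single boolean mask.
filtered = summary[(summary['ratings_count'] >= 2) & (summary['ratings_mean'] >= 3)]
print(filtered)
```

With the sample data, only the two rows for movie 600 pass both filters.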

Regarding python - Find the mean from groups and display all information, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/60114126/
