
python - Calculating T-statistics for different sub-segments using Pandas

Reposted. Author: 行者123. Updated: 2023-12-01 08:24:23

I am trying to compute p-values and t-values for different sub-segments of a dataframe.

The dataframe has two columns. Here are its first 5 rows:

df[["Engagement_score", "Performance"]].head()
   Engagement_score  Performance
0                 6          0.0
1                 5          0.0
2                 7         66.3
3                 3          0.0
4                11          0.0

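I group the dataframe by engagement score and then compute the following statistics for each group:

1) The average performance score (sub_average) and the number of values within the group (sub_bookings)

2) The average performance score of all the other groups (rest_average) and the number of values in those groups (rest_bookings)

The overall performance score and the overall bookings are computed over the whole dataframe.

Here is my code for doing this.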

from scipy import stats

def stats_comparison(i):
    # Per-group mean and count of Performance.
    # Named aggregation is used here because passing a renaming dict
    # to a Series groupby .agg() raises an error in pandas >= 1.0.
    cat = df.groupby(i)['Performance'].agg(
        sub_average='mean',
        sub_bookings='count'
    ).reset_index()
    cat['overall_average'] = df['Performance'].mean()
    cat['overall_bookings'] = df['Performance'].count()
    cat['rest_bookings'] = cat['overall_bookings'] - cat['sub_bookings']
    cat['rest_average'] = (cat['overall_bookings']*cat['overall_average']
                           - cat['sub_bookings']*cat['sub_average'])/cat['rest_bookings']
    cat['t_value'] = stats.ttest_ind(cat['sub_average'], cat['rest_average'])[0]
    cat['prob'] = stats.ttest_ind(cat['sub_average'], cat['rest_average'])[1] # this is the p value
    cat['significant'] = [(lambda x: 1 if x > 0.9 else -1 if x < 0.1 else 0)(i) for i in cat['prob']]
    # if the p value is less than 0.1 then I can confidently say that the 2 samples are different.

    print(cat)

stats_comparison('Engagement_score')

I get the following output, but every sub-segment gets the same p-value and t-value. How can I get a different p-value and t-value for each sub-segment without writing a loop?

   Engagement_score  sub_average  sub_bookings  overall_average  \
0                 3    68.493120          1032         69.18413
1                 4    71.018214           571         69.18413
2                 5    70.265373           670         69.18413
3                 6    68.986506           704         69.18413
4                 7    69.587893           636         69.18413
5                 8    70.215244           656         69.18413
6                 9    63.495813           812         69.18413
7                10    71.235994           664         69.18413
8                11    69.302559           508         69.18413
9                12    81.980952           105         69.18413

   overall_bookings  rest_bookings  rest_average   t_value      prob  \
0              6358           5326     69.318025  0.870172  0.395663
1              6358           5787     69.003162  0.870172  0.395663
2              6358           5688     69.056769  0.870172  0.395663
3              6358           5654     69.208737  0.870172  0.395663
4              6358           5722     69.139252  0.870172  0.395663
5              6358           5702     69.065503  0.870172  0.395663
6              6358           5546     70.016967  0.870172  0.395663
7              6358           5694     68.944854  0.870172  0.395663
8              6358           5850     69.173846  0.870172  0.395663
9              6358           6253     68.969247  0.870172  0.395663

Best answer

I think you can simply loop over the engagement groups.
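As for why the question's code repeats one value on every row: stats.ttest_ind compares its two inputs as two whole samples and, for 1-D inputs, returns a single statistic and a single p-value. Passing the columns of group averages therefore tests "all sub averages" against "all rest averages" once, and assigning the scalar result to a column broadcasts it to every row. A minimal sketch with made-up numbers (the arrays below are hypothetical stand-ins for cat['sub_average'] and cat['rest_average']):

```python
import numpy as np
from scipy import stats

# Hypothetical per-group summaries, standing in for the question's
# cat['sub_average'] and cat['rest_average'] columns.
sub_average = np.array([68.5, 71.0, 70.3, 69.0])
rest_average = np.array([69.3, 69.0, 69.1, 69.2])

# The two arrays are treated as two samples of four observations each.
t, p = stats.ttest_ind(sub_average, rest_average)

# Both results are scalars (0 dimensions), not one value per group.
print(np.ndim(t), np.ndim(p))
```

To get one test per group, the test has to see the raw Performance values of the group and of everything outside it, which is what the loop in this answer does.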

Sample data

import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(123)
df = pd.DataFrame({'Engagement Score': np.random.choice(list('abcde'), 1000),
                   'Performance': np.random.normal(0,1,1000)})

Code

# Get all of the subgroup averages and counts
d = {'mean': 'sub_average', 'size': 'sub_bookings'}
df_res = df.groupby('Engagement Score').Performance.agg(['mean', 'size']).rename(columns=d)

# Add overall values
df_res['overall_avg'] = df.Performance.mean()
df_res['overall_bookings'] = len(df)

# T-test of each subgroup against everything not in that subgroup.
for grp in df['Engagement Score'].unique():
    # mask to separate the groups
    m = df['Engagement Score'] == grp
    # Decide whether you want to assume equal variances. equal_var=True by default.
    t,p = stats.ttest_ind(df.loc[m, 'Performance'], df.loc[~m, 'Performance'])
    df_res.loc[grp, 't-stat'] = t
    df_res.loc[grp, 'p-value'] = p

Output df_res:

                  sub_average  sub_bookings  overall_avg  overall_bookings     t-stat   p-value
Engagement Score
a                   -0.024469           203     -0.03042              1000   0.094585  0.924663
b                   -0.053663           206     -0.03042              1000  -0.372866  0.709328
c                    0.080888           179     -0.03042              1000   1.638958  0.101537
d                   -0.127941           224     -0.03042              1000  -1.652303  0.098787
e                   -0.001161           188     -0.03042              1000   0.443412  0.657564

As expected, nothing is significant at the conventional 0.05 level, since all the values come from the same normal distribution.
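Since the question asked for a way to avoid the loop, here is one possible loop-free sketch. It is not part of the original answer. It assumes the equal-variance t-test and relies on scipy.stats.ttest_ind_from_stats, which takes means, standard deviations, and counts (as arrays) instead of raw samples. The "rest" statistics for each group are derived from the overall totals minus that group's share. The names g, n, s, ss, and vec_res are chosen for illustration only.

```python
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(123)
df = pd.DataFrame({'Engagement Score': np.random.choice(list('abcde'), 1000),
                   'Performance': np.random.normal(0, 1, 1000)})

# Per-group count, sum, and sum of squares of Performance.
g = (df.assign(sq=df['Performance'] ** 2)
       .groupby('Engagement Score')
       .agg(n=('Performance', 'size'),
            s=('Performance', 'sum'),
            ss=('sq', 'sum')))

n = g['n'].to_numpy(dtype=float)
s = g['s'].to_numpy()
ss = g['ss'].to_numpy()

# Subgroup mean and sample standard deviation (ddof=1).
sub_mean = s / n
sub_std = np.sqrt((ss - n * sub_mean ** 2) / (n - 1))

# "Rest" statistics: overall totals minus the subgroup's contribution.
rest_n = n.sum() - n
rest_mean = (s.sum() - s) / rest_n
rest_std = np.sqrt(((ss.sum() - ss) - rest_n * rest_mean ** 2) / (rest_n - 1))

# The inputs are arrays, so all groups are tested in a single call.
t, p = stats.ttest_ind_from_stats(sub_mean, sub_std, n,
                                  rest_mean, rest_std, rest_n,
                                  equal_var=True)

vec_res = pd.DataFrame({'t-stat': t, 'p-value': p}, index=g.index)
print(vec_res)
```

Passing equal_var=False instead gives Welch's test, which does not assume equal variances. One caveat: computing variances from sums of squares can lose precision when the mean is very large relative to the spread, so the explicit loop remains the safer choice for such data.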

Regarding "python - Calculating T-statistics for different sub-segments using Pandas", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54378506/
