
python - Remove outliers based on two columns in a dataframe

Repost. Author: 行者123. Updated: 2023-12-01 08:20:37

I have a dataframe as follows:

Year Month Equipment   Weight
2017 1 TennisBall 5
2017 1 Football 4
2017 1 TennisBall 6
2017 1 TennisBall 7
2017 1 TennisBall 300
2017 2 TennisBall 300
2018 2 TennisBall 250
2018 2 Football 5
2018 2 TennisBall 6
2018 2 TennisBall 275
...

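In the example above, shipping 300 tennis balls is normal only in February, so an order of 6 units is an outlier there, whereas in January the normal quantity is around 5, which makes any larger order in that month an outlier. I want to remove outliers based on the weight for each month. Is there a simple way to do this? I know I can do something like: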

df1[np.abs(df1.Weight-df1.Weight.mean()) <= (5*df1.Weight.std())]

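to grab anything whose weight is within 5 standard deviations of the mean, but this does not take the month into account, and I can see that the weight varies dramatically from month to month. Thanks!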
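As a quick check of that limitation, here is a minimal sketch that applies the same global filter to the ten sample rows shown above. The DataFrame construction is an assumption made for illustration (the rows hidden behind "..." are not included), and `global_mask` and `filtered` are illustrative names.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; rows elided by "..." are left out.
df1 = pd.DataFrame({
    "Year": [2017] * 6 + [2018] * 4,
    "Month": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Equipment": ["TennisBall", "Football", "TennisBall", "TennisBall",
                  "TennisBall", "TennisBall", "TennisBall", "Football",
                  "TennisBall", "TennisBall"],
    "Weight": [5, 4, 6, 7, 300, 300, 250, 5, 6, 275],
})

# Global filter: one mean and one standard deviation for the whole column.
global_mask = np.abs(df1.Weight - df1.Weight.mean()) <= 5 * df1.Weight.std()
filtered = df1[global_mask]

# Compare row counts before and after filtering.
print(len(df1), len(filtered))
```

On this sample the global filter drops nothing, because the mix of small and large weights inflates the overall standard deviation, so neither the 300 in January nor the 6 in February stands out.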

Edit: for example, the desired output would look like this:

Year Month Equipment   Weight
2017 1 TennisBall 5
2017 1 Football 4
2017 1 TennisBall 6
2017 1 TennisBall 7

2017 2 TennisBall 300
2018 2 TennisBall 250
2018 2 Football 5

2018 2 TennisBall 275
...

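In January, the outlier 300 is removed (it is outside the normal range for January). In February, the outlier 6 is removed (it would be normal in January, but it is not normal for February).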

Best answer

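This is a groupby problem. You can solve it by creating two new columns, one holding each row's deviation from its group mean and one holding the group standard deviation, and then filtering on those columns: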

import numpy as np

# Difference between each Weight and the mean of its (Year, Month, Equipment) group
df['Weight diff'] = df['Weight'].sub(df.groupby(['Year','Month','Equipment'])['Weight'].transform('mean'))
# Standard deviation of each group (NaN for single-row groups)
df['std'] = df.groupby(['Year','Month','Equipment'])['Weight'].transform('std')

# Keep rows within one standard deviation of their group mean
# The "or" condition keeps rows from single-value groups, whose std is NaN
df = df.loc[(np.abs(df['Weight diff']) <= df['std']) | (df['std'].isnull())]

# Remove the helper columns
df = df.drop(['Weight diff', 'std'], axis=1)

>>> print(df)

0 Year Month Equipment Weight
1 2017 1 TennisBall 5
2 2017 1 Football 4
3 2017 1 TennisBall 6
4 2017 1 TennisBall 7
6 2017 2 TennisBall 300
7 2018 2 TennisBall 250
8 2018 2 Football 5
10 2018 2 TennisBall 275
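
The same idea can be wrapped in a small helper so that the grouping keys and the threshold are easy to change. This is only a sketch under stated assumptions: `drop_group_outliers`, its parameters, and the reconstructed sample DataFrame (with a default 0-based index) are hypothetical names introduced here and are not part of the original answer.

```python
import pandas as pd

def drop_group_outliers(df, keys, col="Weight", n_std=1.0):
    """Keep rows whose `col` lies within n_std group standard deviations
    of the group mean (hypothetical helper, for illustration only)."""
    grouped = df.groupby(keys)[col]
    deviation = (df[col] - grouped.transform("mean")).abs()
    std = grouped.transform("std")
    # Single-row groups have a NaN std; keep them instead of dropping them.
    keep = (deviation <= n_std * std) | std.isna()
    return df[keep]

# Sample rows copied from the question; rows elided by "..." are left out.
df = pd.DataFrame({
    "Year": [2017] * 6 + [2018] * 4,
    "Month": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Equipment": ["TennisBall", "Football", "TennisBall", "TennisBall",
                  "TennisBall", "TennisBall", "TennisBall", "Football",
                  "TennisBall", "TennisBall"],
    "Weight": [5, 4, 6, 7, 300, 300, 250, 5, 6, 275],
})

# Same grouping as the answer: one group per Year, Month and Equipment.
result = drop_group_outliers(df, ["Year", "Month", "Equipment"])
print(result)

# Grouping by Month only, for comparison.
by_month = drop_group_outliers(df, ["Month"])
print(by_month)
```

With the answer's grouping, the helper keeps the same eight rows as the output above. Grouping by Month alone would additionally drop the February Football row on this sample, since its weight of 5 sits far from a mean dominated by tennis-ball orders, which is likely why Equipment is part of the grouping key.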

Regarding python - Remove outliers based on two columns in a dataframe, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54671403/
