
python - Winsorizing data by column with NaN

Reposted. Author: 太空宇宙. Updated: 2023-11-03 13:59:52

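I want to winsorize several columns of a pandas DataFrame. Each column has some NaN, which affects the winsorization, so they need to be removed. The only way I know how to do this is to remove them for all of the data, rather than removing them column by column.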

MWE:

import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Create Dataframe
N, M, P = 10**5, 4, 10**2
dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
df = pd.DataFrame(np.random.random((N, M)), index=dates)
df.index.names = ['DATE']
df.columns = ['one','two','three','four']
# Now scale them differently so you can see the winsorization
df['four'] = df['four']*(10**5)
df['three'] = df['three']*(10**2)
df['two'] = df['two']*(10**-1)
df['one'] = df['one']*(10**-4)
# Create NaN
df.loc[df.index.get_level_values(0).year == 2002,'three'] = np.nan
df.loc[df.index.get_level_values(0).month == 2,'two'] = np.nan
df.loc[df.index.get_level_values(0).month == 1,'one'] = np.nan

Here is the baseline distribution:

df.quantile([0, 0.01, 0.5, 0.99, 1])

Output:

               one           two      three          four
0.00 2.336618e-10 2.294259e-07 0.002437 2.305353
0.01 9.862626e-07 9.742568e-04 0.975807 1003.814520
0.50 4.975859e-05 4.981049e-02 50.290946 50374.548980
0.99 9.897463e-05 9.898590e-02 98.978263 98991.438985
1.00 9.999983e-05 9.999966e-02 99.996793 99999.437779

This is how I winsorize:

def using_mstats(s):
    return winsorize(s, limits=[0.01, 0.01])

wins = df.apply(using_mstats, axis=0)
wins.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])

That gives this:

Out[356]:
one two three four
0.00 0.000001 0.001060 1.536882 1003.820149
0.01 0.000001 0.001060 1.536882 1003.820149
0.25 0.000025 0.024975 25.200378 25099.994780
0.50 0.000050 0.049810 50.290946 50374.548980
0.75 0.000075 0.074842 74.794537 75217.343920
0.99 0.000099 0.098986 98.978263 98991.436957
1.00 0.000100 0.100000 99.996793 98991.436957

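Column four is correct because it has no NaN, but the other columns are not: for each of them the 99th percentile and the maximum should be the same. The observation counts are identical before and after: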

In [357]: df.count()
Out[357]:
one 90700
two 91600
three 63500
four 100000
dtype: int64

In [358]: wins.count()
Out[358]:
one 90700
two 91600
three 63500
four 100000
dtype: int64
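
A likely explanation for the unchanged maxima above, assuming the routine sorts each column and replaces values by position (an assumption, not a statement about scipy's internals): NumPy sorts NaN after every real number, so the slots treated as "the top 1%" hold NaN rather than the largest real values. The toy array below is made up for illustration and only demonstrates the sort order.

```python
import numpy as np

# Hypothetical toy data: eight real values and two NaN.
values = np.array([5.0, 1.0, np.nan, 3.0, 2.0, np.nan, 4.0, 9.0, 7.0, 8.0])

# NumPy places NaN after every real number when sorting.
order = np.argsort(values)
sorted_values = values[order]

# Treat the last fifth of the sorted slots (2 of 10) as "the upper tail".
n_upper = values.size // 5
top_slots = sorted_values[-n_upper:]

print(sorted_values)
print(top_slots)  # both slots hold NaN, so the real maximum (9.0) is never replaced
```

If this is what happens, the upper tail is spent on NaN entries, and the lower cutoff is computed from the full length of the column rather than from its non-NaN count.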

This is how I "fixed" it, but at the cost of losing a lot of data:

wins2 = df.loc[df.notnull().all(axis=1)].apply(using_mstats, axis=0)
wins2.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])

Output:

Out[360]:
one two three four
0.00 9.686203e-07 0.000928 0.965702 1005.209503
0.01 9.686203e-07 0.000928 0.965702 1005.209503
0.25 2.486052e-05 0.024829 25.204032 25210.837443
0.50 4.980946e-05 0.049894 50.299004 50622.227179
0.75 7.492750e-05 0.075059 74.837900 75299.906415
0.99 9.895563e-05 0.099014 98.972310 99014.311761
1.00 9.895563e-05 0.099014 98.972310 99014.311761

In [361]: wins2.count()
Out[361]:
one 51700
two 51700
three 51700
four 51700
dtype: int64
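
To see why filtering on complete rows is so costly, here is a minimal sketch on a made-up frame (the name `toy` and its values are hypothetical). A row is dropped as soon as any one column is missing, so the surviving row count can be well below every per-column count. `dropna()` is used as a shorthand for the `notnull().all(axis=1)` filter above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: each column is missing values in different rows.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, 2.0, 3.0, 4.0],
    'c': [1.0, 2.0, 3.0, 4.0],
})

per_column = toy.count()            # non-NaN observations in each column
complete_rows = len(toy.dropna())   # rows with no NaN in any column

print(per_column)
print(complete_rows)
```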

How can I winsorize the non-NaN data column by column while preserving the shape of the data (i.e., without dropping rows)?

Best answer

As so often happens, simply creating the MWE helped clarify things. I need to combine clip() with quantile(), as follows:

df2 = df.clip(lower=df.quantile(0.01), upper=df.quantile(0.99), axis=1)
df2.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])

Output:

               one       two      three          four
0.00 9.862626e-07 0.000974 0.975807 1003.814520
0.01 9.862666e-07 0.000974 0.975816 1003.820092
0.25 2.485043e-05 0.024975 25.200378 25099.994780
0.50 4.975859e-05 0.049810 50.290946 50374.548980
0.75 7.486737e-05 0.074842 74.794537 75217.343920
0.99 9.897462e-05 0.098986 98.978245 98991.436977
1.00 9.897463e-05 0.098986 98.978263 98991.438985

In [384]: df2.count()
Out[384]:
one 90700
two 91600
three 63500
four 100000
dtype: int64

The numbers differ from those above because I kept all of the non-missing (non-NaN) data in each column.
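
The clip() plus quantile() approach can be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the function name `winsorize_columns`, its parameters, and the toy data are all hypothetical. It relies on `DataFrame.quantile` skipping NaN when computing the bounds and on `clip` leaving NaN entries untouched. The optional `interpolation` argument is passed through to `quantile`; a value such as 'lower' or 'higher' clips to an observed value instead of an interpolated one.

```python
import numpy as np
import pandas as pd

def winsorize_columns(frame, lower=0.01, upper=0.99, interpolation='linear'):
    """Clip each column to its own quantiles; NaN entries stay NaN."""
    # quantile() skips NaN, so each bound uses only that column's observed values.
    lo = frame.quantile(lower, interpolation=interpolation)
    hi = frame.quantile(upper, interpolation=interpolation)
    # axis=1 aligns the bound Series with the column labels.
    return frame.clip(lower=lo, upper=hi, axis=1)

# Hypothetical toy data: column 'x' is missing its first 100 rows.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((1000, 2)), columns=['x', 'y'])
toy.loc[:99, 'x'] = np.nan

clipped = winsorize_columns(toy)
print(clipped.count())
print(clipped.quantile([0, 0.01, 0.99, 1]))
```

The shape and the per-column counts are unchanged, and each column's minimum and maximum equal the 1st and 99th percentiles of its own non-NaN values.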

Regarding "python - Winsorizing data by column with NaN", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/50612095/
