python - Winsorizing data by column with NaN

Reposted by: 太空宇宙. Updated: 2023-11-03 13:59:52

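I would like to winsorize several columns of a pandas DataFrame. Each column contains some NaN values, which distort the winsorization, so they have to be excluded. The only way I know to do this is to drop every row that has a NaN in any column, rather than excluding the NaN values column by column.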

MWE:

import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Create Dataframe
N, M, P = 10**5, 4, 10**2
dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
df = pd.DataFrame(np.random.random((N, M)), index=dates)
df.index.names = ['DATE']
df.columns = ['one','two','three','four']
# Now scale them differently so you can see the winsorization
df['four'] = df['four']*(10**5)
df['three'] = df['three']*(10**2)
df['two'] = df['two']*(10**-1)
df['one'] = df['one']*(10**-4)
# Create NaN
df.loc[df.index.get_level_values(0).year == 2002,'three'] = np.nan
df.loc[df.index.get_level_values(0).month == 2,'two'] = np.nan
df.loc[df.index.get_level_values(0).month == 1,'one'] = np.nan

This is the baseline distribution:

df.quantile([0, 0.01, 0.5, 0.99, 1])

Output:

               one           two      three          four
0.00  2.336618e-10  2.294259e-07   0.002437      2.305353
0.01  9.862626e-07  9.742568e-04   0.975807   1003.814520
0.50  4.975859e-05  4.981049e-02  50.290946  50374.548980
0.99  9.897463e-05  9.898590e-02  98.978263  98991.438985
1.00  9.999983e-05  9.999966e-02  99.996793  99999.437779

This is how I winsorize:

def using_mstats(s):
    return winsorize(s, limits=[0.01, 0.01])

wins = df.apply(using_mstats, axis=0)
wins.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])

This gives:

Out[356]:
           one       two      three          four
0.00  0.000001  0.001060   1.536882   1003.820149
0.01  0.000001  0.001060   1.536882   1003.820149
0.25  0.000025  0.024975  25.200378  25099.994780
0.50  0.000050  0.049810  50.290946  50374.548980
0.75  0.000075  0.074842  74.794537  75217.343920
0.99  0.000099  0.098986  98.978263  98991.436957
1.00  0.000100  0.100000  99.996793  98991.436957

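Column four is correct because it has no NaN, but the other columns are not: after winsorizing, the 99th percentile and the maximum should be identical, and for columns one, two and three they are not. The observation counts are the same before and after: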

In [357]: df.count()
Out[357]:
one       90700
two       91600
three     63500
four     100000
dtype: int64

In [358]: wins.count()
Out[358]:
one       90700
two       91600
three     63500
four     100000
dtype: int64
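The symptoms above (a lower bound that is too high and an upper tail that is not clipped at all) are what one would expect from a rank-based winsorizer that counts NaN as ordinary observations. The sketch below illustrates that mechanism on a toy array. `naive_rank_winsorize` is a hypothetical helper written for this illustration, not scipy's implementation; the only library behavior it relies on is that NumPy sorts NaN to the end.

```python
import numpy as np

def naive_rank_winsorize(values, lower=0.1, upper=0.1):
    """Hypothetical rank-based winsorizer that does NOT treat NaN specially."""
    a = np.asarray(values, dtype=float).copy()
    n = a.size                         # NaN entries are counted as observations
    order = np.argsort(a)              # NumPy places NaN at the end of the sort
    lo = int(lower * n)                # number of positions to raise at the bottom
    hi = n - int(upper * n)            # first position to cap at the top
    a[order[:lo]] = a[order[lo]]       # too many low values are raised
    a[order[hi:]] = a[order[hi - 1]]   # the "top" positions are NaN, so nothing real is capped
    return a

# 16 valid values (1..16) followed by 4 NaN
x = np.concatenate([np.arange(1.0, 17.0), np.full(4, np.nan)])
out = naive_rank_winsorize(x, lower=0.1, upper=0.1)

print(np.nanmin(out), np.nanmax(out), int(np.isnan(out).sum()))
```

With 16 valid values, a 10% limit should touch only one value per tail. Here two low values are raised (the count is taken from all 20 entries) and the maximum is left untouched, while the NaN count is unchanged, which mirrors the pattern in the output above.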

This is how I "solved" it, but at the cost of losing a lot of data:

wins2 = df.loc[df.notnull().all(axis=1)].apply(using_mstats, axis=0)
wins2.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])

Output:

Out[360]:
               one       two      three          four
0.00  9.686203e-07  0.000928   0.965702   1005.209503
0.01  9.686203e-07  0.000928   0.965702   1005.209503
0.25  2.486052e-05  0.024829  25.204032  25210.837443
0.50  4.980946e-05  0.049894  50.299004  50622.227179
0.75  7.492750e-05  0.075059  74.837900  75299.906415
0.99  9.895563e-05  0.099014  98.972310  99014.311761
1.00  9.895563e-05  0.099014  98.972310  99014.311761

In [361]: wins2.count()
Out[361]:
one      51700
two      51700
three    51700
four     51700
dtype: int64
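To see why this workaround is expensive, it can help to compare the number of non-NaN values per column with the number of rows that are complete in every column. The snippet below is a minimal sketch on a toy frame; the name `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: each column is missing a value in a different row
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, 2.0, 3.0, 4.0],
})

per_column = toy.count()                            # non-NaN observations per column
complete_rows = int(toy.notna().all(axis=1).sum())  # rows with no NaN in any column

print(per_column.to_dict(), complete_rows)
```

Each column has three usable values, but only two rows survive the all-columns filter, so every column loses data it did not need to lose.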

How can I winsorize the non-NaN data in each column while keeping the shape of the data (that is, without dropping rows)?

Best answer

As so often happens, simply building the MWE clarified the problem. I need to combine clip() with quantile(), like this:

df2 = df.clip(lower=df.quantile(0.01), upper=df.quantile(0.99), axis=1)
df2.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])

Output:

               one       two      three          four
0.00  9.862626e-07  0.000974   0.975807   1003.814520
0.01  9.862666e-07  0.000974   0.975816   1003.820092
0.25  2.485043e-05  0.024975  25.200378  25099.994780
0.50  4.975859e-05  0.049810  50.290946  50374.548980
0.75  7.486737e-05  0.074842  74.794537  75217.343920
0.99  9.897462e-05  0.098986  98.978245  98991.436977
1.00  9.897463e-05  0.098986  98.978263  98991.438985

In [384]: df2.count()
Out[384]:
one       90700
two       91600
three     63500
four     100000
dtype: int64

The numbers differ from those above because every non-missing (non-NaN) value in each column is kept.
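The clip() plus quantile() approach can be wrapped in a small reusable function. The sketch below assumes quantile-based bounds are what is wanted; `winsorize_columns` and the toy data are hypothetical names introduced here, not part of the original answer. It relies on `DataFrame.quantile` skipping NaN and on `DataFrame.clip` leaving NaN entries as NaN.

```python
import numpy as np
import pandas as pd

def winsorize_columns(frame, lower=0.01, upper=0.99):
    """Clip each column to its own [lower, upper] quantiles; NaN stays NaN."""
    lo = frame.quantile(lower)   # Series indexed by column name, NaN ignored
    hi = frame.quantile(upper)
    # axis=1 aligns the per-column bounds with the frame's columns
    return frame.clip(lower=lo, upper=hi, axis=1)

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 100.0, np.nan],
    "y": [np.nan, -50.0, 1.0, 2.0, 3.0, 4.0],
})
clipped = winsorize_columns(toy, lower=0.2, upper=0.8)

print(clipped)
```

The result has the same shape and the same NaN positions as the input, and each column's extremes are pulled in to that column's own quantiles. Note that this clips to interpolated quantile values rather than to the nearest observed value, so results can differ slightly from a rank-based winsorizer.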

This post on python - Winsorizing data by column with NaN is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/50612095/
