
python - Inconsistent output from the Pandas apply function when np.std is passed as the function argument

Reposted  Author: 太空狗  Updated: 2023-10-30 01:30:54

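I am using sklearn.preprocessing.StandardScaler to rescale my data. I would like to use np.std to do the same thing as StandardScaler.

However, I noticed something interesting: without passing any extra argument, pandas.apply(fun = np.std) returns the sample standard deviation in one call and the population standard deviation in another. (See section 2.)

I know there is a parameter called ddof that controls the divisor used when computing the variance. But I never changed the default ddof = 0, so how can I be getting different outputs?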
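
To make the role of ddof concrete, here is a minimal sketch that computes the standard deviation by hand with divisor N - ddof and compares it with np.std. The helper name std_with_ddof and the toy values are illustrative assumptions, not part of the original question.

```python
import numpy as np


def std_with_ddof(values, ddof=0):
    """Standard deviation with divisor N - ddof (hypothetical helper)."""
    arr = np.asarray(values, dtype=float)
    squared_dev = (arr - arr.mean()) ** 2
    return float(np.sqrt(squared_dev.sum() / (arr.size - ddof)))


data = [4, 5]  # toy values chosen for illustration
population = std_with_ddof(data, ddof=0)  # divisor N
sample = std_with_ddof(data, ddof=1)      # divisor N - 1

# Each line prints the hand-computed value next to NumPy's result.
print(population, np.std(data, ddof=0))
print(sample, np.std(data, ddof=1))
```

With ddof = 0 the divisor is N (population standard deviation); with ddof = 1 it is N - 1 (sample standard deviation).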

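1 Dataset:

First, I take the iris dataset as an example. I scale one column of the data (the column at index 1, i.e. the second column) as follows.

from sklearn import datasets
import numpy as np
from sklearn.preprocessing import StandardScaler
iris = datasets.load_iris()
X_train = iris.data[:,[1]] # X_train is the column at index 1 (the second column) of the iris data
sc = StandardScaler()
sc.fit(X_train) # fit StandardScaler to learn the column's mean and scale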
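
For comparison, the same kind of rescaling can be sketched with NumPy alone. With its default settings, StandardScaler subtracts each column's mean and divides by its standard deviation computed with divisor N, which is also what np.std does by default. The array X_demo below is a hypothetical stand-in for X_train, so the sketch runs without scikit-learn.

```python
import numpy as np

# Hypothetical toy column standing in for X_train (shape: n_samples x 1).
X_demo = np.array([[4.0], [5.0], [6.0], [9.0]])

col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0)  # np.std default is ddof=0, i.e. divisor N

# Center each column and scale it to unit (population) standard deviation.
X_scaled = (X_demo - col_mean) / col_std

print(X_scaled.mean(axis=0))  # close to 0 up to floating-point error
print(X_scaled.std(axis=0))   # close to 1 up to floating-point error
```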

2 Problem: without changing the default ddof = 0, I get different outputs from np.std!

import pandas as pd
import sklearn # needed for sklearn.__version__ below
import sys
print("The mean and std(sample std) of X_train is :")
print(pd.DataFrame(X_train).apply([np.mean,np.std],axis = 0),"\n")

print("The std(population std) of X_train is :")
print(pd.DataFrame(X_train).apply(np.std,axis = 0),"\n")

print("The std(population std) of X_train is :","{0:.6f}".format(sc.scale_[0]),'\n')

print("Python version:",sys.version,
"\npandas version:",pd.__version__,
"\nsklearn version:",sklearn.__version__)

Output:

The mean and std(sample std) of X_train is :
0
mean 3.057333
std 0.435866

The std(population std) of X_train is :
0 0.434411
dtype: float64

The std(population std) of X_train is : 0.434411

Python version: 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)]
pandas version: 0.23.4
sklearn version: 0.20.1

Judging from the results above, pd.DataFrame(X_train).apply([np.mean,np.std],axis = 0) gives the sample standard deviation 0.435866, whereas pd.DataFrame(X_train).apply(np.std,axis = 0) gives the population standard deviation 0.434411.
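
The two numbers are consistent with a pure change of divisor: a standard deviation with divisor N - 1 equals the one with divisor N multiplied by sqrt(N / (N - 1)). The sketch below checks this with the values printed above, assuming N = 150 rows (the usual size of the iris data set; the row count is not printed in the question).

```python
import math

n = 150                    # assumed number of rows in the iris data set
population_std = 0.434411  # printed by apply(np.std) and sc.scale_
sample_std = 0.435866      # printed by apply([np.mean, np.std])

# Convert the divisor-N value into the divisor-(N-1) value.
predicted_sample_std = population_std * math.sqrt(n / (n - 1))

print(predicted_sample_std, sample_std)
print(abs(predicted_sample_std - sample_std) < 1e-5)
```

If the check holds, the discrepancy comes only from which ddof was used, not from different data or a different formula.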

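3 My questions:

  1. Why does pandas.apply return different results?

  2. How can I pass an additional parameter to np.std so that it gives the population std?

pd.DataFrame(X_train).apply(np.std,ddof = 1) works. But I would like to know how to do this with pd.DataFrame(X_train).apply([np.mean,np.std],**args).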

Best answer

The reason for this behavior lies in the (arguably inelegant) way .apply() is evaluated on a Series. If you look at the source code, you will find the following lines:

if isinstance(func, (list, dict)):
    return self.aggregate(func, *args, **kwds)

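This means that calling apply([func]) can give a different result from apply(func)! With a list, the call is forwarded to aggregate, and in the pandas version shown in the question the result matches pandas' own std (default ddof=1) rather than np.std (default ddof=0). As for np.std, I suggest using the built-in df.std() method or df.describe() instead.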
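
A minimal sketch of that suggestion, assuming a small toy frame (df_demo and summary are illustrative names): calling the built-in methods directly lets ddof be stated explicitly, so the mean and the population std can be put in one table without going through apply([...]).

```python
import pandas as pd

# Hypothetical toy frame used only for illustration.
df_demo = pd.DataFrame({"A": [4, 5], "B": [5, 6]})

# Build the summary from the built-in methods; ddof is passed explicitly.
summary = pd.DataFrame({
    "mean": df_demo.mean(),
    "std": df_demo.std(ddof=0),  # population std (divisor N)
}).T  # transpose so that rows are statistics and columns are A, B

print(summary)
```

Changing ddof=0 to ddof=1 in the same call gives the sample standard deviation, so the choice is visible in the code instead of depending on how apply dispatches the function.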

You can try the following code to see what works and what does not:

import numpy as np
import pandas as pd

print(10*"-","Showing ddof impact",10*"-")

print(np.std([4,5], ddof=0)) # 0.5 ## divisor N (population standard deviation)
print(np.std([4,5], ddof=1)) # 0.707... # divisor N-1 (sample standard deviation)

x = pd.Series([4,5])

print(10*"-","calling builtin .std() on Series",10*"-")
print(x.std(ddof=0)) # 0.5
print(x.std()) # 0.707

df=pd.DataFrame([[4,5],[5,6]], columns=['A', 'B'])

print(10*"-","calling builtin .std() on DF",10*"-")

print(df["A"].std(ddof=0))# 0.5
print(df["B"].std(ddof=0))# 0.5
print(df["A"].std())# 0.707
print(df["B"].std())# 0.707

print(10*"-","applying np.std to whole DF",10*"-")
print(df.apply(np.std,ddof=0)) # A = 0.5, B = 0.5
print(df.apply(np.std,ddof=1)) # A = 0.707 B = 0.707

# print(10*"-","applying [np.std] to whole DF WONT work",10*"-")
# print(df.apply([np.std],axis=0,ddof=0)) ## this WONT Work
# print(df.apply([np.std],axis=0,ddof=1)) ## this WONT Work

print(10*"-","applying [np.std] to DF columns",10*"-")
print(df["A"].apply([np.std])) # 0.707
print(df["A"].apply([np.std],ddof=1)) # 0.707

print(10*"-","applying np.std to DF columns",10*"-")
print(df["A"].apply(np.std)) # 0: 0 1: 0 WHOOPS !! #<---------------------
print(30*"-")

You can also see what is happening by applying a function of your own:

def myFun(a):
    print(type(a))
    return np.std(a,ddof=0)

print("> 0",20*"-")
print(x.apply(myFun))
print("> 1",20*"-","## <- only this will be applied to the Series!")
print(df.apply(myFun))
print("> 2",20*"-","## <- this will be applied to each Int!")
print(df.apply([myFun]))
print("> 3",20*"-")
print(df["A"].apply(myFun))
print("> 4",20*"-")
print(df["A"].apply([myFun]))

Regarding python - Inconsistent output from the Pandas apply function when np.std is passed as the function argument, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55675472/
