gpt4 book ai didi

python - Scikit-Learn:Std.Error,来自 LinearRegression 的 p 值

转载 作者:行者123 更新时间:2023-12-04 11:58:09 26 4
gpt4 key购买 nike

我一直在尝试使用 scikit-learn 中的 LR 来获得标准误差和 p 值。但没有成功。

我最终找到了这个 article :但 std 错误和 p 值与来自 statsmodel.api OLS 方法的不匹配

import numpy as np 
from sklearn import datasets
from sklearn import linear_model
import regressor
import statsmodels.api as sm


boston = datasets.load_boston()
which_betas = np.ones(13, dtype=bool)
which_betas[3] = False
X = boston.data[:,which_betas]
y = boston.target

#scikit + regressor stats
ols = linear_model.LinearRegression()
ols.fit(X,y)

xlables = boston.feature_names[which_betas]
regressor.summary(ols, X, y, xlables)


# statsmodel
x2 = sm.add_constant(X)
models = sm.OLS(y,x2)
result = models.fit()
print result.summary()

输出如下:
Residuals:
Min 1Q Median 3Q Max
-26.3743 -1.9207 0.6648 2.8112 13.3794


Coefficients:
Estimate Std. Error t value p value
_intercept 36.925033 4.915647 7.5117 0.000000
CRIM -0.112227 0.031583 -3.5534 0.000416
ZN 0.047025 0.010705 4.3927 0.000014
INDUS 0.040644 0.055844 0.7278 0.467065
NOX -17.396989 3.591927 -4.8434 0.000002
RM 3.845179 0.272990 14.0854 0.000000
AGE 0.002847 0.009629 0.2957 0.767610
DIS -1.485557 0.180530 -8.2289 0.000000
RAD 0.327895 0.061569 5.3257 0.000000
TAX -0.013751 0.001055 -13.0395 0.000000
PTRATIO -0.991733 0.088994 -11.1438 0.000000
B 0.009827 0.001126 8.7256 0.000000
LSTAT -0.534914 0.042128 -12.6973 0.000000
---
R-squared: 0.73547, Adjusted R-squared: 0.72904
F-statistic: 114.23 on 12 features
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.735
Model: OLS Adj. R-squared: 0.729
Method: Least Squares F-statistic: 114.2
Date: Sun, 21 Aug 2016 Prob (F-statistic): 7.59e-134
Time: 21:56:26 Log-Likelihood: -1503.8
No. Observations: 506 AIC: 3034.
Df Residuals: 493 BIC: 3089.
Df Model: 12
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 36.9250 5.148 7.173 0.000 26.811 47.039
x1 -0.1122 0.033 -3.405 0.001 -0.177 -0.047
x2 0.0470 0.014 3.396 0.001 0.020 0.074
x3 0.0406 0.062 0.659 0.510 -0.081 0.162
x4 -17.3970 3.852 -4.516 0.000 -24.966 -9.828
x5 3.8452 0.421 9.123 0.000 3.017 4.673
x6 0.0028 0.013 0.214 0.831 -0.023 0.029
x7 -1.4856 0.201 -7.383 0.000 -1.881 -1.090
x8 0.3279 0.067 4.928 0.000 0.197 0.459
x9 -0.0138 0.004 -3.651 0.000 -0.021 -0.006
x10 -0.9917 0.131 -7.547 0.000 -1.250 -0.734
x11 0.0098 0.003 3.635 0.000 0.005 0.015
x12 -0.5349 0.051 -10.479 0.000 -0.635 -0.435
==============================================================================
Omnibus: 190.837 Durbin-Watson: 1.015
Prob(Omnibus): 0.000 Jarque-Bera (JB): 897.143
Skew: 1.619 Prob(JB): 1.54e-195
Kurtosis: 8.663 Cond. No. 1.51e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

我还发现了以下文章
  • Find p-value (significance) in scikit-learn LinearRegression
  • http://connor-johnson.com/2014/02/18/linear-regression-with-python/

  • SO 链接中的两个代码都无法编译

    这是我正在处理的代码和数据 - 但无法找到标准错误和 p 值
    import pandas as pd
    import statsmodels.api as sm
    import numpy as np
    import scipy
    from sklearn.linear_model import LinearRegression
    from sklearn import metrics


    def readFile(filename, sheetname):
    xlsx = pd.ExcelFile(filename)
    data = xlsx.parse(sheetname, skiprows=1)
    return data


    def lr_statsmodel(X,y):
    X = sm.add_constant(X)
    model = sm.OLS(y,X)
    results = model.fit()
    print (results.summary())


    def lr_scikit(X,y,featureCols):
    model = LinearRegression()
    results = model.fit(X,y)

    predictions = results.predict(X)

    print 'Coefficients'
    print 'Intercept\t' , results.intercept_
    df = pd.DataFrame(zip(featureCols, results.coef_))
    print df.to_string(index=False, header=False)

    # Query:: The numbers matches with Excel OLS but skeptical about relating score as rsquared
    rSquare = results.score(X,y)
    print '\nR-Square::', rSquare

    # This looks like a better option
    # source: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score
    r2 = metrics.r2_score(y,results.predict(X))
    print 'r2', r2

    # Query: No clue at all! http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
    print 'Rsquared?!' , metrics.explained_variance_score(y, results.predict(X))
    # INFO:: All three of them are providing the same figures!


    # Adj-Rsquare formula @ https://www.easycalculation.com/statistics/learn-adjustedr2.php
    # In ML, we don't use all of the data for training, and hence its highly unusual to find AdjRsquared. Thus the need for manual calculation
    N = X.shape[0]
    p = X.shape[1]
    adjRsquare = 1 - ((1 - rSquare ) * (N - 1) / (N - p - 1))
    print "Adjusted R-Square::", adjRsquare

    # calculate standard errors
    # mean_absolute_error
    # mean_squared_error
    # median_absolute_error
    # r2_score
    # explained_variance_score
    mse = metrics.mean_squared_error(y,results.predict(X))
    print mse
    print 'Residual Standard Error:', np.sqrt(mse)

    # OLS in Matrix : https://github.com/nsh87/regressors/blob/master/regressors/stats.py
    n = X.shape[0]
    X1 = np.hstack((np.ones((n, 1)), np.matrix(X)))
    se_matrix = scipy.linalg.sqrtm(
    metrics.mean_squared_error(y, results.predict(X)) *
    np.linalg.inv(X1.T * X1)
    )
    print 'se',np.diagonal(se_matrix)

    # https://github.com/nsh87/regressors/blob/master/regressors/stats.py
    # http://regressors.readthedocs.io/en/latest/usage.html

    y_hat = results.predict(X)
    sse = np.sum((y_hat - y) ** 2)
    print 'Standard Square Error of the Model:', sse




    if __name__ == '__main__':

    # read file
    fileData = readFile('Linear_regression.xlsx','Input Data')

    # list of independent variables
    feature_cols = ['Price per week','Population of city','Monthly income of riders','Average parking rates per month']

    # build dependent & independent data set
    X = fileData[feature_cols]
    y = fileData['Number of weekly riders']

    # Statsmodel - OLS
    # lr_statsmodel(X,y)

    # ScikitLearn - OLS
    lr_scikit(X,y,feature_cols)

    我的数据集
    Y   X1  X2  X3  X4
    City Number of weekly riders Price per week Population of city Monthly income of riders Average parking rates per month
    1 1,92,000 $15 18,00,000 $5,800 $50
    2 1,90,400 $15 17,90,000 $6,200 $50
    3 1,91,200 $15 17,80,000 $6,400 $60
    4 1,77,600 $25 17,78,000 $6,500 $60
    5 1,76,800 $25 17,50,000 $6,550 $60
    6 1,78,400 $25 17,40,000 $6,580 $70
    7 1,80,800 $25 17,25,000 $8,200 $75
    8 1,75,200 $30 17,25,000 $8,600 $75
    9 1,74,400 $30 17,20,000 $8,800 $75
    10 1,73,920 $30 17,05,000 $9,200 $80
    11 1,72,800 $30 17,10,000 $9,630 $80
    12 1,63,200 $40 17,00,000 $10,570 $80
    13 1,61,600 $40 16,95,000 $11,330 $85
    14 1,61,600 $40 16,95,000 $11,600 $100
    15 1,60,800 $40 16,90,000 $11,800 $105
    16 1,59,200 $40 16,30,000 $11,830 $105
    17 1,48,800 $65 16,40,000 $12,650 $105
    18 1,15,696 $102 16,35,000 $13,000 $110
    19 1,47,200 $75 16,30,000 $13,224 $125
    20 1,50,400 $75 16,20,000 $13,766 $130
    21 1,52,000 $75 16,15,000 $14,010 $150
    22 1,36,000 $80 16,05,000 $14,468 $155
    23 1,26,240 $86 15,90,000 $15,000 $165
    24 1,23,888 $98 15,95,000 $15,200 $175
    25 1,26,080 $87 15,90,000 $15,600 $175
    26 1,51,680 $77 16,00,000 $16,000 $190
    27 1,52,800 $63 16,10,000 $16,200 $200

    我已经用尽了我所有的选择和任何我能理解的东西。因此,任何关于如何计算与 statsmodel.api 相同的标准误差和 p 值的指导都值得赞赏。

    编辑:我试图找到截距和所有自变量的标准错误和 p 值

    最佳答案

    这是 reg 是 sklearn 的 lin 回归拟合方法的输出
    计算调整后的 r2

    def adjustedR2(x,y reg):
    r2 = reg.score(x,y)
    n = x.shape[0]
    p = x.shape[1]
    adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
    return adjusted_r2
    对于 p 值
    from sklearn.feature_selection import f_regression

    freg=f_regression(x,y)

    p=freg[1]

    print(p.round(3))

    关于python - Scikit-Learn:Std.Error,来自 LinearRegression 的 p 值,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/39066567/

    26 4 0
    Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
    广告合作:1813099741@qq.com 6ren.com