python - Scikit-Learn: Std.Error, p-Value from LinearRegression

Reposted by: 行者123. Updated: 2023-12-04 11:58:09

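I have been trying to get standard errors and p-values from LinearRegression in scikit-learn, without success.

I eventually found this article, but its standard errors and p-values do not match those from the statsmodels.api OLS method.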

import numpy as np 
from sklearn import datasets
from sklearn import linear_model
import regressor
import statsmodels.api as sm 


# NOTE: load_boston was removed in scikit-learn 1.2, so this line needs an older release
boston = datasets.load_boston()
which_betas = np.ones(13, dtype=bool)
which_betas[3] = False
X = boston.data[:,which_betas]
y = boston.target

#scikit + regressor stats
ols = linear_model.LinearRegression()
ols.fit(X,y)

xlables = boston.feature_names[which_betas]
regressor.summary(ols, X, y, xlables)


# statsmodel
x2 = sm.add_constant(X)
models = sm.OLS(y,x2)
result = models.fit()
print(result.summary())

The output is as follows (the regressor summary first, then the statsmodels summary):

Residuals:
Min      1Q  Median      3Q      Max
-26.3743 -1.9207  0.6648  2.8112  13.3794


Coefficients:
             Estimate  Std. Error  t value   p value
_intercept  36.925033    4.915647   7.5117  0.000000
CRIM        -0.112227    0.031583  -3.5534  0.000416
ZN           0.047025    0.010705   4.3927  0.000014
INDUS        0.040644    0.055844   0.7278  0.467065
NOX        -17.396989    3.591927  -4.8434  0.000002
RM           3.845179    0.272990  14.0854  0.000000
AGE          0.002847    0.009629   0.2957  0.767610
DIS         -1.485557    0.180530  -8.2289  0.000000
RAD          0.327895    0.061569   5.3257  0.000000
TAX         -0.013751    0.001055 -13.0395  0.000000
PTRATIO     -0.991733    0.088994 -11.1438  0.000000
B            0.009827    0.001126   8.7256  0.000000
LSTAT       -0.534914    0.042128 -12.6973  0.000000
---
R-squared:  0.73547,    Adjusted R-squared:  0.72904
F-statistic: 114.23 on 12 features
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.735
Model:                            OLS   Adj. R-squared:                  0.729
Method:                 Least Squares   F-statistic:                     114.2
Date:                Sun, 21 Aug 2016   Prob (F-statistic):          7.59e-134
Time:                        21:56:26   Log-Likelihood:                -1503.8
No. Observations:                 506   AIC:                             3034.
Df Residuals:                     493   BIC:                             3089.
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         36.9250      5.148      7.173      0.000        26.811    47.039
x1            -0.1122      0.033     -3.405      0.001        -0.177    -0.047
x2             0.0470      0.014      3.396      0.001         0.020     0.074
x3             0.0406      0.062      0.659      0.510        -0.081     0.162
x4           -17.3970      3.852     -4.516      0.000       -24.966    -9.828
x5             3.8452      0.421      9.123      0.000         3.017     4.673
x6             0.0028      0.013      0.214      0.831        -0.023     0.029
x7            -1.4856      0.201     -7.383      0.000        -1.881    -1.090
x8             0.3279      0.067      4.928      0.000         0.197     0.459
x9            -0.0138      0.004     -3.651      0.000        -0.021    -0.006
x10           -0.9917      0.131     -7.547      0.000        -1.250    -0.734
x11            0.0098      0.003      3.635      0.000         0.005     0.015
x12           -0.5349      0.051    -10.479      0.000        -0.635    -0.435
==============================================================================
Omnibus:                      190.837   Durbin-Watson:                   1.015
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              897.143
Skew:                           1.619   Prob(JB):                    1.54e-195
Kurtosis:                       8.663   Cond. No.                     1.51e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

I also found the following articles:

Find p-value (significance) in scikit-learn LinearRegression

http://connor-johnson.com/2014/02/18/linear-regression-with-python/

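Neither of the two code snippets in the SO link runs.

Here is the code and data I am working with, but I cannot work out how to get the standard errors and p-values.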

import pandas as pd
import statsmodels.api as sm
import numpy as np
import scipy.linalg
from sklearn.linear_model import LinearRegression
from sklearn import metrics 


def readFile(filename, sheetname):
    xlsx = pd.ExcelFile(filename)
    data = xlsx.parse(sheetname, skiprows=1)
    return data


def lr_statsmodel(X,y):
    X = sm.add_constant(X)
    model = sm.OLS(y,X)
    results = model.fit()
    print (results.summary())


def lr_scikit(X,y,featureCols):
    model = LinearRegression()
    results = model.fit(X,y)

    predictions =  results.predict(X)

    print('Coefficients')
    print('Intercept\t', results.intercept_)
    df = pd.DataFrame(list(zip(featureCols, results.coef_)))
    print(df.to_string(index=False, header=False))

    # Query: the numbers match Excel's OLS, but I am skeptical about treating score() as R-squared
    rSquare = results.score(X,y)
    print('\nR-Square::', rSquare)

    # This looks like a better option
    # source: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score
    r2 = metrics.r2_score(y,results.predict(X))
    print('r2', r2)

    # Query: No clue at all! http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
    print('Rsquared?!', metrics.explained_variance_score(y, results.predict(X)))
    # INFO: all three of them give the same figures!


    # Adj-Rsquare formula @ https://www.easycalculation.com/statistics/learn-adjustedr2.php
    # In ML, we don't use all of the data for training, so it's highly unusual to find AdjRsquared. Hence the manual calculation
    N = X.shape[0]
    p = X.shape[1]
    adjRsquare = 1 - ((1 -  rSquare ) * (N - 1) / (N - p - 1))
    print("Adjusted R-Square::", adjRsquare)

    # calculate standard errors
    # mean_absolute_error
    # mean_squared_error
    # median_absolute_error
    # r2_score
    # explained_variance_score
    mse = metrics.mean_squared_error(y,results.predict(X))
    print(mse)
    print('Residual Standard Error:', np.sqrt(mse))

    # OLS in Matrix : https://github.com/nsh87/regressors/blob/master/regressors/stats.py
    n = X.shape[0]
    X1 = np.hstack((np.ones((n, 1)), np.matrix(X)))
    se_matrix = scipy.linalg.sqrtm(
        metrics.mean_squared_error(y, results.predict(X)) *
        np.linalg.inv(X1.T * X1)
    )
    print('se', np.diagonal(se_matrix))

#    https://github.com/nsh87/regressors/blob/master/regressors/stats.py
#    http://regressors.readthedocs.io/en/latest/usage.html

    y_hat = results.predict(X)
    sse = np.sum((y_hat - y) ** 2)
    print('Sum of Squared Errors of the Model:', sse)




if __name__ == '__main__':

    # read file 
    fileData = readFile('Linear_regression.xlsx','Input Data')

    # list of independent variables 
    feature_cols = ['Price per week','Population of city','Monthly income of riders','Average parking rates per month']

    # build dependent & independent data set 
    X = fileData[feature_cols]
    y = fileData['Number of weekly riders']

    # Statsmodel - OLS 
#    lr_statsmodel(X,y)

    # ScikitLearn - OLS 
    lr_scikit(X,y,feature_cols)
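
A note on the `se` computation above, offered as a possible explanation rather than a confirmed diagnosis: it differs from the textbook OLS formula in two ways. `mean_squared_error` divides the squared residuals by n rather than by the residual degrees of freedom, and the diagonal of a matrix square root (`sqrtm`) is generally not the element-wise square root of the diagonal. The sketch below uses made-up synthetic data (not the Boston or ridership data) to show that the two computations give different numbers; all variable names are illustrative.

```python
import numpy as np
from scipy.linalg import sqrtm

# Synthetic design matrix: intercept column plus two correlated predictors.
# The data is invented purely for illustration.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=n)
X1 = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Ordinary least squares fit and residual sum of squares
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
residuals = y - X1 @ beta
sse = residuals @ residuals
k = X1.shape[1]  # number of estimated parameters, intercept included

xtx_inv = np.linalg.inv(X1.T @ X1)

# Approach from the question: MSE divides by n, then a matrix square root
mse_n = sse / n
se_sqrtm = np.real(np.diagonal(sqrtm(mse_n * xtx_inv)))

# Textbook OLS: residual variance divides by n - k, then element-wise sqrt
sigma2 = sse / (n - k)
se_classic = np.sqrt(np.diagonal(sigma2 * xtx_inv))

print("diag(sqrtm(...)):", se_sqrtm)
print("sqrt(diag(...)): ", se_classic)
```

With correlated predictors the first line comes out smaller than the second in every position, which is the same direction as the gap between the two summaries shown earlier.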

My dataset:

      Y                        X1              X2                  X3                        X4
City  Number of weekly riders  Price per week  Population of city  Monthly income of riders  Average parking rates per month
1     1,92,000                 $15             18,00,000           $5,800                    $50
2     1,90,400                 $15             17,90,000           $6,200                    $50
3     1,91,200                 $15             17,80,000           $6,400                    $60
4     1,77,600                 $25             17,78,000           $6,500                    $60
5     1,76,800                 $25             17,50,000           $6,550                    $60
6     1,78,400                 $25             17,40,000           $6,580                    $70
7     1,80,800                 $25             17,25,000           $8,200                    $75
8     1,75,200                 $30             17,25,000           $8,600                    $75
9     1,74,400                 $30             17,20,000           $8,800                    $75
10    1,73,920                 $30             17,05,000           $9,200                    $80
11    1,72,800                 $30             17,10,000           $9,630                    $80
12    1,63,200                 $40             17,00,000           $10,570                   $80
13    1,61,600                 $40             16,95,000           $11,330                   $85
14    1,61,600                 $40             16,95,000           $11,600                   $100
15    1,60,800                 $40             16,90,000           $11,800                   $105
16    1,59,200                 $40             16,30,000           $11,830                   $105
17    1,48,800                 $65             16,40,000           $12,650                   $105
18    1,15,696                 $102            16,35,000           $13,000                   $110
19    1,47,200                 $75             16,30,000           $13,224                   $125
20    1,50,400                 $75             16,20,000           $13,766                   $130
21    1,52,000                 $75             16,15,000           $14,010                   $150
22    1,36,000                 $80             16,05,000           $14,468                   $155
23    1,26,240                 $86             15,90,000           $15,000                   $165
24    1,23,888                 $98             15,95,000           $15,200                   $175
25    1,26,080                 $87             15,90,000           $15,600                   $175
26    1,51,680                 $77             16,00,000           $16,000                   $190
27    1,52,800                 $63             16,10,000           $16,200                   $200

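I have exhausted all my options and everything I could make sense of. Any guidance on how to compute standard errors and p-values that match statsmodels.api would be appreciated.

EDIT: I am trying to find the standard errors and p-values for the intercept and all the independent variables.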

Best answer

Here, reg is the output of the fit method of sklearn's LinearRegression.
To compute the adjusted R-squared:

def adjustedR2(x, y, reg):
  r2 = reg.score(x,y)
  n = x.shape[0]
  p = x.shape[1]
  adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
  return adjusted_r2
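
As a quick usage sketch for the function above, assuming `x` is a 2-D array with one column per feature: the synthetic data, the seed, and the coefficient values below are invented for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def adjustedR2(x, y, reg):
    """Adjusted R-squared for a fitted regressor (same formula as above)."""
    r2 = reg.score(x, y)
    n = x.shape[0]  # number of observations
    p = x.shape[1]  # number of predictors, intercept not counted
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Made-up data: 100 observations, 3 predictors, the third has no real effect
rng = np.random.default_rng(42)
x = rng.normal(size=(100, 3))
y = 2.0 + x @ np.array([1.5, -0.5, 0.0]) + rng.normal(scale=0.5, size=100)

reg = LinearRegression().fit(x, y)
r2 = reg.score(x, y)
adj = adjustedR2(x, y, reg)

print("R-squared:         ", r2)
print("Adjusted R-squared:", adj)
```

The adjusted value is always below the plain R-squared whenever the fit is not perfect, because the factor (n-1)/(n-p-1) is greater than 1.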

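For the p-values:

from sklearn.feature_selection import f_regression

# f_regression runs a separate univariate F-test for each feature
# and returns a tuple (F-statistics, p-values)
freg = f_regression(x, y)

# p-values, one per feature (the intercept is not included)
p = freg[1]

print(p.round(3))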
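
Because `f_regression` tests each feature on its own, its p-values will generally not equal the per-coefficient p-values in a statsmodels OLS summary, and it reports nothing for the intercept. A minimal sketch of the classical OLS formulas applied to a fitted `LinearRegression` follows. The helper name `ols_inference` and the synthetic data are assumptions made for illustration; the sketch also assumes `fit_intercept=True`, a full-rank design matrix, and the usual homoscedastic-error assumptions.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression


def ols_inference(reg, X, y):
    """Return (params, std errors, t-values, p-values), intercept first.

    Hypothetical helper, not part of scikit-learn. Assumes reg was fitted
    on X, y with fit_intercept=True and that X has full column rank.
    """
    residuals = np.asarray(y, dtype=float) - reg.predict(X)
    X_arr = np.asarray(X, dtype=float)
    n, p = X_arr.shape
    X1 = np.column_stack([np.ones(n), X_arr])  # design matrix with intercept
    params = np.concatenate([[reg.intercept_], reg.coef_])

    dof = n - p - 1                          # residual degrees of freedom
    sigma2 = residuals @ residuals / dof     # unbiased residual variance
    cov = sigma2 * np.linalg.inv(X1.T @ X1)  # covariance of the estimates
    se = np.sqrt(np.diagonal(cov))           # element-wise sqrt of the diagonal
    t_values = params / se
    p_values = 2.0 * stats.t.sf(np.abs(t_values), dof)  # two-sided t-test
    return params, se, t_values, p_values


# Made-up data: the second predictor has a true coefficient of zero
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(size=200)

reg = LinearRegression().fit(X, y)
params, se, t_values, p_values = ols_inference(reg, X, y)

names = ["intercept", "x1", "x2", "x3"]
for name, b, s, t, pv in zip(names, params, se, t_values, p_values):
    print(f"{name:<10} {b:>10.4f} {s:>10.4f} {t:>10.3f} {pv:>10.4f}")
```

The residual variance here divides by n - p - 1 and the square root is taken element-wise on the diagonal, which is the convention used by the nonrobust covariance type in the statsmodels summary shown in the question.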

Regarding python - Scikit-Learn: Std.Error, p-Value from LinearRegression, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39066567/
