python-2.7 - Detecting multicollinear columns, or columns that are linear combinations, while modeling in Python: LinAlgError

Reposted by: 行者123 Updated: 2023-12-04 08:08:17

I am modeling data for a logit model with 34 explanatory (independent) variables, and it keeps throwing the singular matrix error shown below:

Traceback (most recent call last):
  File "<pyshell#1116>", line 1, in <module>
    test_scores  = smf.Logit(m['event'], train_cols,missing='drop').fit()
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 1186, in fit
    disp=disp, callback=callback, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 164, in fit
    disp=disp, callback=callback, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 357, in fit
    hess=hess)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 405, in _fit_mle_newton
    newparams = oldparams - np.dot(np.linalg.inv(H),
  File "/usr/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 445, in inv
    return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
  File "/usr/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 328, in solve
    raise LinAlgError, 'Singular matrix'
LinAlgError: Singular matrix

That is when I came across this method for reducing a matrix to its independent columns:

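import numpy as np
from scipy import linalg


def independent_columns(A, tol = 0):#1e-05):
    """
    Return an array composed of independent columns of A, together with
    the indices of those columns.

    Note the answer may not be unique; this function returns one of many
    possible answers.

    https://stackoverflow.com/q/13312498/190597 (user1812712)
    http://math.stackexchange.com/a/199132/1140 (Gerry Myerson)
    http://mail.scipy.org/pipermail/numpy-discussion/2008-November/038705.html
        (Anne Archibald)

    Example matrix: column 2 is 2 * column 1, and column 4 is
    column 1 + column 3, so only two columns are independent.

    >>> A = np.array([(2,4,1,3),(-1,-2,1,0),(0,0,2,2),(3,6,2,5)])
    >>> cols, idx = independent_columns(A, tol=1e-05)
    # try checking cols.shape[1] against np.linalg.matrix_rank(A)
    """
    # QR decomposition: a (near-)zero entry on the diagonal of R flags a
    # column that depends on the columns before it.
    Q, R = linalg.qr(A)
    # With tol = 0, rounding error can leave a tiny nonzero diagonal entry,
    # in which case the dependent column is not dropped.
    independent = np.where(np.abs(R.diagonal()) > tol)[0]
    #print independent
    return A[:, independent], independent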


A,independent_col_indexes=independent_columns(train_cols.as_matrix(columns=None)) 
#train_cols will not be converted back from a df to a  matrix object,so doing this explicitly
A2=pd.DataFrame(A, columns=train_cols.columns[independent_col_indexes])

test_scores = smf.Logit(m['event'],A2,missing='drop').fit()

I still get the LinAlgError, although I was hoping that the matrix rank would now be reduced.

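Also, I see that np.linalg.matrix_rank(train_cols) returns 33 (before calling the independent_columns function the total number of "x" columns was 34, i.e. len(train_cols.ix[0])=34, which means I do not have a full-rank matrix), while np.linalg.matrix_rank(A2) returns 33 (which I took to mean that I had dropped a column). Yet I still see the LinAlgError when I run test_scores = smf.Logit(m['event'],A2,missing='drop').fit(). What am I missing?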

Reference for the code above:
How to find degenerate rows/columns in a covariance matrix

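I tried building the model up by introducing one variable at a time, which does not give me the singular matrix error, but I would rather have a deterministic method that tells me what I am doing wrong and how to eliminate these columns.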

Edit (updated after @user333700's answer below)

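1. You are right, "A2" does not have the reduced rank of 33, i.e. len(A2.ix[0]) = 34, which means the possibly collinear columns were not dropped. Should I increase "tol", the tolerance, so that the rank of A2 (and its number of columns) becomes 33? If I change tol above to "1e-05", then I do get len(A2.ix[0]) = 33, which suggests that tol > 0 (strictly) is one indicator.
After this I just did the same thing, test_scores = smf.Logit(m['event'],A2,missing='drop').fit(), without method='nm', to get convergence.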

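2. I get errors after trying the 'nm' method. Strangely, though, if I take just 20,000 rows I do get results. Since it does not report a memory error but rather "Inverting hessian failed, no bse or cov_params available", I am assuming there are multiple nearly identical records. What would you say?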

m  = smf.Logit(data['event_custom'].ix[0:1000000] , train_cols.ix[0:1000000],missing='drop')
test_scores=m.fit(start_params=None,method='nm',maxiter=200,full_output=1)
Warning: Maximum number of iterations has been exceeded

Warning (from warnings module):
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 374
    warn(warndoc, Warning)
Warning: Inverting hessian failed, no bse or cov_params available


test_scores.summary()

Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    test_scores.summary()
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 2396, in summary
    yname_list)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 2253, in summary
    use_t=False)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/iolib/summary.py", line 826, in add_table_params
    use_t=use_t)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/iolib/summary.py", line 447, in summary_params
    std_err = results.bse
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/tools/decorators.py", line 95, in __get__
    _cachedval = self.fget(obj)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 1037, in bse
    return np.sqrt(np.diag(self.cov_params()))
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 1102, in cov_params
    raise ValueError('need covariance of parameters for computing '
ValueError: need covariance of parameters for computing (unnormalized) covariances

Edit 2: (updated after @user333700's suggestions below)

Reiterating what I am trying to model - less than about 1% of total users "convert" (success outcomes) - so I took a balanced sample of 35(+ve) /65 (-ve)

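I suspect that the model is not robust, even though it converges. So I will use "start_params" set to the parameters from an earlier iteration on a different dataset.
This edit is about confirming whether "start_params" can be fed from an earlier result, as shown below: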

A,independent_col_indexes=independent_columns(train_cols.as_matrix(columns=None))
A2=pd.DataFrame(A, columns=train_cols.columns[independent_col_indexes])
m  = smf.Logit(data['event_custom'], A2,missing='drop')
#m  = smf.Logit(data['event_custom'], train_cols,missing='drop')#,method='nm').fit()#This doesnt work, so tried 'nm' which work, but used lasso, as nm did not converge.
test_scores=m.fit_regularized(start_params=None, method='l1', maxiter='defined_by_method', full_output=1, disp=1, callback=None, alpha=0, \
trim_mode='auto', auto_trim_tol=0.01, size_trim_tol=0.0001, qc_tol=0.03)

a_good_looking_previous_result.params=test_scores.params #storing the parameters of pass1 to feed into pass2

test_scores.params
bidfloor_Quartile_modified_binned_0               0.305765
connectiontype_binned_0                          -0.436798
day_custom_binned_Fri                            -0.040269
day_custom_binned_Mon                             0.138599
day_custom_binned_Sat                            -0.319997
day_custom_binned_Sun                            -0.236507
day_custom_binned_Thu                            -0.058922
user_agent_device_family_binned_iPad            -10.793270
user_agent_device_family_binned_iPhone           -8.483099
user_agent_masterclass_binned_apple               9.038889
user_agent_masterclass_binned_generic            -0.760297
user_agent_masterclass_binned_samsung            -0.063522
log_height_width                                  0.593199
log_height_width_ScreenResolution                -0.520836
productivity                                     -1.495373
games                                             0.706340
entertainment                                    -1.806886
IAB24                                             2.531467
IAB17                                             0.650327
IAB14                                             0.414031
utilities                                         9.968253
IAB1                                              1.850786
social_networking                                -2.814148
IAB3                                             -9.230780
music                                             0.019584
IAB9                                             -0.415559
C(time_day_modified)[(6, 12]]:C(country)[AUS]    -0.103003
C(time_day_modified)[(0, 6]]:C(country)[HKG]      0.769272
C(time_day_modified)[(6, 12]]:C(country)[HKG]     0.406882
C(time_day_modified)[(0, 6]]:C(country)[IDN]      0.073306
C(time_day_modified)[(6, 12]]:C(country)[IDN]    -0.207568
C(time_day_modified)[(0, 6]]:C(country)[IND]      0.033370
... more params here

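Now on a different dataset (pass2, for indexing), I model the same way as below,
i.e. I read a new dataframe, do all the variable transformations, and then model via Logit as before.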

m_pass2  = smf.Logit(data['event_custom'], A2_pass2,missing='drop')
test_scores_pass2=m_pass2.fit_regularized(start_params=a_good_looking_previous_result.params, method='l1', maxiter='defined_by_method', full_output=1, disp=1, callback=None, alpha=0, \
trim_mode='auto', auto_trim_tol=0.01, size_trim_tol=0.0001, qc_tol=0.03)

And, possibly, keep iterating by picking up "start_params" from the earlier passes.

Best answer

A few points on this:

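You need tol > 0 to detect near-perfect collinearity, which might also cause numerical problems in later calculations.
Check the number of columns of A2 to see whether a column has really been dropped.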
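
As an illustration of the two points above, here is a minimal sketch that uses a strictly positive tolerance and then checks the resulting column count. It swaps in column-pivoted QR from scipy.linalg.qr, which is not what the question's function does; the helper name independent_columns_pivoted and the relative tolerance rtol=1e-10 are assumptions chosen for the example.

```python
import numpy as np
from scipy import linalg


def independent_columns_pivoted(A, rtol=1e-10):
    """Return (sorted indices of columns to keep, estimated rank).

    Hypothetical helper: uses column-pivoted QR, so |diag(R)| is
    non-increasing and a relative cutoff separates the dependent columns.
    """
    A = np.asarray(A, dtype=float)
    R, piv = linalg.qr(A, mode="r", pivoting=True)
    diag = np.abs(np.diag(R))
    if diag.size == 0 or diag[0] == 0:
        return np.array([], dtype=int), 0
    # Strictly positive, relative tolerance (tol > 0).
    rank = int(np.sum(diag > rtol * diag[0]))
    keep = np.sort(piv[:rank])
    return keep, rank


# Matrix from the question's docstring: col 2 = 2 * col 1, col 4 = col 1 + col 3.
A = np.array([[2, 4, 1, 3],
              [-1, -2, 1, 0],
              [0, 0, 2, 2],
              [3, 6, 2, 5]], dtype=float)

keep, rank = independent_columns_pivoted(A)
reduced = A[:, keep]

# Check that columns were really dropped.
print("columns before:", A.shape[1], "columns after:", reduced.shape[1])
print("rank of reduced matrix:", np.linalg.matrix_rank(reduced))
```

Which of the dependent columns survives is not unique; only the number of kept columns is determined by the rank.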

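Logit needs to do some nonlinear calculations with exog, so even if the design matrix is not very close to perfect collinearity, the transformed variables used for the log-likelihood, derivative, or Hessian calculations might still end up with numerical problems, such as a singular Hessian.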
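
To make this concrete, here is a small sketch on synthetic data (the data, seed, and parameter values are assumptions for illustration only). It builds the negative Hessian of the logit log-likelihood, X' diag(p(1-p)) X, by hand and compares its condition number at moderate and at extreme parameter values, while the design matrix itself stays the same.

```python
import numpy as np


def logit_hessian(X, params):
    """Negative Hessian of the logit log-likelihood: X' diag(p * (1 - p)) X."""
    p = 1.0 / (1.0 + np.exp(-X @ params))
    w = p * (1.0 - p)          # weights shrink toward 0 when p is near 0 or 1
    return (X * w[:, None]).T @ X


rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])

# The design matrix itself is well conditioned.
cond_X = np.linalg.cond(X.T @ X)

# Moderate parameters: the weights are all of similar size.
cond_small = np.linalg.cond(logit_hessian(X, np.array([0.1, 0.5, -0.5])))

# Extreme parameters: most weights are essentially zero, so only a few
# rows contribute and the Hessian is much closer to singular.
cond_large = np.linalg.cond(logit_hessian(X, np.array([0.0, 40.0, -40.0])))

print(cond_X, cond_small, cond_large)
```

The same X gives a far worse conditioned Hessian once the fitted probabilities pile up near 0 or 1, which is one way a Newton step can fail even after collinear columns have been removed.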

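(These are all floating point issues, since we are working near floating point precision, 1e-15 or 1e-16. The default thresholds of matrix_rank and similar linalg functions sometimes differ, which can mean that in some edge cases one function identifies a matrix as singular and another does not.)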
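
A minimal sketch of how the threshold changes the verdict, on synthetic data (the noise level 1e-9 and the cutoff 1e-6 are assumptions chosen for the example): a column that is a linear combination of two others up to tiny noise counts as independent under numpy's default matrix_rank threshold, but as dependent under a looser absolute tolerance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

# Fifth column: a combination of two others, exact up to noise of order 1e-9.
noisy_combo = X[:, 0] + X[:, 1] + 1e-9 * rng.normal(size=200)
X = np.column_stack([X, noisy_combo])

singular_values = np.linalg.svd(X, compute_uv=False)

# Default threshold: largest singular value * max(M, N) * machine epsilon.
rank_default = np.linalg.matrix_rank(X)

# Looser absolute threshold on the singular values.
rank_loose = np.linalg.matrix_rank(X, tol=1e-6)

print(singular_values)
print(rank_default, rank_loose)
```

Looking at the singular values directly is often more informative than the rank alone, because it shows how close to the threshold the smallest one sits.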

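The default optimization method for the discrete models, including Logit, is a simple Newton method, which is fast in reasonably well-behaved cases but can fail in badly conditioned ones. You can try one of the other optimizers, which are the ones in scipy.optimize: method='nm' is usually very robust but slow, and method='bfgs' works well in many cases but can also run into convergence problems.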

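Nevertheless, even when one of the other optimization methods succeeds, it is still necessary to inspect the results. More often than not, a failure with one method means that the model or estimation problem might not be well defined.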

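A good way to check whether it is just a problem of bad starting values or a specification problem is to run method='nm' first, then run one of the more accurate methods such as newton or bfgs using the nm estimate as starting values, and see whether it succeeds from good starting values.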

Regarding python-2.7 - Detecting multicollinear columns, or columns that are linear combinations, while modeling in Python: LinAlgError, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/23848003/
