python - Python and Stata give different results for a multinomial logit model

Reposted by: 太空狗 | Updated: 2023-10-30 02:53:33

I am trying to build a multinomial logit model in both Python and Stata. My data looks like this:

    ses_type prog_type  read  write  math  prog  ses 
0        low   Diploma  39.2   40.2  46.2     0     0
1     middle   general  39.2   38.2  46.2     1     1
2       high   Diploma  44.5   44.5  49.5     0     2
3        low   Diploma  43.0   43.0  48.0     0     0
4     middle   Diploma  44.5   36.5  45.5     0     1
5       high   general  47.3   41.3  47.3     1     2

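I am trying to predict prog from ses, read, write, and math. Here ses stands for socioeconomic status and is a nominal variable, so I fit the model in Stata with the following command: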

mlogit prog i.ses read write math, base(2)

The Stata output is as follows:

Iteration 0:   log likelihood = -204.09667  
Iteration 1:   log likelihood = -171.90258  
Iteration 2:   log likelihood = -170.13513  
Iteration 3:   log likelihood = -170.11071  
Iteration 4:   log likelihood =  -170.1107  

Multinomial logistic regression                 Number of obs     =        200
                                                LR chi2(10)       =      67.97
                                                Prob > chi2       =     0.0000
Log likelihood =  -170.1107                     Pseudo R2         =     0.1665

------------------------------------------------------------------------------
        prog |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
0            |
         ses |
          1  |   .6197969   .5059335     1.23   0.221    -.3718146    1.611408
          2  |  -.5131952   .6280601    -0.82   0.414     -1.74417    .7177799
             |
        read |  -.0405302   .0289314    -1.40   0.161    -.0972346    .0161742
       write |  -.0459711   .0270153    -1.70   0.089      -.09892    .0069779
        math |  -.0990497   .0331576    -2.99   0.003    -.1640373   -.0340621
       _cons |   9.544131   1.738404     5.49   0.000     6.136921    12.95134
-------------+----------------------------------------------------------------
1            |
         ses |
          1  |  -.3350861   .4607246    -0.73   0.467     -1.23809    .5679176
          2  |  -.8687013   .5363968    -1.62   0.105     -1.92002     .182617
             |
        read |  -.0226249   .0264534    -0.86   0.392    -.0744726    .0292228
       write |   -.011618   .0266782    -0.44   0.663    -.0639063    .0406703
        math |  -.0591301   .0299996    -1.97   0.049    -.1179283    -.000332
       _cons |   5.041193   1.524174     3.31   0.001     2.053866    8.028519
-------------+----------------------------------------------------------------
2            |  (base outcome)
------------------------------------------------------------------------------

I tried to replicate the same results with the scikit-learn module in Python. Here is the code:

import numpy as np
import pandas as pd
from sklearn import linear_model

data = pd.read_csv("C://Users/Furqan/Desktop/random_data.csv")

train_x = np.array(data[['read', 'write', 'math','ses ']])
train_y = np.array(data['prog'])

mul_lr = linear_model.LogisticRegression(multi_class='multinomial',
                                         solver='newton-cg').fit(train_x, train_y)

print(mul_lr.intercept_)
print(mul_lr.coef_)

The output (intercepts and coefficients) is as follows:

[ 4.76438772  0.19347405 -4.95786177]

[[-0.01735513 -0.02731273 -0.04463257  0.01721334]
 [-0.00319366  0.00783135 -0.00689664 -0.24480926]
 [ 0.02054879  0.01948137  0.05152921  0.22759592]]

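The results turn out to be different.

My first question is: why do the results differ?

My second question is: for a nominal predictor, how do we tell Python that ses is an indicator variable?

Edit:

Link to the data file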

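Best answer

Several issues make the Stata and sklearn results differ:

Different actual predictors in Stata and sklearn
Different representations of the fitted parameters
Different objective functions when fitting the model

We need to address all three to get similar output.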

1. Making dummy variables

The formula Stata uses for the linear part is

 prediction = a0 + a1 * [ses==1] + a2 * [ses==2] + a3 * read + a4 * write + a5 * math

Sklearn, in turn, knows nothing about the categorical nature of ses and tries to use

 prediction = a0 + a1 * ses + a3 * read + a4 * write + a5 * math

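To enable categorical predictors, you need to preprocess the data. This is the only possible way to include a categorical variable in an sklearn logistic regression. I find pd.get_dummies() the most convenient way to do it.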
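As a minimal sketch of what this encoding does, the toy frame below (made-up values, not the data file from the question) stores ses as integer codes. Passing `drop_first=True` is one way to drop the lowest level so that it acts as the reference category, much like `i.ses` in Stata:

```python
import pandas as pd

# Hypothetical toy data: ses is stored as integer codes 0, 1, 2.
toy = pd.DataFrame({
    "ses": [0, 1, 2, 0, 1, 2],
    "read": [39.2, 39.2, 44.5, 43.0, 44.5, 47.3],
})

# One indicator column per level of ses; drop_first=True removes the
# lowest level (ses=0), which then serves as the reference category.
dummies = pd.get_dummies(toy, columns=["ses"], drop_first=True, dtype=int)
print(dummies.columns.tolist())
```

The remaining columns `ses_1` and `ses_2` play the role of `[ses==1]` and `[ses==2]` in the Stata formula above.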

The following code creates dummy variables for ses and then drops the "low" level, which apparently corresponds to ses=0 in your example:

import pandas as pd, numpy as np
from sklearn import linear_model

data = pd.read_csv("d1.csv", sep='\t')
data.columns = data.columns.str.strip()

raw_x = data.drop('prog', axis=1)
# making the dummies
train_x = pd.get_dummies(raw_x, columns=['ses']).drop('ses_low ', axis=1)
print(train_x.columns)
train_y = data['prog']

mul_lr = linear_model.LogisticRegression(multi_class='multinomial',
                                         solver='newton-cg').fit(train_x, train_y)
reorder = [4, 3, 0, 1, 2] # the order in which coefficients show up in Stata

print(mul_lr.intercept_)
print(mul_lr.coef_[:, reorder])

Output

['read', 'write', 'math', 'ses_high ', 'ses_middle ']
[ 4.67331919  0.19082335 -4.86414254]
[[ 0.47140512 -0.08236331 -0.01909793 -0.02680609 -0.04587383]
 [-0.36381476 -0.33294749 -0.0021255   0.00765828 -0.00703075]
 [-0.10759035  0.4153108   0.02122343  0.01914781  0.05290458]]

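You can see that Python has successfully encoded ses into 'ses_high ' and 'ses_middle ', but it still fails to produce the expected coefficients.

By the way, I changed the order of the coef_ columns in the output so that it looks like the Stata output.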
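If you prefer to keep the encoding inside the model object, one alternative is to do the same preprocessing with sklearn's `OneHotEncoder` inside a `ColumnTransformer`. The sketch below uses randomly generated toy data (hypothetical, not the data file from the question) and a large `C` so that the penalty is nearly switched off:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical random data with the same column names as in the question.
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "ses": rng.integers(0, 3, size=n),
    "read": rng.normal(50, 10, size=n),
    "write": rng.normal(50, 10, size=n),
    "math": rng.normal(50, 10, size=n),
    "prog": rng.integers(0, 3, size=n),
})

# Encode ses as indicators (dropping the first level as the reference)
# and pass the numeric columns through unchanged.
encode = ColumnTransformer(
    [("ses", OneHotEncoder(drop="first"), ["ses"])],
    remainder="passthrough",
)
model = make_pipeline(
    encode,
    LogisticRegression(solver="newton-cg", C=1e6, max_iter=1000),
)
model.fit(toy[["ses", "read", "write", "math"]], toy["prog"])

coef = model[-1].coef_
print(coef.shape)  # one row per outcome; 2 ses dummies + 3 numeric columns
```

This is still preprocessing, just bundled with the estimator, so the same encoding is applied again at prediction time.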

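2. Rearranging the results

The coefficients differ because Stata treats the third category of the outcome (prog=='honors ') as the base outcome and subtracts its parameters from those of every outcome. In Python, you can reproduce this by running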

print(mul_lr.intercept_ - mul_lr.intercept_[-1])
print((mul_lr.coef_  - mul_lr.coef_[-1])[:, reorder])

which gives you

[9.53746174 5.0549659  0.        ]
[[ 0.57899547 -0.4976741  -0.04032136 -0.0459539  -0.09877841]
 [-0.25622441 -0.74825829 -0.02334893 -0.01148954 -0.05993533]
 [ 0.          0.          0.          0.          0.        ]]
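Subtracting the base outcome's parameters only changes how the model is written down, not what it predicts: the softmax probabilities are unaffected when the same vector is subtracted from every outcome's parameters. A small NumPy check with made-up parameters (the arrays below are random, not the fitted values above) illustrates this:

```python
import numpy as np

def softmax(z):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))       # 5 made-up observations, 3 predictors
coef = rng.normal(size=(3, 3))    # one row of slopes per outcome, like coef_
intercept = rng.normal(size=3)    # one intercept per outcome, like intercept_

# Probabilities from the symmetric (sklearn-style) parameterization.
p_raw = softmax(X @ coef.T + intercept)

# Probabilities after re-expressing everything relative to the last outcome.
coef_b = coef - coef[-1]
intercept_b = intercept - intercept[-1]
p_base = softmax(X @ coef_b.T + intercept_b)

print(np.allclose(p_raw, p_base))
```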

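After this subtraction, the parameters are close to those Stata gives:

Intercepts: (9.53, 5.05) in Python vs. (9.54, 5.04) in Stata
First outcome coefficients: (0.57, -0.49, ...) vs. (0.61, -0.51, ...)
Second outcome coefficients: (-0.25, -0.74, ...) vs. (-0.33, -0.86, ...)

Can you see the pattern? In sklearn, the slope coefficients are smaller (closer to zero) than in Stata. This is no accident!

3. Dealing with regularization

This happens because sklearn deliberately shrinks the slope coefficients toward 0 by adding a quadratic penalty on the coefficients to the likelihood function it maximizes. This makes the estimates biased but more stable, even under severe multicollinearity. In Bayesian terms, this regularization corresponds to a zero-mean Gaussian prior on all coefficients. You can learn more about regularization in the wiki.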
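Written out, and assuming sklearn's documented L2-penalized form (the exact scaling convention may differ between versions), the fitted parameters solve roughly the following problem, where $W$ holds the slope coefficients, $b$ the unpenalized intercepts, and $C > 0$ is the inverse regularization strength:

```latex
\min_{W,\,b}\;\; \frac{1}{2}\,\lVert W \rVert_F^2
  \;+\; C \sum_{i=1}^{n} \bigl(-\log p(y_i \mid x_i;\, W, b)\bigr)
\quad\Longleftrightarrow\quad
\max_{W,\,b}\;\; \sum_{i=1}^{n} \log p(y_i \mid x_i;\, W, b)
  \;-\; \frac{1}{2C}\,\lVert W \rVert_F^2
```

The second form is the log-likelihood plus the log-density of independent $\mathcal{N}(0, C)$ priors on the slopes (up to a constant), which is where the Gaussian-prior reading comes from. Stata's mlogit maximizes the log-likelihood alone, which corresponds to the limit $C \to \infty$.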

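In sklearn, this quadratic penalty is controlled by the positive parameter C: the smaller C is, the stronger the regularization. You can think of it as the prior variance of each slope coefficient. The default is C=1, but you can make it very large, for example C=1000000, which means almost no regularization. In that case the output is almost identical to Stata's: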

mul_lr2 = linear_model.LogisticRegression(
    multi_class='multinomial', solver='newton-cg', C=1000000
).fit(train_x, train_y)
print(mul_lr2.intercept_ - mul_lr2.intercept_[-1])
print((mul_lr2.coef_  - mul_lr2.coef_[-1])[:, reorder])

which gives you

[9.54412644 5.04126452 0.        ]
[[ 0.61978951 -0.51320481 -0.04053013 -0.0459711  -0.09904948]
 [-0.33508605 -0.86869799 -0.02262518 -0.01161839 -0.05913068]
 [ 0.          0.          0.          0.          0.        ]]

The results still differ slightly (around the 5th decimal place), but with even weaker regularization the difference shrinks further.
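To see the shrinkage in isolation, here is a small sketch on synthetic data from `make_classification` (an assumption for illustration, not the data file from the question). It fits the same model with three values of C and compares the overall size of the slope coefficients:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem; every setting here is chosen for illustration only.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

norms = {}
for C in (0.01, 1.0, 1e6):
    model = LogisticRegression(C=C, solver="newton-cg", max_iter=1000).fit(X, y)
    # Frobenius norm of the slope matrix: grows as the penalty weakens.
    norms[C] = np.linalg.norm(model.coef_)
    print(C, norms[C])
```

The norm grows as C increases, which is the pattern seen above: the default C=1 pulls the slopes toward zero relative to the nearly unpenalized fit.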

This post on python - Python and Stata give different results for a multinomial logit model is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/49086721/
