r - Clustered standard errors with data containing NAs

Reposted by: 行者123. Last updated: 2023-12-04 12:31:55

I'm unable to cluster standard errors in R by following the guide in this post. The cl function returns the error:

Error in tapply(x, cluster1, sum) : arguments must have same length

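After reading up on tapply, I'm still not sure why my cluster argument has the wrong length, or what is causing this error.

Here is a link to the dataset I'm using.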

https://www.dropbox.com/s/y2od7um9pp4vn0s/Ec%201820%20-%20DD%20Data%20with%20Controls.csv

Here is the R code:

# read in data
charter<-read.csv(file.choose())
View(charter)
colnames(charter)

# standardize NAEP scores
charter$naep.standardized <- (charter$naep - mean(charter$naep, na.rm=T))/sd(charter$naep, na.rm=T)

# change NAs in year.passed column to 2014
charter$year.passed[is.na(charter$year.passed)]<-2014

# Add column with indicator for in treatment (passed legislation)
charter$treatment<-ifelse(charter$year.passed<=charter$year,1,0)

# fit model
charter.model<-lm(naep ~ factor(year) + factor(state) + treatment, data = charter)
summary(charter.model)
# account for clustered standard errors by state
cl(dat=charter, fm=charter.model, cluster=charter$state)

# accounting for controls (this call was left unfinished in the original post)
# charter.model.controls<-lm(naep~factor)

# clustered standard errors
# ---------

# function that calculates clustered standard errors
# source: http://thetarzan.wordpress.com/2011/06/11/clustered-standard-errors-in-r/
cl   <- function(dat, fm, cluster){
  require(sandwich, quietly = TRUE)
  require(lmtest, quietly = TRUE)
  M <- length(unique(cluster))
  N <- length(cluster)
  K <- fm$rank
  dfc <- (M/(M-1))*((N-1)/(N-K))
  print(K)
  uj  <- apply(estfun(fm),2, function(x) tapply(x, cluster, sum));
  vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
  coeftest(fm, vcovCL) 
}

# calculate clustered standard errors 
cl(charter, charter.model, charter$state)

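The inner workings of this function are a bit over my head.

Best answer

When you run the code, notice that the linear model dropped observations with missing values: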

> summary(charter.model)

Call:
lm(formula = naep ~ factor(year) + factor(state) + treatment, 
    data = charter)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2420  -1.6740  -0.2024   1.8345  12.3580 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 250.4983     1.2115 206.767  < 2e-16 ***
factor(year)1992              3.7970     0.7198   5.275 2.17e-07 ***
factor(year)1996              7.0436     0.8607   8.183 3.64e-15 ***

[..]

Residual standard error: 3.128 on 404 degrees of freedom
  (759 observations deleted due to missingness)
Multiple R-squared:  0.9337,    Adjusted R-squared:  0.9239 
F-statistic: 94.85 on 60 and 404 DF,  p-value: < 2.2e-16

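This is what causes the error message you are seeing: Error in tapply(x, cluster1, sum) : arguments must have same length.

In cl(dat=charter, fm=charter.model, cluster=charter$state), the cluster variable charter$state must have exactly the same length as the number of observations actually used in the regression. Because of the NAs, that number differs from the number of rows in the original data frame: the summary above shows 404 residual degrees of freedom plus 61 coefficients, so 465 observations were used, while 759 were dropped.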
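The mismatch can be reproduced in a few lines of base R. This is a minimal sketch on a made-up data frame (toy, fit, and the column names are assumptions for illustration, not the charter data): lm() silently drops the rows with a missing outcome, so the residuals are shorter than the full-length cluster column, and tapply() refuses to combine them.

```r
# Toy data (hypothetical, not the charter data): 10 rows, 3 with a missing outcome
toy <- data.frame(
  y     = c(1.2, NA, 2.8, 4.1, NA, 5.9, 7.2, 8.1, NA, 9.7),
  x     = 1:10,
  state = rep(c("A", "B", "C", "D", "E"), each = 2)
)

fit <- lm(y ~ x, data = toy)

n_rows <- nrow(toy)   # rows in the original data frame
n_used <- nobs(fit)   # rows lm() actually used after dropping NAs

# Summing residuals by cluster fails when the cluster vector is full length
msg <- tryCatch(
  {
    tapply(residuals(fit), toy$state, sum)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

print(c(rows = n_rows, used = n_used))
print(msg)
```

Comparing nrow() of the data with nobs() of the fitted model is a quick check to run before passing any cluster vector to cl.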

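To fix this, you can do the following.

First, you are using an old version of Arai's function (cl). See Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R for references to both the old and the new version; the latter is called clx.

Second, I think Arai's original approach to this function is a bit convoluted and doesn't really follow the standard interface of the vcov* functions from sandwich. That is why I came up with a slightly modified version of clx. I made the code more readable and the interface closer to what you would expect from the sandwich vcov* functions: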

vcovCL <- function(x, cluster.by, type="sss", dfcw=1){
    # R-codes (www.r-project.org) for computing
    # clustered-standard errors. Mahmood Arai, Jan 26, 2008.

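    # Arguments:
    #   x          fitted model
    #   cluster.by cluster variable, one value per observation used in the fit
    #   type       small-sample correction; only "sss" (Stata-style) is supported
    #   dfcw       factor for reweighting the var-cov matrix for the within model
    # You need to install libraries `sandwich' and `lmtest'
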
    require(sandwich)
    cluster <- cluster.by
    M <- length(unique(cluster))   
    N <- length(cluster)
    stopifnot(N == length(x$residuals))
    K <- x$rank
    ##only Stata small-sample correction supported right now 
    ##see plm >= 1.5-4
    stopifnot(type=="sss")  
    if(type=="sss"){
        dfc <- (M/(M-1))*((N-1)/(N-K))
    }
    uj  <- apply(estfun(x), 2, function(y) tapply(y, cluster, sum))
    mycov <- dfc * sandwich(x, meat=crossprod(uj)/N) * dfcw
    return(mycov)
}
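For readers who find the internals opaque, the computation can be written out by hand. The sketch below assumes an unweighted lm() fit on simulated data (g, x, y, and all other names are made up for illustration): it multiplies each row of the design matrix by its residual, sums those scores within clusters, and places the result between two copies of (X'X)^{-1}, scaled by the same small-sample factor dfc as above. It is meant to illustrate the algebra, not to replace the function.

```r
set.seed(1)
# Simulated data (hypothetical): 20 clusters of 5 observations each
g <- rep(1:20, each = 5)
x <- rnorm(100)
# A shared cluster-level shock induces within-cluster correlation
y <- 1 + 0.5 * x + rnorm(20)[g] + rnorm(100)
fit <- lm(y ~ x)

X  <- model.matrix(fit)
u  <- X * residuals(fit)   # per-observation scores (row i of X times residual i)
uj <- rowsum(u, g)         # sum the scores within each cluster

M <- length(unique(g))     # number of clusters
N <- nrow(X)               # number of observations used in the fit
K <- fit$rank              # number of estimated coefficients
dfc <- (M / (M - 1)) * ((N - 1) / (N - K))   # Stata-style small-sample correction

bread <- solve(crossprod(X))                 # (X'X)^{-1}
V_cl  <- dfc * bread %*% crossprod(uj) %*% bread

se_cl  <- sqrt(diag(V_cl))       # cluster-robust standard errors
se_ols <- sqrt(diag(vcov(fit)))  # classical standard errors, for comparison
print(cbind(se_ols, se_cl))
```

The rowsum() step is where the lengths have to agree: u has one row per observation used in the fit, so the cluster vector must too.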

If you try vcovCL on your data, you will see that it catches this particular problem:

> coeftest(charter.model, vcov=function(x) vcovCL(x, charter$state))
 Error: N == length(x$residuals) is not TRUE

To avoid this problem, you can proceed as follows:

> charter.x <- na.omit(charter[ , c("state", 
                                  all.vars(formula(charter.model)))])
> coeftest(charter.model, vcov=function(x) vcovCL(x, charter.x$state)) 

t test of coefficients:

                               Estimate  Std. Error     t value  Pr(>|t|)    
(Intercept)                  2.5050e+02  9.3781e-01  2.6711e+02 < 2.2e-16 ***
factor(year)1992             3.7970e+00  5.6019e-01  6.7780e+00 4.330e-11 ***
factor(year)1996             7.0436e+00  8.8574e-01  7.9522e+00 1.856e-14 ***
factor(year)2000             8.4313e+00  1.0906e+00  7.7311e+00 8.560e-14 ***
factor(year)2003             1.2392e+01  1.1670e+00  1.0619e+01 < 2.2e-16 ***
factor(year)2005             1.3490e+01  1.1747e+00  1.1484e+01 < 2.2e-16 ***
factor(year)2007             1.6334e+01  1.2469e+00  1.3100e+01 < 2.2e-16 ***
factor(year)2009             1.8118e+01  1.2556e+00  1.4430e+01 < 2.2e-16 ***
factor(year)2011             1.9110e+01  1.3459e+00  1.4199e+01 < 2.2e-16 ***
factor(year)2013             1.9301e+01  1.4896e+00  1.2957e+01 < 2.2e-16 ***
factor(state)Alaska          1.4178e+01  8.7686e-01  1.6169e+01 < 2.2e-16 ***
factor(state)Arizona         8.6313e+00  8.1439e-01  1.0598e+01 < 2.2e-16 ***
factor(state)Arkansas        4.3313e+00  8.1439e-01  5.3185e+00 1.736e-07 ***
factor(state)California      3.1103e+00  9.1619e-01  3.3948e+00 0.0007549 ***
factor(state)Colorado        1.7939e+01  7.9736e-01  2.2498e+01 < 2.2e-16 ***
factor(state)Connecticut     1.8031e+01  8.1439e-01  2.2141e+01 < 2.2e-16 ***
factor(state)D.C.           -1.8369e+01  8.1439e-01 -2.2555e+01 < 2.2e-16 ***
factor(state)Delaware        1.2050e+01  7.9736e-01  1.5113e+01 < 2.2e-16 ***
factor(state)Florida         7.3838e+00  7.9736e-01  9.2602e+00 < 2.2e-16 ***
factor(state)Georgia         6.4313e+00  8.1439e-01  7.8971e+00 2.724e-14 ***
factor(state)Hawaii          3.3313e+00  8.1439e-01  4.0906e+00 5.196e-05 ***
factor(state)Idaho           1.7118e+01  7.8321e-01  2.1857e+01 < 2.2e-16 ***
factor(state)Illinois        1.2670e+01  8.2224e-01  1.5409e+01 < 2.2e-16 ***
factor(state)Indianna        1.7174e+01  6.1079e-01  2.8117e+01 < 2.2e-16 ***
factor(state)Iowa            2.0074e+01  6.8460e-01  2.9322e+01 < 2.2e-16 ***
factor(state)Kansas          2.0123e+01  8.6796e-01  2.3184e+01 < 2.2e-16 ***
factor(state)Kentucky        1.0200e+01  4.1999e-14  2.4287e+14 < 2.2e-16 ***
factor(state)Louisiana      -1.6866e-01  8.1439e-01 -2.0710e-01 0.8360322    
factor(state)Maine           2.0231e+01  1.7564e-01  1.1518e+02 < 2.2e-16 ***
factor(state)Maryland        1.4274e+01  6.1079e-01  2.3369e+01 < 2.2e-16 ***
factor(state)Massachusetts   2.4868e+01  8.3960e-01  2.9619e+01 < 2.2e-16 ***
factor(state)Michigan        1.2031e+01  8.1439e-01  1.4773e+01 < 2.2e-16 ***
factor(state)Minnesota       2.5110e+01  9.1619e-01  2.7407e+01 < 2.2e-16 ***
factor(state)Mississippi    -3.5470e+00  1.7564e-01 -2.0195e+01 < 2.2e-16 ***
factor(state)Missouri        1.3447e+01  7.2706e-01  1.8495e+01 < 2.2e-16 ***
factor(state)Montana         2.2512e+01  8.4814e-01  2.6543e+01 < 2.2e-16 ***
factor(state)Nebraska        1.9600e+01  4.3105e-14  4.5471e+14 < 2.2e-16 ***
factor(state)Nevada          4.9800e+00  8.6796e-01  5.7375e+00 1.887e-08 ***
factor(state)New Hampshire   2.2026e+01  7.6338e-01  2.8853e+01 < 2.2e-16 ***
factor(state)New Jersey      2.0651e+01  7.6338e-01  2.7052e+01 < 2.2e-16 ***
factor(state)New Mexico      1.5313e+00  8.1439e-01  1.8803e+00 0.0607809 .  
factor(state)New York        1.2152e+01  7.1259e-01  1.7054e+01 < 2.2e-16 ***
factor(state)North Carolina  1.2231e+01  8.1439e-01  1.5019e+01 < 2.2e-16 ***
factor(state)North Dakota    2.4278e+01  1.0420e-01  2.3299e+02 < 2.2e-16 ***
factor(state)Ohio            1.7118e+01  7.8321e-01  2.1857e+01 < 2.2e-16 ***
factor(state)Oklahoma        8.4518e+00  7.8321e-01  1.0791e+01 < 2.2e-16 ***
factor(state)Oregon          1.6535e+01  7.3538e-01  2.2486e+01 < 2.2e-16 ***
factor(state)Pennsylvania    1.6651e+01  7.6338e-01  2.1812e+01 < 2.2e-16 ***
factor(state)Rhode Island    9.5313e+00  8.1439e-01  1.1704e+01 < 2.2e-16 ***
factor(state)South Carolina  9.5346e+00  8.3960e-01  1.1356e+01 < 2.2e-16 ***
factor(state)South Dakota    2.1211e+01  3.5103e-01  6.0425e+01 < 2.2e-16 ***
factor(state)Tennessee       4.9148e+00  6.1473e-01  7.9951e+00 1.375e-14 ***
factor(state)Texas           1.4231e+01  8.1439e-01  1.7475e+01 < 2.2e-16 ***
factor(state)Utah            1.5114e+01  7.2706e-01  2.0787e+01 < 2.2e-16 ***
factor(state)Vermont         2.3474e+01  2.0299e-01  1.1564e+02 < 2.2e-16 ***
factor(state)Virginia        1.6252e+01  7.1259e-01  2.2807e+01 < 2.2e-16 ***
factor(state)Washington      1.9073e+01  1.8183e-01  1.0489e+02 < 2.2e-16 ***
factor(state)West Virginia   5.0000e+00  4.2022e-14  1.1899e+14 < 2.2e-16 ***
factor(state)Wisconsin       1.9994e+01  8.2447e-01  2.4251e+01 < 2.2e-16 ***
factor(state)Wyoming         1.8231e+01  8.1439e-01  2.2386e+01 < 2.2e-16 ***
treatment                    1.2108e+00  1.0180e+00  1.1894e+00 0.2349682    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This isn't pretty, but it gets the job done. Now cl also works and produces the same result as above:

cl(dat=charter, fm=charter.model, cluster=charter.x$state)
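An alternative to rebuilding the data with na.omit() is to ask the fitted model which rows it dropped. The sketch below is a base R illustration on made-up data (toy, state_used, and the other names are assumptions). It relies on the default na.action = na.omit; with na.exclude, residuals() is padded with NAs and the alignment differs.

```r
# Toy data (hypothetical): 10 rows, 3 with a missing outcome
toy <- data.frame(
  y     = c(1.2, NA, 2.8, 4.1, NA, 5.9, 7.2, 8.1, NA, 9.7),
  x     = 1:10,
  state = rep(c("A", "B", "C", "D", "E"), each = 2)
)
fit <- lm(y ~ x, data = toy)

# Row indices that lm() dropped; NULL if nothing was dropped
dropped <- na.action(fit)

# Keep only the cluster values for rows that entered the fit
state_used <- if (is.null(dropped)) toy$state else toy$state[-as.integer(dropped)]

# The aligned vector matches the residuals, so tapply() now works
sums <- tapply(residuals(fit), state_used, sum)
print(sums)
```

The aligned vector can then be passed as the cluster argument in place of the full-length column.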

A better way to solve this is to use the multiwayvcov package. According to the package's website, one of its improvements over Arai's code is:

Transparent handling of observations dropped due to missingness

Using the Petersen data with simulated NAs and cluster.vcov():

library("lmtest")
library("multiwayvcov")

data(petersen)
set.seed(123)
petersen[ sample(1:5000, 15), 3] <- NA

m1 <- lm(y ~ x, data = petersen)
summary(m1)
## 
## Call:
## lm(formula = y ~ x, data = petersen)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.759 -1.371 -0.018  1.340  8.680 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02793    0.02842   0.983    0.326    
## x            1.03635    0.02865  36.175   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 2.007 on 4983 degrees of freedom
##   (15 observations deleted due to missingness)
## Multiple R-squared:  0.208,  Adjusted R-squared:  0.2078 
## F-statistic:  1309 on 1 and 4983 DF,  p-value: < 2.2e-16

coeftest(m1, vcov=function(x) cluster.vcov(x, petersen$firmid))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.027932   0.067198  0.4157   0.6777    
## x           1.036354   0.050700 20.4407   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
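Later releases of sandwich ship their own vcovCL() function, which as far as I know accepts the clusters as a formula and accounts for rows the model dropped. The following is a sketch under that assumption, on simulated data (d, firm, and the other names are made up); check ?vcovCL in your installed version before relying on it. Because the answer above defines its own vcovCL, the package version is called with the sandwich:: prefix.

```r
library(sandwich)  # assumes a version that provides vcovCL()
library(lmtest)

set.seed(42)
# Simulated data (hypothetical): 50 firms, 10 observations each
d <- data.frame(firm = rep(1:50, each = 10), x = rnorm(500))
d$y <- 1 + d$x + rnorm(50)[d$firm] + rnorm(500)
d$y[sample(nrow(d), 15)] <- NA   # knock out 15 outcomes

m <- lm(y ~ x, data = d)

# Clusters given as a formula are looked up in the model's data,
# so rows dropped for missingness are handled by the package
V  <- sandwich::vcovCL(m, cluster = ~ firm)
ct <- coeftest(m, vcov. = V)
print(ct)
```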

For a different approach using the plm package, see:

Double clustered standard errors for panel data

Regarding "r - Clustered standard errors with data containing NAs", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23313907/
