
Run a regression for each row of a data.table

Reposted. Author: 行者123. Updated: 2023-12-04 10:58:40

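I have a large data set with K > 1 million observations and 2 x N columns. N of the columns are X's and the other N are Y's. N is relatively small, N << K.

For each observation I want to run a regression of the form Y_i = beta * X_i, where i = 1,...,N. So I need to run K regressions and obtain K betas: beta_1,...,beta_K. I need to do this in a reasonably efficient way.

Here is the most naive code I could come up with.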

library(data.table)
library(tibble)
library(magrittr)  # provides the %>% pipe

set.seed(10)
# The data looks like this
K = 15 # number of observations; in reality K is much larger, say K = 100,000
# Here N = 5 implicitly, but it can vary, so code that takes N as a parameter would be helpful.
data =
  tibble(X1 = rnorm(K,0,1),
         X2 = rnorm(K,0,1),
         X3 = rnorm(K,0,1),
         X4 = rnorm(K,0,1),
         X5 = rnorm(K,0,1),
         Y1 = rnorm(K,0,1),
         Y2 = rnorm(K,0,1),
         Y3 = rnorm(K,0,1),
         Y4 = rnorm(K,0,1),
         Y5 = rnorm(K,0,1)) %>%
  as.data.table()
# Let's introduce some missing values
data[3,5] = NA
data[5,2] = NA
data[7,4] = NA
data[7,5] = NA
# A naive loop: one no-intercept regression per row
beta = numeric(K)
for(i in 1:K){
  # transpose row i so the N X-values and N Y-values become two columns
  reg_data = tibble(x = data[i,1:5] %>% t(),
                    y = data[i,6:10] %>% t() )
  colnames(reg_data) = c("x","y")
  beta[i] = lm(y ~ x + 0, reg_data)$coefficients
}
data$beta = beta
data

Data:

             X1          X2          X3          X4          X5          Y1          Y2          Y3          Y4          Y5        beta
 1:  0.01874617  0.08934727 -1.85374045 -0.02881534  -1.2375945  1.17270628 -0.41635467  0.56317466 -0.48136561 -0.98306919  0.03442037
 2: -0.18425254 -0.95494386 -0.07794607  0.23252515  -0.4561763 -1.47982702 -0.19148234  0.66098669  0.20288178  0.49533171  0.18549600
 3: -1.37133055 -0.19515038  0.96856634 -0.30120868          NA -0.43038782  0.06954478 -1.65805086 -0.03173974  0.72581750 -0.34597018
 4: -0.59916772  0.92552126  0.18492596 -0.67761458   0.3401156 -1.05163864  1.15534832  1.02816798 -1.19558030  0.66729873  1.60396614
 5:  0.29454513          NA -1.37994358  0.65522764   1.0663764  1.52258634  0.59495735  1.12795361  0.62368124  0.95478644  0.08960632
 6:  0.38979430 -0.59631064 -1.43551436 -0.40063755   1.2161258  0.59282805 -1.41964511 -1.28015460 -0.91480448 -1.67533218  0.29574377
 7: -1.20807618 -2.18528684  0.36208723          NA          NA -0.22266151 -1.60667725  1.12886823  0.24875801 -1.20518539  0.65799077
 8: -0.36367602 -0.67486594 -1.75908675  1.36795395  -0.4812086  0.71289428  0.89292590 -0.46413453 -1.06262279 -1.96325249 -0.09581759
 9: -1.62667268 -2.11906119 -0.32454401  2.13776710   0.5627448  0.71660083  0.14816796 -0.31576021 -0.36398225  1.47075231 -0.10946286
10: -0.25647839 -1.26519802 -0.65156299  0.50581926  -1.2463197  0.44024186  1.22702839  0.92429315 -1.20699485  0.37247234 -0.85696018
11:  1.10177950 -0.37366156  1.08655140  0.78634238   0.3809222  0.15883062 -0.76180434  0.07714472  1.42921278  1.06587933  0.62874846
12:  0.75578151 -0.68755543 -0.76254488 -0.90221194  -1.4304273  0.65976414  0.41937541  1.03992361  0.63343589  0.53064987 -0.42653775
13: -0.23823356 -0.87215883 -0.82866254  0.53289699  -1.0484455  2.22051966 -1.03994336  0.74188621 -1.99681562  0.10198345 -0.48758134
14:  0.98744470 -0.10176101  0.83447390 -0.64589425  -0.2185036 -1.18394507  0.71157397  1.25554486 -0.68183217  1.33778247 -0.02128415
15:  0.74139013 -0.25378053 -0.96765199  0.29098749  -1.4899362 -0.07395583 -0.63321301  0.95091897 -0.46005548  0.08723477 -0.27967233


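What is an efficient data.table way to do this?

Best answer

Here is an option: convert the data to long format, then compute the coefficients with the closed-form equations for a linear regression (the original answer links to them as "here").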
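For reference, the expressions that the code below evaluates per row are the standard closed-form least-squares estimates for a simple linear regression with an intercept. Written out as a sketch, with n standing for the number of complete (X, Y) pairs in the row (`.N` in the code), b for the slope and a for the intercept:

```latex
b = \frac{\sum_{i} x_i y_i \;-\; \frac{1}{n}\left(\sum_{i} x_i\right)\left(\sum_{i} y_i\right)}
         {\sum_{i} x_i^2 \;-\; \frac{1}{n}\left(\sum_{i} x_i\right)^2},
\qquad
a = \frac{1}{n}\sum_{i} y_i \;-\; b \cdot \frac{1}{n}\sum_{i} x_i
```

The symbols map onto the code as `sumxy`, `sumx`, `sumy` and `sumxsq`; pairs with a missing value are already dropped by `na.rm=TRUE` in `melt`.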

DT <- melt(data[, rn := .I], id.vars="rn", measure.vars=patterns(c("X", "Y")),
    na.rm=TRUE, value.name=c("X","Y"))[order(rn)]

DT[, {
        sumx <- sum(X)
        sumy <- sum(Y)
        sumxsq <- sum(X^2)
        sumxy <- sum(X*Y)
        b <- (sumxy - sumx * sumy / .N) / (sumxsq - sumx^2 / .N)
        .(`(Intercept)`=sumy / .N - b * sumx / .N, X=b)
    },
    rn]

# for comparison
#DT[, as.list(lm(Y ~ X, .SD)$coefficients), rn]

Output:

    rn (Intercept)           X
 1:  1 -0.01297769  0.02656658
 2:  2 -0.01363072  0.16932027
 3:  3 -0.63389115 -0.53933759
 4:  4  0.06518737  1.59775760
 5:  5  1.07353565 -0.10238037
 6:  6 -0.92042107  0.11494024
 7:  7  0.83134259  1.05384614
 8:  8 -0.47319334 -0.25212270
 9:  9  0.31078353 -0.07436378
10: 10 -0.26208273 -1.05275512
11: 11  0.04097118  0.59169896
12: 12  0.67359486  0.02802331
13: 13 -0.40133349 -0.82876842
14: 14  0.31281032 -0.14598437
15: 15 -0.14023767 -0.34075091
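Note that the answer above fits an intercept, whereas the loop in the question uses `lm(y ~ x + 0)`, which is why the X column here differs from the question's beta column. For the no-intercept model the slope reduces to sum(X*Y) / sum(X^2) over the complete pairs, which can be computed for all rows at once without reshaping. The following is only a sketch under assumptions that are not part of the original answer: the helper name `row_beta_no_intercept` is hypothetical, and the columns are assumed to be named X1..XN and Y1..YN.

```r
library(data.table)

# Hypothetical helper (not from the original answer): per-row slope of the
# no-intercept model Y = beta * X, i.e. sum(X*Y) / sum(X^2) over complete pairs.
# Assumes columns are named X1..XN and Y1..YN.
row_beta_no_intercept <- function(dt, N) {
  X <- as.matrix(dt[, paste0("X", seq_len(N)), with = FALSE])
  Y <- as.matrix(dt[, paste0("Y", seq_len(N)), with = FALSE])
  ok <- !is.na(X) & !is.na(Y)   # pairs that lm() would keep
  XY <- X * Y
  XX <- X^2
  XY[!ok] <- 0                  # incomplete pairs contribute nothing
  XX[!ok] <- 0
  rowSums(XY) / rowSums(XX)     # NaN if a row has no complete pair
}

# Toy data with the same shape as in the question
set.seed(10)
K <- 15
N <- 5
toy <- as.data.table(matrix(rnorm(K * 2 * N), nrow = K))
setnames(toy, c(paste0("X", seq_len(N)), paste0("Y", seq_len(N))))
set(toy, i = 3L, j = "X5", value = NA_real_)  # a missing X
set(toy, i = 7L, j = "Y2", value = NA_real_)  # a missing Y

toy[, beta := row_beta_no_intercept(toy, N)]
toy
```

The matrix form avoids both the per-row `lm()` call and the `melt` step, at the cost of holding two K x N matrices in memory, which should be acceptable when N << K.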

On running a regression for each row of a data.table, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58740034/
