
r - Estimating combinations of panel data in R?

Reposted. Author: 行者123. Updated: 2023-12-02 10:08:43

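I am trying to maximize the number of data points in cross-sectional panel data. My matrix is structured as follows, with years along the y-axis (rows) and countries along the x-axis (columns):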

       A   B   C   D
2000  NA  50  NA  85
2001 110  75  76  86
2002 120  NA  78  87
2003 130 100  80  88

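So I am trying to find every possible combination of years and, for each combination, the largest set of countries that have data in all of those years. Using the example above, I am trying to produce a vector, list, or some other kind of object along these lines: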

2000, 2001, 2002, 2003 = D
2000, 2001, 2003 = D and B
2001, 2002, 2003 = D, A and C
2000, 2001 = D and B
2001, 2002 = D, A and C
2002, 2003 = D, A and C
2000 = D and B
2001 = A, B, C and D
2002 = A, C and D
2003 = A, B, C and D

This is an abstract problem and I can't wrap my head around it. I would appreciate any help.

Best answer

Update

Here is a solution that makes a good starting point, though it could probably be improved:

library(RcppAlgos)

getCombs <- function(myMat, myCap = NULL, minYears = NULL) {

    numRows <- nrow(myMat)
    myColNames <- colnames(myMat)

    ## repZero is the number of zeros (placeholders for "no year") in the multiset
    if (is.null(minYears))  ## set default
        repZero <- numRows - 1
    else if (minYears >= numRows || minYears < 1)  ## check for extreme cases
        repZero <- numRows - 1
    else
        repZero <- numRows - minYears

    ## NOTE: rowCap is the argument name used when this was written; newer
    ## RcppAlgos releases appear to call it upper
    combs <- comboGeneral(v = c(0, 1:numRows), m = numRows,
                          freqs = c(repZero, rep(1, numRows)),
                          rowCap = myCap)

    ## I think this part could be improved
    out <- lapply(1:nrow(combs), function(x) {
        myRows <- myMat[combs[x, ], ]  ## zero indices are dropped

        if (is.null(nrow(myRows)))  ## a single year yields a plain vector
            result <- !is.na(myRows)
        else
            result <- complete.cases(t(myRows))

        myColNames[result]
    })

    myRowNames <- rownames(myMat)
    names(out) <- lapply(1:nrow(combs), function(x) myRowNames[combs[x, combs[x, ] > 0]])
    out
}

Here is the output for the OP's example. (The OP's list is missing 5 of the results below):

testMat <- matrix(c(NA, 50, NA, 85, 110, 75, 76, 86, 120, NA, 78, 87, 130, 100, 80, 88), nrow = 4, byrow = TRUE)
row.names(testMat) <- 2000:2003
colnames(testMat) <- LETTERS[1:4]

getCombs(testMat)
$`2000`
[1] "B" "D"

$`2001`
[1] "A" "B" "C" "D"

$`2002`
[1] "A" "C" "D"

$`2003`
[1] "A" "B" "C" "D"

$`c(2000, 2001)`
[1] "B" "D"

$`c(2000, 2002)`
[1] "D"

$`c(2000, 2003)`
[1] "B" "D"

$`c(2001, 2002)`
[1] "A" "C" "D"

$`c(2001, 2003)`
[1] "A" "B" "C" "D"

$`c(2002, 2003)`
[1] "A" "C" "D"

$`c(2000, 2001, 2002)`
[1] "D"

$`c(2000, 2001, 2003)`
[1] "B" "D"

$`c(2000, 2002, 2003)`
[1] "D"

$`c(2001, 2002, 2003)`
[1] "A" "C" "D"

$`c(2000, 2001, 2002, 2003)`
[1] "D"

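However, neither this answer nor any future answer will give you every combination, because you have 144 countries and 47 years of data. That produces a very, very large number. Taking every combination of every length up to n is equivalent to the power set. The number of elements in a power set is simply 2^n. Since we do not count the equivalent of the empty set, we need to subtract one, thus: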

library(gmp)
sub.bigz(pow.bigz(2, 47),1)
Big Integer ('bigz') :
[1] 140737488355327

Yes, over one hundred trillion!!! You may want to rethink your approach, because there are simply too many results.
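The same count can be checked in base R without gmp, because 2^47 is still below the 2^53 limit up to which doubles represent integers exactly. This is only a sanity-check sketch; the variable names total and by_size are made up for illustration.

```r
n <- 47

# Number of non-empty subsets of an n-element set
total <- 2^n - 1

# Same count, obtained by adding up the number of k-year subsets for k = 1..n
by_size <- sum(choose(n, 1:n))

sprintf("%.0f", total)              # print without scientific notation
isTRUE(all.equal(total, by_size))   # the two counts should agree
```

Splitting the total by subset size is useful later on, since it shows how much a lower bound on the number of years shrinks the search space.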

All is not lost! You can use the myCap argument to limit the number of results, so that you can still explore the possible combinations. Observe:

set.seed(11111)
biggerTest <- matrix(sample(100, 20*20, replace = TRUE), nrow = 20)

colnames(biggerTest) <- LETTERS[1:20]
rownames(biggerTest) <- 1988:2007

## set 10% of values to NA
myNAs <- sample(400, 400 / 10)
biggerTest[myNAs] <- NA

biggerTest[1:6, 1:10]
      A  B  C   D  E  F  G  H  I  J
1988 51 71 79  35 22 33 22 84 68  4
1989 NA 51 73  10 48 NA 62 44 29 60
1990 NA 21 NA  44 91 24 45 62 52 18
1991 91 91 58  79 65 34 36 87 54 32
1992 82  6 74  75 99 NA 20 28 64 30
1993 80 10 43 100 24 22 99 28 22 44

## Getting all 1,048,575 results takes a good bit of time
system.time(allResults <- getCombs(biggerTest))
user system elapsed
49.449 0.726 50.191

## Using myCap greatly reduces the amount of time
system.time(smallSampTest <- getCombs(biggerTest, myCap = 10000))
user system elapsed
0.252 0.003 0.257

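Alternatively, you can use the minYears argument to return only results that cover at least a minimum number of years. For example, following the OP's comment on @CPak's answer, if you only want to see results for combinations of 15 or more years, we have: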

system.time(minYearTest <- getCombs(biggerTest, minYears = 15))
user system elapsed
1.408 0.018 1.428

set.seed(123)
minYearTest[sample(length(minYearTest), 5)]
$`c(1988, 1989, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2001, 2004, 2005, 2007)`
[1] "C" "E" "G" "T"

$`c(1988, 1989, 1990, 1991, 1993, 1994, 1996, 1997, 1998, 1999, 2000, 2002, 2003, 2004, 2005, 2007)`
[1] "G" "I" "T"

$`c(1988, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1999, 2000, 2001, 2003, 2004, 2005, 2007)`
[1] "D" "G" "K" "M" "T"

$`c(1988, 1990, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 2000, 2002, 2003, 2004, 2005, 2006, 2007)`
[1] "G" "J" "K" "T"

$`c(1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2003, 2004, 2005, 2006, 2007)`
[1] "E" "G" "T"
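To see why minYears cuts the run time so sharply, it helps to count how many year subsets survive the restriction. The sketch below assumes, as the function's construction suggests, that minYears = 15 keeps exactly the subsets containing 15 to 20 of the 20 years; the variable names are illustrative only.

```r
n <- 20          # years in biggerTest
minYears <- 15

# Subsets with at least minYears years
kept <- sum(choose(n, minYears:n))

# All non-empty subsets, for comparison
all_subsets <- 2^n - 1

kept
all_subsets
kept / all_subsets   # fraction of the full search space that is evaluated
```

Under that assumption, length(minYearTest) should equal kept.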

Or use both arguments together:

system.time(bothConstraintsTest <- getCombs(biggerTest, 10000, minYears = 10))
user system elapsed
0.487 0.004 0.494

bothConstraintsTest[1:5]
$`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1997")`
[1] "E" "G" "H" "J" "M" "R" "T"

$`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1998")`
[1] "E" "G" "H" "J" "T"

$`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1999")`
[1] "D" "E" "G" "M" "T"

$`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "2000")`
[1] "D" "G" "J" "M" "R" "T"

$`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "2001")`
[1] "D" "E" "G" "H" "J" "M" "R" "T"


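Explanation

The first thing we need to do is determine every combination of the n years. This boils down to finding all n-tuples (combinations of size n) of the multiset c(rep(0, n-1), 1:n) or, equivalently, the power set of an n-element set minus the empty set. For example, for the years 2000:2003 (a 4-year span), the possible combinations are given by: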

comboGeneral(v = c(0, 1:4), m = 4,
             freqs = c(3, rep(1, 4)))
      [,1] [,2] [,3] [,4]
 [1,]    0    0    0    1
 [2,]    0    0    0    2
 [3,]    0    0    0    3
 [4,]    0    0    0    4
 [5,]    0    0    1    2
 [6,]    0    0    1    3
 [7,]    0    0    1    4
 [8,]    0    0    2    3
 [9,]    0    0    2    4
[10,]    0    0    3    4
[11,]    0    1    2    3
[12,]    0    1    2    4
[13,]    0    1    3    4
[14,]    0    2    3    4
[15,]    1    2    3    4
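For readers without RcppAlgos, the same 15 subsets can be listed with base R's combn, one subset size at a time. This is a sketch of the equivalence only, not the answer's method; it has no counterpart to the row cap and will be slower on larger inputs.

```r
years <- 2000:2003

# For each subset size k, combn() returns every k-year subset as a list element;
# unlist(..., recursive = FALSE) flattens the per-size lists into one list
subsets <- unlist(
    lapply(seq_along(years), function(k) combn(years, k, simplify = FALSE)),
    recursive = FALSE
)

length(subsets)   # one entry per non-empty subset, i.e. 2^4 - 1
subsets[[5]]      # the first two-year subset
```

The zeros in the comboGeneral matrix play the role of "no year chosen", which is why each row there corresponds to one element of this list.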

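Now we iterate over each row of the combinations, where each row tells us which rows of the original matrix to test for NA. If a particular combination contains only one year, we determine which indices are not NA. This is easily done with !is.na. If we have more than one row, we use complete.cases(t(myRows)) to get the columns that contain only numbers (i.e. no occurrence of NA).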

After that, we simply use those indices to look up the names for the result and, voilà, we have the desired output.
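One possible simplification of the NA check described above is to subset with drop = FALSE, so that a single year still yields a one-row matrix and both cases go through the same code path. The helper name completeCountries is hypothetical and is not part of the original answer.

```r
testMat <- matrix(c( NA,  50, NA, 85,
                    110,  75, 76, 86,
                    120,  NA, 78, 87,
                    130, 100, 80, 88),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(as.character(2000:2003), LETTERS[1:4]))

# Hypothetical helper: names of the columns (countries) that have no NA
# in the selected rows (years)
completeCountries <- function(myMat, rowIdx) {
    sub <- myMat[rowIdx, , drop = FALSE]        # stays a matrix even for one row
    colnames(myMat)[colSums(is.na(sub)) == 0]   # keep columns with zero NAs
}

completeCountries(testMat, 1)            # year 2000 only
completeCountries(testMat, c(1, 2, 4))   # years 2000, 2001 and 2003
```

Zero entries in rowIdx are dropped by R's indexing, so a row of the comboGeneral matrix could be passed in directly.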

Regarding "r - Estimating combinations of panel data in R?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49200510/
