
r - Conditionally drop variables from panel data in R based on a defined number of consecutive observations

Reposted. Author: 行者123. Updated: 2023-12-04 10:32:27

I'm new to R, and my question is as follows:

I have a set of panel data organized as a time series, as shown below (only part of it is shown):

Week_Starting  Team A  Team B  Team C  Team D
2010-01-02     1       2       3       4
2010-01-09     2       40      1       5
2010-01-16     15      <NA>    4       11
2010-01-23     25      <NA>    7       18
2010-01-30     38      <NA>    9       29
2010-02-06     <NA>    <NA>    12      34
2010-02-13     <NA>    <NA>    16      40
2010-02-20     <NA>    <NA>    20      <NA>
2010-02-27     <NA>    <NA>    15      28
2010-03-06     <NA>    <NA>    20      <NA>
2010-03-13     <NA>    <NA>    24      <NA>
2010-03-20     <NA>    <NA>    24      <NA>
2010-03-27     <NA>    <NA>    21      <NA>
2010-04-03     <NA>    <NA>    27      <NA>
2010-04-10     <NA>    <NA>    24      <NA>
2010-04-17     <NA>    <NA>    25      <NA>
2010-04-24     <NA>    <NA>    35      <NA>
2010-05-01     <NA>    <NA>    40      <NA>
2010-05-08     <NA>    <NA>    32      <NA>
2010-05-15     <NA>    <NA>    <NA>    <NA>
2010-05-22     <NA>    <NA>    39      <NA>

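For example, using Team B would be pointless because too many observations are missing. The ranking system does not provide data for ranks below 40. So I want to clean the data by dropping the columns (variables) that do not have at least 8 consecutive weeks of observations (Teams A, B and D in this example). D does not qualify because of the gap in the week starting 2010-02-20, which leaves its longest run at 7 weeks. Bear in mind that I have more than 1000 columns.

I tried this earlier, but it did not give me what I wanted, and unfortunately I am not skilled enough to modify the code to suit my needs.

Some possible solutions I can think of: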

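  1. Subset the parts of each variable that have 8 or more consecutive observations

  2. If a run of 8 consecutive observations contains an NA, set those observations to NA, then drop the columns that contain only NA, since columns that do not meet the 8-week minimum will be left with nothing but NA values (I hope you get what I mean)

Just out of interest, would it be harder to do the same thing if the data were organized in long format?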
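The second idea above could be sketched roughly as follows. This is only an illustration, not code from the original post: the helper name `mask_short_runs`, its `min_run` parameter, and the toy data frame are all made up for the example.

```r
# Hypothetical helper: blank out every observation that is not part of a
# run of at least `min_run` consecutive non-NA values.
mask_short_runs <- function(x, min_run = 8) {
  rr <- rle(!is.na(x))                       # TRUE runs = observed stretches
  keep <- rr$values & rr$lengths >= min_run  # runs long enough to keep
  x[!rep(keep, rr$lengths)] <- NA            # everything else becomes NA
  x
}

# Toy data (made up): `long` has a 9-week run, `short` never reaches 8.
toy <- data.frame(
  week  = 1:10,
  long  = c(5, 6, 7, 8, 9, 10, 11, 12, 13, NA),
  short = c(1, 2, 3, NA, 4, 5, 6, 7, NA, 8)
)

# Apply the mask to every column except the week index.
masked <- toy
masked[-1] <- lapply(toy[-1], mask_short_runs)

# Drop the columns that are now entirely NA.
all_na <- vapply(masked, function(col) all(is.na(col)), logical(1))
cleaned <- masked[, !all_na, drop = FALSE]
names(cleaned)
```

Unlike simply dropping columns, this variant also blanks the short stretches inside a column that is kept, which may or may not be what is wanted.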

# Using MrFlick's data frame (dd, defined in the answer below)
library(reshape2)  # melt() is assumed to come from reshape2

melt(dd, id = "Week_Starting")

Week_Starting variable value
1 2010-01-02 Team_A 1
2 2010-01-09 Team_A 2
3 2010-01-16 Team_A 15
4 2010-01-23 Team_A 25
5 2010-01-30 Team_A 38
6 2010-02-06 Team_A NA
7 2010-02-13 Team_A NA
8 2010-02-20 Team_A NA
9 2010-02-27 Team_A NA
10 2010-03-06 Team_A NA
11 2010-03-13 Team_A NA
12 2010-03-20 Team_A NA
13 2010-03-27 Team_A NA
14 2010-04-03 Team_A NA
15 2010-04-10 Team_A NA
16 2010-04-17 Team_A NA
17 2010-04-24 Team_A NA
18 2010-05-01 Team_A NA
19 2010-05-08 Team_A NA
20 2010-05-15 Team_A NA
21 2010-05-22 Team_A NA
22 2010-01-02 Team_B 2
23 2010-01-09 Team_B 40
24 2010-01-16 Team_B NA
25 2010-01-23 Team_B NA
26 2010-01-30 Team_B NA
27 2010-02-06 Team_B NA
28 2010-02-13 Team_B NA
29 2010-02-20 Team_B NA
30 2010-02-27 Team_B NA
31 2010-03-06 Team_B NA
32 2010-03-13 Team_B NA
33 2010-03-20 Team_B NA
34 2010-03-27 Team_B NA
35 2010-04-03 Team_B NA
36 2010-04-10 Team_B NA
37 2010-04-17 Team_B NA
38 2010-04-24 Team_B NA
39 2010-05-01 Team_B NA
40 2010-05-08 Team_B NA
41 2010-05-15 Team_B NA
42 2010-05-22 Team_B NA
43 2010-01-02 Team_C 3
44 2010-01-09 Team_C 1
45 2010-01-16 Team_C 4
46 2010-01-23 Team_C 7
47 2010-01-30 Team_C 9
48 2010-02-06 Team_C 12
49 2010-02-13 Team_C 16
50 2010-02-20 Team_C 20
51 2010-02-27 Team_C 15
52 2010-03-06 Team_C 20
53 2010-03-13 Team_C 24
54 2010-03-20 Team_C 24
55 2010-03-27 Team_C 21
56 2010-04-03 Team_C 27
57 2010-04-10 Team_C 24
58 2010-04-17 Team_C 25
59 2010-04-24 Team_C 35
60 2010-05-01 Team_C 40
61 2010-05-08 Team_C 32
62 2010-05-15 Team_C NA
63 2010-05-22 Team_C 39
64 2010-01-02 Team_D 4
65 2010-01-09 Team_D 5
66 2010-01-16 Team_D 11
67 2010-01-23 Team_D 18
68 2010-01-30 Team_D 29
69 2010-02-06 Team_D 34
70 2010-02-13 Team_D 40
71 2010-02-20 Team_D NA
72 2010-02-27 Team_D 28
73 2010-03-06 Team_D NA
74 2010-03-13 Team_D NA
75 2010-03-20 Team_D NA
76 2010-03-27 Team_D NA
77 2010-04-03 Team_D NA
78 2010-04-10 Team_D NA
79 2010-04-17 Team_D NA
80 2010-04-24 Team_D NA
81 2010-05-01 Team_D NA
82 2010-05-08 Team_D NA
83 2010-05-15 Team_D NA
84 2010-05-22 Team_D NA

Any suggestions?

Best answer

You can use rle to calculate the run lengths of non-NA values. First, here is your data as a data.frame that can be copied and pasted.

dd<-structure(list(Week_Starting = structure(1:21, .Label = c("2010-01-02", 
"2010-01-09", "2010-01-16", "2010-01-23", "2010-01-30", "2010-02-06",
"2010-02-13", "2010-02-20", "2010-02-27", "2010-03-06", "2010-03-13",
"2010-03-20", "2010-03-27", "2010-04-03", "2010-04-10", "2010-04-17",
"2010-04-24", "2010-05-01", "2010-05-08", "2010-05-15", "2010-05-22"
), class = "factor"), Team_A = c(1L, 2L, 15L, 25L, 38L, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Team_B = c(2L,
40L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA), Team_C = c(3L, 1L, 4L, 7L, 9L, 12L, 16L,
20L, 15L, 20L, 24L, 24L, 21L, 27L, 24L, 25L, 35L, 40L, 32L, NA,
39L), Team_D = c(4L, 5L, 11L, 18L, 29L, 34L, 40L, NA, 28L, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Week_Starting",
"Team_A", "Team_B", "Team_C", "Team_D"), class = "data.frame", row.names = c(NA,
-21L))

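Now we define a function that computes the longest run of non-NA values in a vector

consecnonNA <- function(x) {
  # Run-length encode the NA mask: TRUE runs are gaps, FALSE runs are observations
  rr <- rle(is.na(x))
  # Longest run of observed values; the leading 0 covers an all-NA column
  max(0L, rr$lengths[!rr$values])
}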
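To see what rle returns here, it may help to walk through one column by hand. The sketch below copies the Team_D values from the question into a standalone vector (the variable names `team_d`, `rr` and `longest` are chosen just for this illustration).

```r
# Team_D from the question: 7 observed weeks, a gap, one more week, then NAs.
team_d <- c(4, 5, 11, 18, 29, 34, 40, NA, 28, rep(NA, 12))

rr <- rle(is.na(team_d))
rr$lengths  # 7 1 1 12
rr$values   # FALSE TRUE FALSE TRUE

# Longest run of non-NA values: the lengths whose value is FALSE.
longest <- max(rr$lengths[!rr$values])
longest     # 7, one week short of the 8-week requirement
```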

We can compute this value for each column and return the names of the columns that have at least 8 consecutive weeks

atleast <- function(i) {function(x) x>=i}
hasatleast8 <- names(Filter(atleast(8), sapply(dd[,-1], consecnonNA)))

Then we can subset with

dd[, c("Week_Starting", hasatleast8), drop = FALSE]
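On the long-format question: the same run-length idea seems to carry over by computing the longest run per group rather than per column. The following is a minimal base-R sketch, not part of the original answer. The data, the team names `Team_X` and `Team_Y`, and the helper `longest_run` are made up, and the rows are assumed to already be in date order within each team.

```r
# Same idea as the per-column helper, with a guard for all-NA input.
longest_run <- function(x) {
  rr <- rle(is.na(x))
  max(0L, rr$lengths[!rr$values])
}

# Small long-format example (made-up values, sorted by week within team).
long <- data.frame(
  Week_Starting = rep(1:10, times = 2),
  variable = rep(c("Team_X", "Team_Y"), each = 10),
  value = c(1:9, NA,             # Team_X: 9 consecutive weeks
            1, 2, NA, 3:8, NA)   # Team_Y: longest run is 6
)

# Longest run per team, then keep only the rows of teams that reach 8.
runs <- tapply(long$value, long$variable, longest_run)
keep <- names(runs)[runs >= 8]
long_clean <- long[long$variable %in% keep, ]
unique(long_clean$variable)
```

If the long data is not sorted, it would need to be ordered by team and week first, since rle only looks at adjacent elements.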

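Regarding "r - Conditionally drop variables from panel data in R based on a defined number of consecutive observations", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/24600716/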
