
r - Number of rows by ID

Reposted. Author: 行者123. Updated: 2023-12-04 04:45:00

The dataset contains three variables: id, sex, and grade (a factor).

mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4), sex=c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1),
grade=c("a","b","c","d","e", "x","y","y","x", "q","q","q","q", "a", "a", "a", NA, "b"))

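For each ID, I need to see how many unique grades there are, and then create a new column (called N) that records that count. For example, for ID = 1 there are five unique values of grade, so N = 5; for ID = 2 there are two unique values of grade, so N = 2; for ID = 4 there are two unique values of grade (ignoring NA), so N = 2.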

The final dataset is:
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4), sex=c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1),
grade=c("a","b","c","d","e", "x","y","y","x", "q","q","q","q", "a", "a", "a", NA, "b"))
mydata$N <- c(5,5,5,5,5,2,2,2,2,1,1,1,1,2,2,2,2,2)

Best answer

You can use the data.table package:

library(data.table)
setDT(mydata)

#I have removed NA's, up to you how to count them
mydata[,N_u:=length(unique(grade[!is.na(grade)])),by=id]
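The benchmark further down compares this call with a variant labelled uniqueN, but the exact expression is not shown. The following is a guess at what it looked like, assuming it used data.table's uniqueN() in place of length(unique()). It works on a copy named dt (a name introduced here for illustration) so the original data frame is left untouched.

```r
library(data.table)

# Example data from the question
mydata <- data.frame(
  id    = c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4),
  sex   = c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1),
  grade = c("a","b","c","d","e", "x","y","y","x",
            "q","q","q","q", "a","a","a",NA,"b")
)

# dt is a hypothetical name; as.data.table() makes a copy
dt <- as.data.table(mydata)

# Count distinct non-NA grades within each id
dt[, N_u := uniqueN(grade[!is.na(grade)]), by = id]
```

As with the length(unique()) version, NA values are dropped before counting; keep them in if they should count as a separate grade.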

Very short, readable, and fast. It can also be done in base R:
#lapply(split(grade,id),...: splits data into subsets by id
#unlist: creates one vector out of multiple vectors
#rep: makes sure each ID is repeated enough times

mydata$N <- unlist(lapply(split(mydata$grade, mydata$id), function(x) {
  rep(length(unique(x[!is.na(x)])), length(x))
}))
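One caveat with the split/unlist approach: the result comes back grouped by id, so it only lines up with the rows when the data is already sorted by id (as it is here). A possible alternative, not part of the original answer, is base R's ave(), which writes each group's result back to that group's original row positions. In this sketch the grades are first converted to integer codes so that ave() returns numbers; count_unique and grade_codes are names introduced for illustration.

```r
# Example data from the question
mydata <- data.frame(
  id    = c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4),
  sex   = c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1),
  grade = c("a","b","c","d","e", "x","y","y","x",
            "q","q","q","q", "a","a","a",NA,"b")
)

# Hypothetical helper: number of distinct non-NA values
count_unique <- function(x) length(unique(x[!is.na(x)]))

# Integer codes for the grades; NA stays NA
grade_codes <- as.integer(factor(mydata$grade))

# ave() applies the helper per id and keeps the row order
mydata$N <- ave(grade_codes, mydata$id, FUN = count_unique)

# Same call on rows in reverse order still aligns with each row's id
rev_data <- mydata[nrow(mydata):1, ]
rev_N <- ave(as.integer(factor(rev_data$grade)), rev_data$id, FUN = count_unique)
```

Here rev_N equals rev(mydata$N), which the unlist(lapply(split(...))) version would not guarantee for unsorted input.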

Since there was some discussion about which is faster, let's run some benchmarks.

With the given dataset:
> test1
Unit: milliseconds
          expr      min       lq     mean   median       uq       max neval cld
 length_unique 3.043186 3.161732 3.422327 3.286436 3.477854 10.627030   100   b
       uniqueN 2.481761 2.615190 2.763192 2.738354 2.872809  3.985393   100  a

With a larger dataset (10000 observations, 1000 IDs):
> test2
Unit: milliseconds
          expr      min       lq     mean   median       uq       max neval cld
 length_unique 11.84123 24.47122 37.09234 30.34923 47.55632  97.63648   100  a
       uniqueN 25.83680 50.70009 73.78757 62.33655 97.33934 210.97743   100   b
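The code behind these timings is not shown. The output format matches the microbenchmark package, so a comparable run might look like the sketch below. The way the larger dataset is generated (random ids and grades, fixed seed), the name big, and the exact uniqueN expression are all assumptions, and timings will differ by machine and package version.

```r
library(data.table)
library(microbenchmark)

# Hypothetical larger dataset: 10000 observations, ids drawn from 1..1000
set.seed(1)
big <- data.table(
  id    = sample(1000, 10000, replace = TRUE),
  grade = sample(c(letters, NA), 10000, replace = TRUE)
)

# Time both ways of counting distinct non-NA grades per id
test2 <- microbenchmark(
  length_unique = big[, N_u := length(unique(grade[!is.na(grade)])), by = id],
  uniqueN       = big[, N_u := uniqueN(grade[!is.na(grade)]), by = id],
  times = 100
)
print(test2)
```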

Regarding "r - Number of rows by ID", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34007199/
