
r - Vectorizing a complex dplyr statement in R

Reposted. Author: 行者123. Updated: 2023-12-04 10:30:03

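I am trying to work out the percentage of students taking a course, out of those students who are able to take it. Not all schools offer computing, and a different set of schools offers English, so the students able to take computing and those able to take English will differ. For example, using the test data below, we would get: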

df <- read.csv(text="school, student, course, result
URN1,stu1,comp,A
URN1,stu2,comp,B
URN1,stu3,comp,C
URN1,stu1,Eng,D
URN1,stu1,ICT,E
URN2,stu4,comp,A
URN1,stu1,ICT,B
URN2,stu5,comp,C
URN3,stu6,comp,D
URN3,stu6,ICT,E
URN4,stu7,Eng,E
URN4,stu8,ICT,E
URN4,stu8,Eng,E
URN5,stu9,comp,E
URN5,stu10,ICT,E")

[1] "comp taken by 58.3333333333333 % of possible students"
[1] "Eng taken by 33.3333333333333 % of possible students"
[1] "ICT taken by 38.4615384615385 % of possible students"



I have the following loop (boo!) to do this:
library(magrittr)  # provides the %$% exposition pipe
library(dplyr)

for (c in unique(df$course)) {
  # c <- "comp"  # uncomment to step through a single course
  # URNs of the schools offering course c
  URNs <- df %>% filter(course == c) %>% distinct(school) %$% school
  # number of rows (student entries) in the schools offering course c
  num_possible <- df %>% filter(school %in% URNs) %>% summarise(n = n()) %$% n
  # number of rows (student entries) for course c
  num_actual <- df %>% filter(course == c) %>% summarise(n = n()) %$% n

  # % of students taking course c out of those who could theoretically take it
  print(paste(c, "taken by", (100 * num_actual / num_possible), "% of possible students"))
}

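But I would like to vectorize the whole thing. However, I cannot get num_possible into the same call as num_actual:
df %>% group_by(course) %>% summarise(num_possible = somesubfunction(),
                                      num_actual = n())

somesubfunction() should return the number of students who could possibly take course c.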
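One way to fold num_possible into a single pipeline is to pre-compute the size of each school and join it back on before summarising. The following is only a minimal sketch, not the accepted answer: it assumes, like the loop above, that every row counts as one student entry, and the names school_size, school_rows, n_rows and pct are made up for illustration.

```r
library(dplyr)

df <- read.csv(text = "school,student,course,result
URN1,stu1,comp,A
URN1,stu2,comp,B
URN1,stu3,comp,C
URN1,stu1,Eng,D
URN1,stu1,ICT,E
URN2,stu4,comp,A
URN1,stu1,ICT,B
URN2,stu5,comp,C
URN3,stu6,comp,D
URN3,stu6,ICT,E
URN4,stu7,Eng,E
URN4,stu8,ICT,E
URN4,stu8,Eng,E
URN5,stu9,comp,E
URN5,stu10,ICT,E")

# Rows per school: the loop's notion of "students who could take a course"
# (assumption: one row = one student entry, as in the loop).
school_size <- df %>% count(school, name = "school_rows")

result <- df %>%
  count(school, course, name = "n_rows") %>%   # one row per school/course pair
  left_join(school_size, by = "school") %>%    # attach the size of each offering school
  group_by(course) %>%
  summarise(num_actual   = sum(n_rows),        # rows for this course
            num_possible = sum(school_rows),   # rows in all schools offering it
            pct          = 100 * num_actual / num_possible)

print(result)
```

On the test data this yields 58.33, 33.33 and 38.46 percent for comp, Eng and ICT, the same figures the loop prints. Counting school/course pairs first ensures each offering school contributes its size exactly once per course.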

Best answer

If you are keen to try something other than dplyr, you can try data.table:

library(data.table)

setDT(df)[, nb_stu := .N, by = course]  # number of rows (student entries) per course
# number of distinct students per school
# (edited to avoid counting a student twice if they take multiple courses)
df[, nb_stu_ec := length(unique(student)), by = school]

# finally, divide the number of students on a course by the number of students
# in the schools that offer it (sprintf is only for formatting the result):
df[, sprintf("%.2f", 100 * first(nb_stu) / sum(nb_stu_ec[!duplicated(school)])), by = course]
#    course    V1
# 1:   comp 87.50
# 2:    Eng 60.00
# 3:    ICT 62.50

Nota bene: if you compute the number of students per course only in the last step, you can get the same result with one step fewer:
setDT(df)[, nb_stu_ec := length(unique(student)), by = school]
df[, sprintf("%.2f", 100 * .N / sum(nb_stu_ec[!duplicated(school)])), by = course]
#    course    V1
# 1:   comp 87.50
# 2:    Eng 60.00
# 3:    ICT 62.50
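Note that these percentages differ from the ones in the question: the loop there divides by the number of rows in the offering schools, whereas nb_stu_ec counts distinct students per school. If the question's figures are the intended ones, a hedged variant that swaps in a per-school row count might look like this (dt, school_rows, res and pct are illustrative names, not part of the answer):

```r
library(data.table)

dt <- as.data.table(read.csv(text = "school,student,course,result
URN1,stu1,comp,A
URN1,stu2,comp,B
URN1,stu3,comp,C
URN1,stu1,Eng,D
URN1,stu1,ICT,E
URN2,stu4,comp,A
URN1,stu1,ICT,B
URN2,stu5,comp,C
URN3,stu6,comp,D
URN3,stu6,ICT,E
URN4,stu7,Eng,E
URN4,stu8,ICT,E
URN4,stu8,Eng,E
URN5,stu9,comp,E
URN5,stu10,ICT,E"))

# Rows per school (assumption: one row = one student entry, as in the question's loop).
dt[, school_rows := .N, by = school]

# Per course: rows for the course divided by the rows of every school offering it.
# !duplicated(school) keeps each offering school's size exactly once.
res <- dt[, .(pct = 100 * .N / sum(school_rows[!duplicated(school)])), by = course]

print(res)
```

With the test data this gives 58.33, 33.33 and 38.46 percent for comp, Eng and ICT. Which denominator is the right one depends on whether a student taking several courses should be counted once or once per entry.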

Regarding "r - Vectorizing a complex dplyr statement in R", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49010485/
