
r - Summarise multiple columns with dplyr - categorical version

Reposted. Author: 行者123. Updated: 2023-12-04 12:49:14

Following this question and this one, I would like to know the best option for summarising the categorical variables in a single dataset.

I have a dataset such as

# A tibble: 10 × 4
  empstat_couple     nssec7_couple3 nchild07 age_couple
           <chr>             <fctr>   <fctr>      <dbl>
1       Neo-Trad   Lower Managerial   1child         39
2       Neo-Trad  Higher Managerial   1child         31
3       Neo-Trad Manual and Routine   1child         33
4           Trad  Higher Managerial   1child         43

The first 3 variables are categorical (character or factor) and the last one is numeric.

What I want is (output)

                  var n   p
1:           Neo-Trad 6 0.6
2:    OtherArrangment 2 0.2
3:               Trad 2 0.2
4:  Higher Managerial 4 0.4
5:   Lower Managerial 5 0.5
6: Manual and Routine 1 0.1
7:             1child 9 0.9
8:          2children 1 0.1

As for the numeric variable, I am not sure how to add it to the summary in a meaningful way.

I guess the most basic approach would be

library(dplyr) 
library(data.table)

a = count(dt, empstat_couple) %>% mutate(p = n / sum(n))
b = count(dt, nssec7_couple3) %>% mutate(p = n / sum(n))
c = count(dt, nchild07) %>% mutate(p = n / sum(n))

rbindlist(list(a,b,c))

I wonder whether there is a summarise_each solution?

This does not work

dt %>% summarise_each(funs(count))

Using apply, I can come up with this

apply(dt, 2, as.data.frame(table)) %>% rbindlist()

But it does not work very well.

Any suggestions?

Data

dt = structure(list(empstat_couple = c("Neo-Trad", "Neo-Trad", "Neo-Trad", 
"Trad", "OtherArrangment", "Neo-Trad", "Trad", "OtherArrangment",
"Neo-Trad", "Neo-Trad"), nssec7_couple3 = structure(c(2L, 1L,
4L, 1L, 2L, 2L, 1L, 2L, 1L, 2L), .Label = c("Higher Managerial",
"Lower Managerial", "Intermediate", "Manual and Routine"), class = "factor"),
nchild07 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
1L), .Label = c("1child", "2children", ">2children"), class = "factor"),
age_couple = c(39, 31, 33, 43, 32, 28, 28, 40, 33, 26), hldid = 1:10), .Names = c("empstat_couple",
"nssec7_couple3", "nchild07", "age_couple", "hldid"), row.names = c(NA,
-10L), class = "data.frame")

Best answer

We can use melt from data.table to get the counts (.N) and the proportions

library(data.table)
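# hldid is an id column in the sample data, so keep it out of the melted values
unique(melt(setDT(dt), id.vars = c("age_couple", "hldid"))[, n := .N, value],
  by = c("variable", "value", "n"))[, p := n/sum(n), variable
  ][, c("age_couple", "hldid", "variable") := NULL][]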
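The same counts can probably also be obtained by aggregating directly, rather than adding n to every row and de-duplicating afterwards. The following is only a sketch under assumptions: dt_demo, cat_cols, long and res are illustrative names that do not appear in the original answer, and the categorical columns are stored as plain character vectors (the question's data mixes character and factor columns, which melt would coerce to character with a warning).

```r
library(data.table)

# Illustrative copy of the question's data (values taken from the dput output;
# factors replaced by character vectors, hldid omitted for brevity)
dt_demo <- data.table(
  empstat_couple = c("Neo-Trad", "Neo-Trad", "Neo-Trad", "Trad", "OtherArrangment",
                     "Neo-Trad", "Trad", "OtherArrangment", "Neo-Trad", "Neo-Trad"),
  nssec7_couple3 = c("Lower Managerial", "Higher Managerial", "Manual and Routine",
                     "Higher Managerial", "Lower Managerial", "Lower Managerial",
                     "Higher Managerial", "Lower Managerial", "Higher Managerial",
                     "Lower Managerial"),
  nchild07 = c(rep("1child", 8), "2children", "1child"),
  age_couple = c(39, 31, 33, 43, 32, 28, 28, 40, 33, 26)
)

# Name the categorical columns explicitly so numeric columns are never melted
cat_cols <- c("empstat_couple", "nssec7_couple3", "nchild07")

# Long format: one row per (observation, categorical variable) pair
long <- melt(dt_demo, measure.vars = cat_cols,
             variable.name = "variable", value.name = "var")

# Count each level within its variable, then compute within-variable proportions
res <- long[, .(n = .N), by = c("variable", "var")]
res[, p := n / sum(n), by = "variable"]
res[, variable := NULL]

print(res)
```

Listing the measure columns by name means extra numeric or id columns in the data do not need to be excluded one by one.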

Or with dplyr/tidyr

library(dplyr)
library(tidyr)
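# hldid is an id column in the sample data, so exclude it from the gathered values
gather(dt, var1, var, -age_couple, -hldid) %>%
  group_by(var) %>%
  mutate(n = n()) %>%
  select(-age_couple, -hldid) %>%
  unique() %>%
  group_by(var1) %>%
  mutate(p = n/sum(n)) %>%
  ungroup() %>%
  select(-var1)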
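In current tidyr, gather is superseded by pivot_longer, so the same idea could be sketched as below. This is not part of the original answer: df_demo, cat_cols and res are illustrative names, and the categorical columns are assumed to be character vectors (factor columns could be converted with as.character first so that they combine cleanly in one value column).

```r
library(dplyr)
library(tidyr)

# Illustrative copy of the question's data (values taken from the dput output;
# factors replaced by character vectors, hldid omitted for brevity)
df_demo <- data.frame(
  empstat_couple = c("Neo-Trad", "Neo-Trad", "Neo-Trad", "Trad", "OtherArrangment",
                     "Neo-Trad", "Trad", "OtherArrangment", "Neo-Trad", "Neo-Trad"),
  nssec7_couple3 = c("Lower Managerial", "Higher Managerial", "Manual and Routine",
                     "Higher Managerial", "Lower Managerial", "Lower Managerial",
                     "Higher Managerial", "Lower Managerial", "Higher Managerial",
                     "Lower Managerial"),
  nchild07 = c(rep("1child", 8), "2children", "1child"),
  age_couple = c(39, 31, 33, 43, 32, 28, 28, 40, 33, 26),
  stringsAsFactors = FALSE
)

# Name the categorical columns explicitly so numeric columns are left alone
cat_cols <- c("empstat_couple", "nssec7_couple3", "nchild07")

res <- df_demo %>%
  # One row per (observation, categorical variable) pair
  pivot_longer(cols = all_of(cat_cols), names_to = "variable", values_to = "var") %>%
  # Count each level within its variable
  count(variable, var) %>%
  # Proportions are computed within each original variable
  group_by(variable) %>%
  mutate(p = n / sum(n)) %>%
  ungroup() %>%
  select(-variable)

print(res)
```

count replaces the mutate(n = n()) plus unique() steps, since it returns one row per level directly.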

Regarding "r - Summarise multiple columns with dplyr - categorical version", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41460866/
