gpt4 book ai didi

r - 从 R 中的区间 [start, stop] 数据估计密度

转载 作者:行者123 更新时间:2023-12-05 07:01:49 25 4
gpt4 key购买 nike

描述

这个问题的动机来自临床/流行病学研究,其中研究通常招募患者,然后对他们进行不同时间的跟踪。

研究开始时的年龄分布通常很有趣并且很容易评估,但偶尔也会对研究期间任何时间的年龄分布感兴趣。

我的问题是,是否有一种方法可以根据区间数据(例如 [age_start, age_stop])估计这种密度,而无需如下扩展数据?长格式的方法看起来不够优雅,更不用说它的内存占用了!

使用生存包中的数据的可重现示例

#### Prep Data ###
library(survival)
library(ggplot2)
library(dplyr)

data(colon, package = 'survival')
# example using the colon dataset from the survival package
ccdeath <- colon %>%
# use data on time to death (not recurrence)
filter(etype == 2) %>%
# age at end of follow-up (death or censoring)
mutate(age_last = age + (time / 365.25))

#### Distribution Using Single Value ####
# age at study entry
ggplot(ccdeath, aes(x = age)) +
geom_density() +
labs(title = "Fig 1.",
x = "Age at Entry (years)",
y = "Density")

#### Using Person-Month Level Data ####
# create counting-process/person-time dataset
ccdeath_cp <- survSplit(Surv(age, age_last, status) ~ .,
data = ccdeath,
cut = seq(from = floor(min(ccdeath$age)),
to = ceiling(max(ccdeath$age_last)),
by = 1/12))

nrow(ccdeath_cp) # over 50,000 rows

# distribution of age at person-month level
ggplot(ccdeath_cp, aes(x = age)) +
geom_density() +
labs(title = "Figure 2: Density based on approximate person-months",
x = "Age (years)",
y = "Density")

#### Using Person-Day Level Data ####
# create counting-process/person-time dataset
ccdeath_cp <- survSplit(Surv(age, age_last, status) ~ .,
data = ccdeath,
cut = seq(from = floor(min(ccdeath$age)),
to = ceiling(max(ccdeath$age_last)),
by = 1/365.25))

nrow(ccdeath_cp) # over 1.5 million rows!

# distribution of age at person-month level
ggplot(ccdeath_cp, aes(x = age)) +
geom_density() +
labs(title = "Figure 3: Density based on person-days",
x = "Age (years)",
y = "Density")

Figure 1 Figure 2 Figure 3

注意:虽然我将这个问题标记为“生存”是因为我认为它会吸引熟悉该领域的人,但我对这里的事件发生时间不感兴趣,我只对所有研究时间的总体年龄分布感兴趣。

最佳答案

与其计算越来越精细的时间间隔,您可以只对特定年龄的患者数量进行累积计数

setDT(ccdeath)
x <- rbind(
ccdeath[, .(age = age, num_patients = 1)],
ccdeath[, .(age = age_last, num_patients = -1)]
)[, .(num_patients = sum(num_patients)), keyby = age]

cccdeath <- x[x[, .(age = unique(age))], on = 'age']
cccdeath[, num_patients := cumsum(num_patients)]
ggplot(cccdeath, aes(x = age, y = num_patients)) + geom_step()

enter image description here

锯齿模式是因为假定每个患者的起始年龄都是整数。对如何平滑它有一些想法并提出了这个想法 - 将相等的概率分配给给定 ageage+1 之间的一组均匀间隔的年龄。你得到这样的东西,

smooth_param <- 12
x <- rbindlist(lapply(
(1:smooth_param-0.5)/smooth_param,
function(s) {
rbind(
ccdeath[, .(age = age+s, num_patients = 1/smooth_param)],
ccdeath[, .(age = age_last+s, num_patients = -1/smooth_param)]
)
}
))[, .(num_patients = sum(num_patients)), keyby = age]

cccdeath <- x[x[, .(age = sort(unique(age)))], on = 'age']
cccdeath[, num_patients := cumsum(num_patients)]
ggplot(cccdeath, aes(x = age, y = num_patients)) + geom_step()

enter image description here

关于r - 从 R 中的区间 [start, stop] 数据估计密度,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/63746044/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com