
r - Match controls to cases using multiple conditions in R

Reposted. Author: 行者123. Updated: 2023-12-03 23:32:57

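I want to match 2 controls to each case, subject to two conditions:
the difference in age must be within ±2;
the difference in income must be within ±2.
If more than 2 controls match a case, I only need to select 2 of them at random.
Here is an example:
Example
Data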

dat = structure(list(id = c(1, 2, 3, 4, 111, 222, 333, 444, 555, 666, 
777, 888, 999, 1000),
age = c(10, 20, 44, 11, 12, 11, 8, 12, 11, 22, 21, 18, 21, 18),
income = c(35, 72, 11, 35, 37, 36, 33, 70, 34, 74, 70, 44, 76, 70),
group = c("case", "case", "case", "case", "control", "control",
"control", "control", "control", "control", "control",
"control", "control", "control")),
row.names = c(NA, -14L), class = c("tbl_df", "tbl", "data.frame"))

> dat
# A tibble: 14 x 4
id age income group
<dbl> <dbl> <dbl> <chr>
1 1 10 35 case
2 2 20 72 case
3 3 44 11 case
4 4 11 35 case
5 111 12 37 control
6 222 11 36 control
7 333 8 33 control
8 444 12 70 control
9 555 11 34 control
10 666 22 74 control
11 777 21 70 control
12 888 18 44 control
13 999 21 76 control
14 1000 18 70 control
Expected result
For id = 1, the matching controls are listed below; I only need to select 2 of them at random.
|id|age|income|group|
|:----|:----|:----|:----|
|111|12|37|control|
|222|11|36|control|
|333|8|33|control|
|555|11|34|control|
For id = 2, the matching controls are listed below; I only need to select 2 of them at random.
|id|age|income|group|
|:----|:----|:----|:----|
|666|22|74|control|
|777|21|70|control|
|1000|18|70|control|
For id = 3, there are no matching controls in dat.
For id = 4, the matching controls are listed below; I only need to select 2 of them at random.
|id|age|income|group|
|:----|:----|:----|:----|
|111|12|37|control|
|222|11|36|control|
|555|11|34|control|

Note that the candidate controls for id = 1 and id = 4 overlap. I don't want two cases to share a control: if id = 1 takes id = 111 and id = 222 as controls, then id = 4 can only take id = 555; if id = 1 takes id = 111 and id = 333, then id = 4 can only take id = 222 and id = 555.
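
The candidate lists above can be reproduced by checking both conditions for every case/control pair at once. The following is a minimal base R sketch, assuming the same tolerance of 2 for both variables; the names `cases`, `controls`, `tol`, `eligible` and `candidates` are illustrative and do not come from the question.

```r
# Example data copied from the question
dat <- data.frame(
  id     = c(1, 2, 3, 4, 111, 222, 333, 444, 555, 666, 777, 888, 999, 1000),
  age    = c(10, 20, 44, 11, 12, 11, 8, 12, 11, 22, 21, 18, 21, 18),
  income = c(35, 72, 11, 35, 37, 36, 33, 70, 34, 74, 70, 44, 76, 70),
  group  = rep(c("case", "control"), times = c(4, 10))
)

cases    <- dat[dat$group == "case", ]
controls <- dat[dat$group == "control", ]
tol <- 2  # assumed: the same maximum absolute difference for age and income

# eligible[i, j] is TRUE when control j satisfies both conditions for case i
eligible <- abs(outer(cases$age, controls$age, "-")) <= tol &
  abs(outer(cases$income, controls$income, "-")) <= tol
dimnames(eligible) <- list(case = cases$id, control = controls$id)

# Candidate control ids for each case, as a list named by case id
candidates <- lapply(seq_len(nrow(cases)), function(i) controls$id[eligible[i, ]])
names(candidates) <- cases$id

candidates
```

The logical matrix `eligible` also makes the overlap between cases visible: a column with more than one TRUE belongs to a control that is a candidate for several cases.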
The final output could look like this (the ids in the control group are drawn at random from the ids that satisfy the conditions):
|id|age|income|group|
|:----|:----|:----|:----|
|1|10|35|case|
|2|20|72|case|
|3|44|11|case|
|4|11|35|case|
|111|12|37|control|
|222|11|36|control|
|333|8|33|control|
|555|11|34|control|
|777|21|70|control|
|1000|18|70|control|
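Notes
I have looked at several websites, but none of them meets my needs. I don't know how to implement this requirement in R.
Any help would be highly appreciated!
References:
1. https://stackoverflow.com/questions/56026700/is-there-any-package-for-case-control-matching-individual-1n-matching-in-r-n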
2. Case control matching in R (or spss), based on age, sex and ethnicity?
3. Matching case-controls in R using the ccoptimalmatch package
4. Exact Matching in R

Best answer

Based on the revised requirements, I propose the following for loop:

library(dplyr, warn.conflicts = FALSE)

# Split dat by group and store the pieces as `case` and `control`
# in the global environment
dat %>%
  split(.$group) %>%
  list2env(envir = .GlobalEnv)
#> <environment: R_GlobalEnv>

control$FILTER <- FALSE
control
#> # A tibble: 10 x 5
#> id age income group FILTER
#> <dbl> <dbl> <dbl> <chr> <lgl>
#> 1 111 12 37 control FALSE
#> 2 222 11 36 control FALSE
#> 3 333 8 33 control FALSE
#> 4 444 12 70 control FALSE
#> 5 555 11 34 control FALSE
#> 6 666 22 74 control FALSE
#> 7 777 21 70 control FALSE
#> 8 888 18 44 control FALSE
#> 9 999 21 76 control FALSE
#> 10 1000 18 70 control FALSE

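set.seed(123)

# For each case, find the unused controls within ±2 on both age and income,
# then mark up to 2 of them (chosen at random) as used
for (i in seq_len(nrow(case))) {
  x <- which(between(control$age, case$age[i] - 2, case$age[i] + 2) &
             between(control$income, case$income[i] - 2, case$income[i] + 2) &
             !control$FILTER)
  # Index through sample.int(): sample(x, 1) would draw from 1:x
  # whenever x holds a single index
  control$FILTER[x[sample.int(length(x), min(2, length(x)))]] <- TRUE
}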

control
#> # A tibble: 10 x 5
#> id age income group FILTER
#> <dbl> <dbl> <dbl> <chr> <lgl>
#> 1 111 12 37 control TRUE
#> 2 222 11 36 control TRUE
#> 3 333 8 33 control TRUE
#> 4 444 12 70 control FALSE
#> 5 555 11 34 control TRUE
#> 6 666 22 74 control FALSE
#> 7 777 21 70 control TRUE
#> 8 888 18 44 control FALSE
#> 9 999 21 76 control FALSE
#> 10 1000 18 70 control TRUE

bind_rows(case, control) %>% filter(FILTER | is.na(FILTER)) %>% select(-FILTER)
#> # A tibble: 10 x 4
#> id age income group
#> <dbl> <dbl> <dbl> <chr>
#> 1 1 10 35 case
#> 2 2 20 72 case
#> 3 3 44 11 case
#> 4 4 11 35 case
#> 5 111 12 37 control
#> 6 222 11 36 control
#> 7 333 8 33 control
#> 8 555 11 34 control
#> 9 777 21 70 control
#> 10 1000 18 70 control
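
The loop above flags the selected controls but does not keep track of which case each control was matched to. Below is a minimal sketch of one way to keep that information. The helper `match_controls()` and its arguments `n_controls` and `tol` are hypothetical names introduced here for illustration; they are not part of the original answer.

```r
# Greedy 1:n matching sketch (hypothetical helper, not from the original answer)
match_controls <- function(dat, n_controls = 2, tol = 2) {
  cases    <- dat[dat$group == "case", ]
  controls <- dat[dat$group == "control", ]
  used     <- rep(FALSE, nrow(controls))  # TRUE once a control has been taken
  pairs    <- vector("list", nrow(cases))

  for (i in seq_len(nrow(cases))) {
    # Unused controls within the tolerance on both variables
    ok <- which(abs(controls$age - cases$age[i]) <= tol &
                abs(controls$income - cases$income[i]) <= tol &
                !used)
    # Draw positions with sample.int() so a single candidate is handled safely
    pick <- ok[sample.int(length(ok), min(n_controls, length(ok)))]
    used[pick] <- TRUE
    pairs[[i]] <- data.frame(case_id    = rep(cases$id[i], length(pick)),
                             control_id = controls$id[pick])
  }
  do.call(rbind, pairs)
}

# Example data copied from the question
dat <- data.frame(
  id     = c(1, 2, 3, 4, 111, 222, 333, 444, 555, 666, 777, 888, 999, 1000),
  age    = c(10, 20, 44, 11, 12, 11, 8, 12, 11, 22, 21, 18, 21, 18),
  income = c(35, 72, 11, 35, 37, 36, 33, 70, 34, 74, 70, 44, 76, 70),
  group  = rep(c("case", "control"), times = c(4, 10))
)

set.seed(123)
matches <- match_controls(dat)
matches  # one row per (case, control) pair; cases without a match are absent
```

Because cases are processed in row order, a later case can end up with fewer than `n_controls` matches when earlier cases have already taken its candidates. That is the behaviour described in the question for id = 1 and id = 4.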
Checking the result with a different seed
control$FILTER <- FALSE  # reset the selections left over from the previous run
set.seed(234)
for (i in seq_len(nrow(case))) {
  x <- which(between(control$age, case$age[i] - 2, case$age[i] + 2) &
             between(control$income, case$income[i] - 2, case$income[i] + 2) &
             !control$FILTER)
  # Index through sample.int() so that a single remaining candidate is handled safely
  control$FILTER[x[sample.int(length(x), min(2, length(x)))]] <- TRUE
}
control

bind_rows(case, control) %>% filter(FILTER | is.na(FILTER)) %>% select(-FILTER)
# Which controls are kept depends on the seed; every case still receives
# at most 2 controls and no control is used twice.
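
When comparing results across seeds, it can help to verify a matching mechanically rather than by eye. The sketch below assumes the matching is available as a two-column table of `case_id` and `control_id`; that layout and the helper name `check_matching()` are assumptions made for this example. The table `good` reproduces one of the assignments described in the question, and `bad` deliberately shares control 222 between two cases.

```r
# Hypothetical checker: returns one logical flag per requirement
check_matching <- function(dat, matches, n_controls = 2, tol = 2) {
  ca <- dat[match(matches$case_id, dat$id), ]     # case row of each pair
  co <- dat[match(matches$control_id, dat$id), ]  # control row of each pair
  c(
    groups_ok  = all(ca$group == "case") && all(co$group == "control"),
    within_tol = all(abs(ca$age - co$age) <= tol &
                     abs(ca$income - co$income) <= tol),
    not_shared = anyDuplicated(matches$control_id) == 0,
    at_most_n  = all(table(matches$case_id) <= n_controls)
  )
}

# Example data copied from the question
dat <- data.frame(
  id     = c(1, 2, 3, 4, 111, 222, 333, 444, 555, 666, 777, 888, 999, 1000),
  age    = c(10, 20, 44, 11, 12, 11, 8, 12, 11, 22, 21, 18, 21, 18),
  income = c(35, 72, 11, 35, 37, 36, 33, 70, 34, 74, 70, 44, 76, 70),
  group  = rep(c("case", "control"), times = c(4, 10))
)

# A valid assignment taken from the question's own description
good <- data.frame(case_id    = c(1, 1, 2, 2, 4, 4),
                   control_id = c(111, 333, 777, 1000, 222, 555))

# An invalid assignment: control 222 is used by both id = 1 and id = 4
bad <- data.frame(case_id    = c(1, 1, 4, 4),
                  control_id = c(111, 222, 222, 555))

check_matching(dat, good)
check_matching(dat, bad)
```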
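dat modified before proceeding (this version contains cases id 1 to 3 only)
  • Split the data into the two groups, case and control, using base R's `split`
  • Use list2env to save the two as separate data frames
  • Using purrr::map_dfr, you can sample 2 rows for each case
    • once for age
    • once for income
  • Finally, sample 2 rows again from each of these results
  • bind_rows these with case again

    library(tidyverse)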

    dat = structure(list(id = c(1, 2, 3, 111, 222, 333, 444, 555, 666, 777, 888, 999, 1000),
    age = c(10, 20, 44, 12, 11, 8, 12, 11, 22, 21, 18, 21, 18),
    income = c(35, 72, 11, 37, 36, 33, 70, 34, 74, 70, 44, 76, 70),
    group = c("case", "case", "case", "control", "control", "control",
    "control", "control", "control", "control", "control",
    "control", "control")),
    row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"))

    dat
    #> # A tibble: 13 x 4
    #> id age income group
    #> <dbl> <dbl> <dbl> <chr>
    #> 1 1 10 35 case
    #> 2 2 20 72 case
    #> 3 3 44 11 case
    #> 4 111 12 37 control
    #> 5 222 11 36 control
    #> 6 333 8 33 control
    #> 7 444 12 70 control
    #> 8 555 11 34 control
    #> 9 666 22 74 control
    #> 10 777 21 70 control
    #> 11 888 18 44 control
    #> 12 999 21 76 control
    #> 13 1000 18 70 control

    dat %>%
    split(.$group) %>%
    list2env(envir = .GlobalEnv)
    #> <environment: R_GlobalEnv>

    set.seed(123)
    bind_rows(case, map_dfr(case$age, ~ control %>% filter(between(age, .x -2, .x +2) ) %>%
    sample_n(min(n(),2))) %>% sample_n(min(n(),2)),
    map_dfr(case$income, ~ control %>% filter(between(income, .x -2, .x +2)) %>%
    sample_n(min(n(),2))) %>% sample_n(min(n(),2)))
    #> # A tibble: 7 x 4
    #> id age income group
    #> <dbl> <dbl> <dbl> <chr>
    #> 1 1 10 35 case
    #> 2 2 20 72 case
    #> 3 3 44 11 case
    #> 4 222 11 36 control
    #> 5 777 21 70 control
    #> 6 111 12 37 control
    #> 7 333 8 33 control

    The following code does the same without saving the individual data frames:
    dat %>%
    split(.$group) %>%
    {bind_rows(.$case,
    map_dfr(.$case$age, \(.x) .$control %>% filter(between(age, .x -2, .x +2) ) %>%
    sample_n(min(n(),2))) %>% sample_n(min(n(),2)),
    map_dfr(.$case$income, \(.x) .$control %>% filter(between(income, .x -2, .x +2)) %>%
    sample_n(min(n(),2))) %>% sample_n(min(n(),2)))}

Regarding "r - Match controls to cases using multiple conditions in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/68141082/
