
r - How to perform an operation between grouped values from two data frames

Reposted. Author: 行者123. Updated: 2023-12-04 10:33:55

I have two data frames:


src_tbl <- structure(list(Sample_name = c("S1", "S2", "S1", "S2", "S1", 
"S2"), crt = c(0.079, 0.082, 0.079, 0.082, 0.079, 0.082), sr = c(0.592,
0.549, 0.592, 0.549, 0.592, 0.549), condition = c("x1", "x1",
"x2", "x2", "x3", "x3"), score = c("0.077", "0.075", "0.483",
"0.268", "0.555", "0.120")), row.names = c(NA, -6L), .Names = c("Sample_name",
"crt", "sr", "condition", "score"), class = c("tbl_df",
"tbl", "data.frame"))
src_tbl
#> Sample_name crt sr condition score
#> 1 S1 0.079 0.592 x1 0.077
#> 2 S2 0.082 0.549 x1 0.075
#> 3 S1 0.079 0.592 x2 0.483
#> 4 S2 0.082 0.549 x2 0.268
#> 5 S1 0.079 0.592 x3 0.555
#> 6 S2 0.082 0.549 x3 0.120

ref_tbl <- structure(list(Sample_name = c("P1", "P2", "P3", "P1", "P2",
"P3", "P1", "P2", "P3"), crt = c(1, 1, 1, 1, 1, 1, 1, 1, 1),
sr = c(2, 2, 2, 2, 2, 2, 2, 2, 2), condition = c("r1", "r1",
"r1", "r2", "r2", "r2", "r3", "r3", "r3"), score = c("0.200",
"0.201", "0.199", "0.200", "0.202", "0.200", "0.200", "0.204",
"0.197")), row.names = c(NA, -9L), .Names = c("Sample_name",
"crt", "sr", "condition", "score"), class = c("tbl_df",
"tbl", "data.frame"))
ref_tbl
#> Sample_name crt sr condition score
#> 1 P1 1 2 r1 0.200
#> 2 P2 1 2 r1 0.201
#> 3 P3 1 2 r1 0.199
#> 4 P1 1 2 r2 0.200
#> 5 P2 1 2 r2 0.202
#> 6 P3 1 2 r2 0.200
#> 7 P1 1 2 r3 0.200
#> 8 P2 1 2 r3 0.204
#> 9 P3 1 2 r3 0.197

What I want to do is run ks.test() on the score column, grouped by Sample_name, between the two data frames. For example, the p-value of the KS test for S1 versus P1 is:


# in src_tbl
s1 <- c(0.077,0.483,0.555)
#in ref_tbl
p1 <- c(0.200,0.200,0.200)
testout <- ks.test(s1,p1)
#> Warning in ks.test(s1, p1): cannot compute exact p-value with ties
broom::tidy(testout)
#> statistic p.value method alternative
#> 1 0.6666667 0.5175508 Two-sample Kolmogorov-Smirnov test two-sided
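
The two vectors above were typed in by hand. As a minimal sketch, they can also be pulled straight from the tables; note that score is stored as character in both tables, so it needs as.numeric() first. The compact data.frame() definitions below are stand-ins for the question's tibbles (only the two columns used here), and exact = FALSE is an assumption made to request the asymptotic p-value explicitly, which is what the ties warning above indicates R fell back to.

```r
# Compact stand-ins for the question's tables (only the columns used here).
src_tbl <- data.frame(
  Sample_name = rep(c("S1", "S2"), times = 3),
  score = c("0.077", "0.075", "0.483", "0.268", "0.555", "0.120"),
  stringsAsFactors = FALSE
)
ref_tbl <- data.frame(
  Sample_name = rep(c("P1", "P2", "P3"), times = 3),
  score = c("0.200", "0.201", "0.199", "0.200", "0.202", "0.200",
            "0.200", "0.204", "0.197"),
  stringsAsFactors = FALSE
)

# Select one sample's scores and convert them from character to numeric.
s1 <- as.numeric(src_tbl$score[src_tbl$Sample_name == "S1"])
p1 <- as.numeric(ref_tbl$score[ref_tbl$Sample_name == "P1"])

# Two-sample KS test; p1 contains ties, so R may still emit a warning.
testout <- ks.test(s1, p1, exact = FALSE)

# The test object is a list, so the pieces can be read without broom.
ks_stat <- unname(testout$statistic)
ks_p <- testout$p.value
```

Reading testout$p.value directly avoids the broom dependency when only the p-value is needed.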

I want to run this for every src/ref pair, so that I end up with a table like this:

src  ref   p.value
S1 P1 0.5175508
S1 P2 0.6
S1 P3 0.6
S2 P1 0.5175508
S2 P2 0.6
S2 P3 0.6

How can I do this? Preferably fast, because the number of samples in ref_tbl may be large (P1, P2, ..., P10k).

Best answer

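Here is a solution using the tidyverse. I first nest the scores within each source dataset:

library(dplyr)  # provides %>%, mutate(), select(), left_join(), bind_cols()

# Convert the character scores to numeric, then nest them per reference sample.
ref_tbl <- ref_tbl %>%
  mutate(ref = as.factor(Sample_name),
         score_ref = as.numeric(score)) %>%
  select(ref, score_ref) %>%
  tidyr::nest(data = score_ref)  # current tidyr expects a named argument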

ref_tbl
# A tibble: 3 x 2
ref data
<fctr> <list>
1 P1 <tibble [3 x 1]>
2 P2 <tibble [3 x 1]>
3 P3 <tibble [3 x 1]>

# Same treatment for the source samples.
src_tbl <- src_tbl %>%
  mutate(src = as.factor(Sample_name),
         score_src = as.numeric(score)) %>%
  select(src, score_src) %>%
  tidyr::nest(data = score_src)  # current tidyr expects a named argument

src_tbl
# A tibble: 2 x 2
src data
<fctr> <list>
1 S1 <tibble [3 x 1]>
2 S2 <tibble [3 x 1]>

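Then I create a grid with all combinations of the sample names:
# as_data_frame() is deprecated; as_tibble() is its replacement.
all_comb <- tibble::as_tibble(expand.grid(src = src_tbl$src, ref = ref_tbl$ref))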

all_comb
# A tibble: 6 x 2
src ref
<fctr> <fctr>
1 S1 P1
2 S2 P1
3 S1 P2
4 S2 P2
5 S1 P3
6 S2 P3

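Now we can join the nested data, and I bind the columns so that each combination has one list-column holding both sets of scores.
all_comb <- all_comb %>%
  left_join(ref_tbl, by = "ref") %>%   # adds the reference scores as data.x
  left_join(src_tbl, by = "src") %>%   # adds the source scores as data.y
  mutate(data = purrr::map2(data.x, data.y, bind_cols)) %>%
  select(-data.x, -data.y)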

all_comb
# A tibble: 6 x 3
src ref data
<fctr> <fctr> <list>
1 S1 P1 <tibble [3 x 2]>
2 S2 P1 <tibble [3 x 2]>
3 S1 P2 <tibble [3 x 2]>
4 S2 P2 <tibble [3 x 2]>
5 S1 P3 <tibble [3 x 2]>
6 S2 P3 <tibble [3 x 2]>

Finally, I map ks.test over each dataset, using broom to extract the p.value as requested.
final <- all_comb %>%
  mutate(ks = purrr::map(data, ~ks.test(.$score_ref, .$score_src)),
         tidied = purrr::map(ks, broom::tidy)) %>%
  tidyr::unnest(cols = tidied) %>%
  select(src, ref, p.value)
#> Warning message: cannot compute exact p-value with ties
#> Warning message: cannot compute exact p-value with ties

final
# A tibble: 6 x 3
src ref p.value
<fctr> <fctr> <dbl>
1 S1 P1 0.5175508
2 S2 P1 0.5175508
3 S1 P2 0.6000000
4 S2 P2 0.6000000
5 S1 P3 0.6000000
6 S2 P3 0.6000000
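
Since the question asks for speed with many reference samples, here is a minimal base-R sketch of the same pairwise computation that skips the nesting and joins. It is not the answer author's method, just an alternative under stated assumptions: the compact data.frame() definitions are stand-ins for the original tables, pairs is a hypothetical name, and warnings about ties are suppressed for brevity. Unlike the bind_cols() step above, it does not need each src and ref sample to have the same number of scores.

```r
# Compact stand-ins for the question's tables (only the columns used here).
src_tbl <- data.frame(
  Sample_name = rep(c("S1", "S2"), times = 3),
  score = c("0.077", "0.075", "0.483", "0.268", "0.555", "0.120"),
  stringsAsFactors = FALSE
)
ref_tbl <- data.frame(
  Sample_name = rep(c("P1", "P2", "P3"), times = 3),
  score = c("0.200", "0.201", "0.199", "0.200", "0.202", "0.200",
            "0.200", "0.204", "0.197"),
  stringsAsFactors = FALSE
)

# Split the numeric scores into one vector per sample name.
src_scores <- split(as.numeric(src_tbl$score), src_tbl$Sample_name)
ref_scores <- split(as.numeric(ref_tbl$score), ref_tbl$Sample_name)

# One row per (src, ref) pair; the first column varies fastest.
pairs <- expand.grid(src = names(src_scores), ref = names(ref_scores),
                     stringsAsFactors = FALSE)

# Run the two-sample KS test for each pair and keep only the p-value.
# suppressWarnings() hides the ties warning seen in the original output.
pairs$p.value <- mapply(
  function(s, r) {
    suppressWarnings(ks.test(src_scores[[s]], ref_scores[[r]]))$p.value
  },
  pairs$src, pairs$ref,
  USE.NAMES = FALSE
)

pairs
```

Whether this is actually faster on 10k reference samples would need benchmarking on the real data; the sketch only removes the intermediate tibbles, it does not change the number of ks.test() calls.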

Regarding "r - How to perform an operation between grouped values from two data frames", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43772986/
