r - Concatenate pairs of columns across an entire data frame

Reposted. Author: 行者123. Updated: 2023-12-04 09:19:38

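I am working with genetic data and need to concatenate pairs of columns. My data has the major and minor alleles in separate columns (e.g., allele 1a, allele 1b, allele 2a, allele 2b, and so on). I need a way to combine the columns pairwise across the entire data frame. I have included an example below, but my data has 1.7 million pairs (so I currently have 3.4 million columns), so anything that requires naming each column will not work. I will change the column names later. If there is a way to do this in R, it would be greatly appreciated. I have tried to create a sequence and paste them, something like: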

df <- data.frame(id = seq(1,20),
var1 = rep("A", 20),
var2 = c(rep("T", 10), rep("A", 10)),
var3 = rep("C", 20),
var4 = c(rep("C", 10), rep("G", 10)),
var5 = rep("A", 20),
var6 = c(rep("A", 10), rep("G", 10)),
stringsAsFactors = FALSE)

i <- seq.int(1, length(ped), by = 2L)
df <- paste0(df[i], df[i+1])

But this did not work. I want to go from:

    id var1 var2 var3 var4 var5 var6
1 1 A T C C A A
2 2 A T C C A A
3 3 A T C C A A
4 4 A T C C A A
5 5 A T C C A A
6 6 A T C C A A
7 7 A T C C A A
8 8 A T C C A A
9 9 A T C C A A
10 10 A T C C A A
11 11 A A C G A G
12 12 A A C G A G
13 13 A A C G A G
14 14 A A C G A G
15 15 A A C G A G
16 16 A A C G A G
17 17 A A C G A G
18 18 A A C G A G
19 19 A A C G A G
20 20 A A C G A G

to:

   id var1 var2 var3
1 1 AT CC AA
2 2 AT CC AA
3 3 AT CC AA
4 4 AT CC AA
5 5 AT CC AA
6 6 AT CC AA
7 7 AT CC AA
8 8 AT CC AA
9 9 AT CC AA
10 10 AT CC AA
11 11 AA CG AG
12 12 AA CG AG
13 13 AA CG AG
14 14 AA CG AG
15 15 AA CG AG
16 16 AA CG AG
17 17 AA CG AG
18 18 AA CG AG
19 19 AA CG AG
20 20 AA CG AG

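Edit: Thank you!!! I was able to adapt both answers to my data, and @akrun's ran faster. I created a subset of the data with 100 rows and 100,000 columns, and got the following results: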

microbenchmark(
+ {
+ new <- ped %>%
+ gather(key = V, value = value, -id) %>%
+ mutate(V = str_extract(V, "\\d+") %>% as.numeric()) %>%
+ group_by(id) %>%
+ mutate(pair = ceiling(V / 2)) %>%
+ group_by(id, pair) %>%
+ summarise(combined = paste(value, collapse = "")) %>%
+ mutate(V_combo = paste0("V", pair)) %>%
+ select(-pair) %>%
+ spread(key = V_combo, value = combined) %>%
+ select(id, paste0("V", seq(1, ncol(.)-1, 1)))
+ },
+ {
+ out <- ped[1]
+ new_cols <- paste0("V", seq(1, (ncol(ped)-1)/2))
+
+ out[new_cols] <- lapply(seq(2, ncol(ped)-1, 2),
+ function(i) do.call(paste0, ped[i:(i+1)]))
+ },
+ times = 1
+ )

Unit: seconds
    expr       min        lq      mean    median        uq       max neval
 camille 250.30901 250.30901 250.30901 250.30901 250.30901 250.30901     1
   akrun  23.52434  23.52434  23.52434  23.52434  23.52434  23.52434     1
>
> new <- data.frame(new, stringsAsFactors = FALSE)
> identical(new, out)
[1] TRUE

Best answer

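We can loop over the columns, subset each column together with its adjacent column, `paste` them with `do.call`, and assign the results as new columns of a new dataset: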

out <- df[1]
out[paste0("var", 1:3)] <- lapply(seq(2, ncol(df), 2),
function(i) do.call(paste0, df[i:(i+1)]))
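
The snippet above hard-codes the three output names. For a wide data set the names can be derived from the number of pairs instead. The following is a minimal sketch along the same lines, not part of the original answer: the helper name `combine_pairs` and its arguments `n_id` and `prefix` are hypothetical, and it assumes the leading `n_id` columns are identifiers and that the remaining columns come in adjacent pairs.

```r
# Hypothetical helper (not from the original answer): paste each adjacent
# pair of columns together, keeping the first `n_id` columns unchanged.
combine_pairs <- function(df, n_id = 1, prefix = "var") {
  # Positions of the columns to be paired (everything after the id columns)
  value_cols <- seq_len(ncol(df) - n_id) + n_id
  # Assumption: the value columns come in complete pairs
  stopifnot(length(value_cols) %% 2 == 0)
  # Position of the first column of each pair (every other value column)
  first <- value_cols[c(TRUE, FALSE)]

  out <- df[seq_len(n_id)]
  out[paste0(prefix, seq_along(first))] <- lapply(
    first,
    function(i) paste0(df[[i]], df[[i + 1]])
  )
  out
}

# Small example in the shape of the question's data
df_small <- data.frame(id = 1:4,
                       var1 = rep("A", 4),
                       var2 = c("T", "T", "A", "A"),
                       var3 = rep("C", 4),
                       var4 = c("C", "C", "G", "G"),
                       stringsAsFactors = FALSE)
res <- combine_pairs(df_small)
```

Here `paste0(df[[i]], df[[i + 1]])` pastes two column vectors element by element, which is what `do.call(paste0, df[i:(i+1)])` does in the answer. The `stopifnot` check is an added safeguard so that a stray unpaired column is reported instead of silently producing a wrong pair.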

Regarding r - concatenating pairs of columns across an entire data frame, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53195260/
