
r - How to split comma-separated values into new columns in R

Reposted. Author: 行者123. Updated: 2023-12-02 07:30:11

I have this data:

CHOM POS REF ALT
1 121 A AA,AT
2 254 GCGC GCGCG,AGCG
3 214 C T

I need to split the ALT column into:

CHOM POS REF ALT ALT1 ALT2 ...
1 121 A AA AT 0
2 254 GCGC GCGCG AGCG 0
3 214 C T 0 0

I tried the following, but it gives an error:

alt=x$ALT
strsplit(alt, ",")

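Note: ALT and REF take many different values, and a row has at most 4 comma-separated entries. Where a row has fewer entries than that, the remaining columns should be filled with 0 or NA.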

Best answer

New answer

I would write a function like the following to split the column:

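splitFun <- function(inVec, sep = ",", newName = "ALT", fill = NA) {
  # strsplit() needs character input (a factor column would fail)
  if (!is.character(inVec)) inVec <- as.character(inVec)
  X <- strsplit(inVec, sep, fixed = TRUE)
  # Number of pieces in each element
  cols <- vapply(X, length, 1L)
  # Pre-filled result matrix: one row per element, one column per piece
  M <- matrix(
    fill, nrow = length(inVec), ncol = max(cols),
    dimnames = list(NULL, make.unique(rep(newName, max(cols)), sep = "")))
  # Assign every piece to its (row, column) cell in one step
  M[cbind(rep(seq_along(X), cols), sequence(cols))] <-
    unlist(X, use.names = FALSE)
  M
}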
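
The key step in the function is the assignment through a two-column index matrix. The following is only a small illustrative sketch of that indexing idea on hand-written input (the objects `M`, `X`, `cols` and `idx` here are stand-ins built for the example, not part of the original answer):

```r
# A 2 x 3 character matrix pre-filled with NA, like M inside splitFun
M <- matrix(NA_character_, nrow = 2, ncol = 3)

# Pieces as strsplit() would return them: two values, then three values
X <- list(c("AA", "AT"), c("GCGCG", "AT", "AA"))
cols <- lengths(X)                      # 2 3

# One (row, column) pair per piece
idx <- cbind(rep(seq_along(X), cols),   # row index:    1 1 2 2 2
             sequence(cols))            # column index: 1 2 1 2 3

# Indexing a matrix with a two-column matrix addresses individual cells,
# so all pieces are written in a single vectorised assignment
M[idx] <- unlist(X, use.names = FALSE)
M
```

Cells that receive no piece (here row 1, column 3) simply keep the fill value.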

Usage is simple:

splitFun(mydf$ALT)  ## Modify default arguments accordingly
# ALT ALT1 ALT2
# [1,] "AA" "AT" NA
# [2,] "GCGCG" "AGCG" NA
# [3,] "GCGCG" "AT" "AA"
cbind(mydf, splitFun(mydf$ALT))
# CHOM POS REF ALT ALT ALT1 ALT2
# 1 1 121 A AA,AT AA AT <NA>
# 2 2 254 GCGC GCGCG,AGCG GCGCG AGCG <NA>
# 3 1 123 GCGC GCGCG,AT,AA GCGCG AT AA
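
Since the question asked for 0 rather than NA in the empty cells, here is a hedged sketch of how the `fill` argument might be used for that. It repeats the function so that it runs on its own, and rebuilds the question's three rows as a data frame named `q` (a name chosen only for this example). The fill is passed as the string "0" because the result matrix is character anyway.

```r
splitFun <- function(inVec, sep = ",", newName = "ALT", fill = NA) {
  if (!is.character(inVec)) inVec <- as.character(inVec)
  X <- strsplit(inVec, sep, fixed = TRUE)
  cols <- vapply(X, length, 1L)
  M <- matrix(
    fill, nrow = length(inVec), ncol = max(cols),
    dimnames = list(NULL, make.unique(rep(newName, max(cols)), sep = "")))
  M[cbind(rep(seq_along(X), cols), sequence(cols))] <-
    unlist(X, use.names = FALSE)
  M
}

# The rows from the question (assumed layout; `q` is a hypothetical name)
q <- data.frame(CHOM = 1:3, POS = c(121, 254, 214),
                REF = c("A", "GCGC", "C"),
                ALT = c("AA,AT", "GCGCG,AGCG", "T"),
                stringsAsFactors = FALSE)

# Replace the original ALT column by the split columns, padding with "0"
res <- cbind(q[c("CHOM", "POS", "REF")], splitFun(q$ALT, fill = "0"))
res
```

The number of new columns follows the longest row in the input, so with these three rows only ALT and ALT1 are created.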

It should also be quite fast. Here is a timing comparison with the "splitstackshape" approach (which can also handle unbalanced cases).

system.time(splitstackshape:::read.concat(
bigDf$ALT, sep=",", col.prefix="ALT"))
# user system elapsed
# 1.197 0.000 1.202
system.time(splitFun(bigDf$ALT))
# user system elapsed
# 0.069 0.000 0.068

The sample data used for the above is:

mydf <- data.frame(CHOM = c(1, 2, 1), POS = c(121, 254, 123), 
REF = c("A", "GCGC", "GCGC"),
ALT = c("AA,AT", "GCGCG,AGCG", "GCGCG,AT,AA"))
mydf
# CHOM POS REF ALT
# 1 1 121 A AA,AT
# 2 2 254 GCGC GCGCG,AGCG
# 3 1 123 GCGC GCGCG,AT,AA

bigDf <- do.call(rbind, replicate(10000, mydf, simplify = FALSE))

Old answer

You can try concat.split from my "splitstackshape" package:

library(splitstackshape)
concat.split(mydf, "ALT", ",") ## Add `drop = TRUE` to drop the original column
# CHOM POS REF ALT ALT_1 ALT_2
# 1 1 121 A AA,AT AA AT
# 2 2 254 GCGC GCGCG,AGCG GCGCG AGCG

There is also colsplit in the "reshape2" package:

library(reshape2)
colsplit(as.character(mydf$ALT), ",", c("ALT", "ALT1"))
# ALT ALT1
# 1 AA AT
# 2 GCGCG AGCG

You can use cbind to add the output to your original dataset.
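
As a rough base-R illustration of that cbind step, the sketch below splits the column without any package, pads short rows with NA via `length<-`, and binds the result to the other columns while leaving out the original ALT. This is an alternative written for illustration, not the code from either package; the names `parts`, `padded` and `out` and the `ALT_` prefix are assumptions.

```r
mydf <- data.frame(CHOM = c(1, 2, 1), POS = c(121, 254, 123),
                   REF = c("A", "GCGC", "GCGC"),
                   ALT = c("AA,AT", "GCGCG,AGCG", "GCGCG,AT,AA"),
                   stringsAsFactors = FALSE)

# Split each value on the literal comma
parts <- strsplit(as.character(mydf$ALT), ",", fixed = TRUE)
n <- max(lengths(parts))

# Extending a vector with `length<-` pads it with NA, so every row
# ends up with exactly n entries before the rows are stacked
padded <- do.call(rbind, lapply(parts, function(p) { length(p) <- n; p }))
colnames(padded) <- paste0("ALT_", seq_len(n))

# Bind the new columns to the data, dropping the unsplit ALT column
out <- cbind(mydf[setdiff(names(mydf), "ALT")], padded)
out
```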

Regarding "r - How to split comma-separated values into new columns in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/22747759/
