gpt4 book ai didi

r - 如何更改r中的异构双字母

转载 作者:行者123 更新时间:2023-12-02 14:12:01 25 4
gpt4 key购买 nike

我有一个数据框:

DF = read.table(text="S01   S02     S03    S04    S05   S06
TT CC TT CT TT 00
AC AA AC CC AA AA
CC TC CC TT CC 00
CC AC CC AC AA CC
GG 00 TG TT GG TG
GG GA GG GA GG GG", header=T, stringsAsFactors=F)

我想以更快的方式将所有异构值(双字母)更改为双“00”。预期结果:

S01   S02     S03    S04    S05   S06
TT CC TT 00 TT 00
00 AA 00 CC AA AA
CC 00 CC TT CC 00
CC 00 CC 00 AA CC
GG 00 00 TT GG 00
GG 00 GG 00 GG GG

感谢任何帮助!

最佳答案

我假设这是遗传数据。这使得构建所有异构碱基对变得容易,并使用正则表达式替换它们:

bases <-c("A","C","G","T")
b1 <- rep(bases, 4)
b2 <- rep(bases, each=4)
hetero<- paste0(b1[b1!=b2],b2[b2!=b1])

DF[] <- lapply(DF,gsub, pattern=paste0(hetero,collapse="|"),replacement="00")

或者

m <- as.matrix(DF)
m[m %in% hetero] <- "00"
res <- as.data.frame(m)

基准

因为基准测试很有趣,而且这个线程中有很多不同的解决方案。令人惊讶的结论:差异并不是很大,获胜者是 DavidH(紧随其后的康拉德)。

具有 1000 列和 1000 行的数据框的结果:

Unit: milliseconds
expr min lq mean median uq max neval cld
MrFlick 402.0281 477.4867 494.6892 484.5600 504.6442 592.0486 50 d
Heroka 227.1143 298.8655 333.7875 309.4572 375.5734 459.6164 50 c
Heroka2 696.2465 710.0094 733.5981 717.8195 775.4891 803.7156 50 e
DavidH 124.7802 127.9947 137.0511 130.3487 134.9696 210.5570 50 a
Konrad 144.0454 214.8844 231.9005 221.9659 291.3668 344.4238 50 b
Konrad2 699.5301 711.7724 750.1756 736.2112 787.4504 849.0606 50 e


#Data generated:

b1 <- rep(bases, 4)
b2 <- rep(bases, each=4)
all <- paste0(b1,b2)
largedat <- data.frame(matrix(sample(all,1000000,T),ncol=1000))

#benchmarking code

tests <- microbenchmark(
MrFlick = MrFlick(largedat),
Heroka = Heroka (largedat),
Heroka2= Heroka2(largedat),
DavidH=DavidH(largedat),
Konrad = Konrad(largedat),
Konrad2 = Konrad2(largedat),
times=50)
# Functions used:

MrFlick <- function(DF){
as.data.frame(gsub("^(.)(?!\\1).$","00", as.matrix(DF), perl=T))
}

Heroka <- function(DF){
bases <-c("A","C","G","T")
b1 <- rep(bases, 4)
b2 <- rep(bases, each=4)
hetero<- paste0(b1[b1!=b2],b2[b2!=b1])
m <- as.matrix(DF)
m[m %in% hetero] <- "00"
res <- as.data.frame(m)
res
}

Heroka2 <- function(DF){
DF[] <- lapply(DF,gsub, pattern=paste0(hetero,collapse="|"),replacement="00")
DF
}

DavidH <- function(DF){
ex <- expand.grid(c("A","T","C","G"),c("A","T","C","G"))
ex <- ex[ex[1]!=ex[2],]
het.combs <- apply(ex,1,function(i) {paste0(i[1],i[2])} )
map <- setNames( rep("00",length(het.combs)) , het.combs )
fac.df<- lapply(DF, as.factor)

fac.df <- lapply(fac.df, function(i){levels(i)[levels(i) %in% names(map)] <- map[levels(i)[levels(i) %in% names(map)]];i } )
DF <- as.data.frame(fac.df)
}

Konrad <- function(DF){
bases = c('A', 'C', 'G', 'T')
homozygous = apply(cbind(bases, bases), 1, paste, collapse = '')

DF = as.matrix(DF)
DF[! DF %in% homozygous] = '00'
DF
}

Konrad2 <-function(DF){
bases = c('A', 'C', 'G', 'T')
homozygous = apply(cbind(bases, bases), 1, paste, collapse = '')
DF = data.frame(lapply(DF, function (x) ifelse(x %in% homozygous, x, '00')))
}

关于r - 如何更改r中的异构双字母,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/35656403/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com