
r - Improving and speeding up code to determine a large number of combinations

Reposted · Author: 行者123 · Updated: 2023-12-04 08:04:57

OK, I am going to describe the real data rather than a reprex, because I don't think a reprex would make this any easier; but to make sense of it all, this question involves a tiny bit of Biochemistry 101.
I work with DNA mutagenesis libraries in which certain DNA positions are randomized, which results in proteins that also have randomized amino-acid positions. DNA consists of nucleotides (there are four: G, A, T, C), and an amino acid (there are 20, represented by single letters) is encoded by a group of three nucleotides (a "codon").
I have two vectors describing the codon-to-amino-acid relationship:

cods <- c("GCT","GCC","GCA","GCG","CGT","CGC","CGA","CGG","AGA","AGG","AAT","AAC","GAT","GAC","TGT","TGC","CAA","CAG","GAA","GAG","GGT","GGC","GGA","GGG","CAT","CAC","TAA","TAG","TGA","ATT","ATC","ATA","CTT","CTC","CTA","CTG","TTA","TTG","AAA","AAG","ATG","TTT","TTC","CCT","CCC","CCA","CCG","TCT","TCC","TCA","TCG","AGT","AGC","ACT","ACC","ACA","ACG","TGG","TAT","TAC","GTT","GTC","GTA","GTG")
aas <- c("A","A","A","A","R","R","R","R","R","R","N","N","D","D","C","C","Q","Q","E","E","G","G","G","G","H","H","*","*","*","I","I","I","L","L","L","L","L","L","K","K","M","F","F","P","P","P","P","S","S","S","S","S","S","T","T","T","T","W","Y","Y","V","V","V","V")
A randomized position allows certain nucleotides at each position of the codon and is denoted by specific (here irrelevant) letters. For example, the nucleotide codon "NYS" allows all four nucleotides (G, A, T, C) at position one, but only C/T at position two and C/G at position three. I now create all possible triplets for NYS and for another library like this:
NYS <- expand.grid(list(c("A","C","G", "T"), c("C","T"), c("C","G")))
VRM <- expand.grid(list(c("A","C","G"), c("A","G"), c("A","C")))
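For reference, the position letters are standard IUPAC degenerate-base codes (N = ACGT, Y = CT, S = CG, V = ACG, R = AG, M = AC), so the `expand.grid()` input can also be built directly from the library name. The `iupac` lookup and `degenerate_grid()` helper below are my own sketch, not part of the original question:

```r
# IUPAC degenerate-base codes (standard nomenclature)
iupac <- list(A = "A", C = "C", G = "G", T = "T",
              R = c("A", "G"), Y = c("C", "T"),
              S = c("C", "G"), W = c("A", "T"),
              K = c("G", "T"), M = c("A", "C"),
              V = c("A", "C", "G"), N = c("A", "C", "G", "T"))

# build the expand.grid() input straight from the codon name
degenerate_grid <- function(codon) {
  expand.grid(unname(iupac[strsplit(codon, "")[[1]]]))
}

NYS <- degenerate_grid("NYS")  # equivalent to the manual call above
```

`unname()` keeps the default `Var1`/`Var2`/`Var3` column names, matching the manual version.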
Then I determine the corresponding amino acid for all these combinations:
# packages used below (the %>% pipe and select() come from dplyr,
# str_split() further down comes from stringr)
library(dplyr)
library(stringr)

# make codon triplet strings
NYS[,"cods"] <- paste(NYS$Var1, NYS$Var2, NYS$Var3, sep='')
VRM[,"cods"] <- paste(VRM$Var1, VRM$Var2, VRM$Var3, sep='')

#look them up in the aa vector and add a column
NYS[,"aas"] <- aas[match(NYS$cods, cods)]
VRM[,"aas"] <- aas[match(VRM$cods, cods)]

#get only the relevant columns
VRM <- VRM %>% select("aas", "cods")
NYS <- NYS %>% select("aas", "cods")
NYS$cods <- "NYS"
VRM$cods <- "VRM"
Now the tricky part: given an input vector describing the number and type of randomized codons, e.g. library_cods <- c("NYS", "VRM", "NYS", "NYS", "VRM", "VRM"), I now want to compute all amino-acid sequences that can occur in these libraries. Then I want to create a data frame with all unique sequences and their occurrence counts. I do it like this:
# make a string that contains n sort()s of the columns as determined by library_cods, evaluate, expand
all_combos <- expand.grid(lapply(str_split(paste(gsub("(...)", "sort(\\1\\$aas)", library_cods), collapse = ","), ",", simplify = T), function(x) eval(parse(text=x))))

# get the string from the rows
unique_seqs <- apply(all_combos, 1, function(x) paste(x, collapse = ""))

#rle() to count
unique_seqs <- data.frame(unclass(rle(sort(unique_seqs))))

#sort by count
unique_seqs <- unique_seqs[order(unique_seqs$lengths, decreasing = T),]
This all works; the problem, however, is that it is really slow. So my main question is: how can I make it faster?
On my system, executing rle() and the two lines after it takes 70 seconds. Compare that to sort -n file | uniq -c | sort -n in bash, which takes ~22 seconds on the same data. While that is better, it is still slow, so I figured maybe I should do some math instead of enumerating and counting ^^
As a side question: I also find my code rather clumsy (particularly the all_combos <- line; I know evaluating strings as code is really bad). If anyone would like to point out how to make my code more efficient, I would appreciate that too.
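On the "do some math" idea: the positions are independent, so the number of DNA combinations yielding a given amino-acid sequence is simply the product of the per-position codon counts, and only the unique amino acids per position ever need to be enumerated. A sketch of that (my own, reusing the NYS/VRM data frames and library_cods defined above):

```r
# how many codons encode each amino acid at an NYS / VRM position
tabs <- lapply(library_cods, function(lib) table(get(lib)$aas))

# unique amino acids per position -- far fewer rows than all codon combos
combos <- expand.grid(lapply(tabs, names), stringsAsFactors = FALSE)

# count of each sequence = product of its per-position multiplicities
counts <- Reduce(`*`, expand.grid(lapply(tabs, as.integer)))

unique_seqs <- data.frame(seq = Reduce(paste0, combos), count = counts)
unique_seqs <- unique_seqs[order(unique_seqs$count, decreasing = TRUE), ]
```

The two expand.grid() calls enumerate the same rows in the same order, so the names and counts stay aligned; the counts sum to the total number of DNA combinations (16^3 * 12^3 for this library_cods).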

Best answer

Some steps of your code can be made more concise. For the triplets we only need vectors, which we retrieve later with mget().
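(As a quick base-R refresher: mget() looks up several objects by name in a single call and returns them as a named list.)

```r
x <- 1; y <- 2
mget(c("x", "y"))
#> $x
#> [1] 1
#>
#> $y
#> [1] 2
```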

NYS <- expand.grid(list(c("A", "C", "G", "T"), c("C", "T"), c("C", "G")))
VRM <- expand.grid(list(c("A", "C", "G"), c("A", "G"), c("A", "C")))

## triplets
NYS <- aas[match(Reduce(paste0, NYS), cods)]
VRM <- aas[match(Reduce(paste0, VRM), cods)]

## input vector
library_cods <- c("NYS", "VRM", "NYS", "NYS", "VRM", "VRM")

# columns as determined by library_cods, evaluate, expand
all_combos <- expand.grid(mget(library_cods))

# get the string from the rows
unique_seqs <- Reduce(paste0, all_combos)

# sort by count
unique_seqs <- data.frame(sort(table(unique_seqs), decreasing=T))
Result:
head(unique_seqs)
# unique_seqs Freq
# 1 LRLLRR 729
# 2 ARLLRR 486
# 3 LGLLRR 486
# 4 LRALRR 486
# 5 LRLARR 486
# 6 LRLLGR 486
This runs in about 16 seconds on my system, which is reasonable.
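Since the question already attaches dplyr for select(), the final tally-and-sort step could also be written with dplyr::count(), which groups, counts, and sorts in one call (a hedged alternative, not part of the original answer; it assumes the all_combos data frame from the code above):

```r
library(dplyr)

unique_seqs <- data.frame(unique_seqs = Reduce(paste0, all_combos)) %>%
  count(unique_seqs, sort = TRUE, name = "Freq")
```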

On r - Improving and speeding up code to determine a large number of combinations, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66262563/
