gpt4 book ai didi

r - 列出序列的所有突变(DNA)

转载 作者:行者123 更新时间:2023-12-04 10:49:21 27 4
gpt4 key购买 nike

我有一个DNA序列,我想在DNA序列读取列表中找到它的所有实例,或其任何可能的突变。我正在使用grepl来执行此操作,因为在我使用它的实例中,它比matchPattern快。我使用parLapply将我的突变载体喂入grepl函数。但是我感兴趣的是,提供了一种自动生成序列突变载体的简便方法。最初,我键入每个突变,但留给人为错误的余地,如果延长序列,则需要键入更多的突变。另外,我当前的代码仅允许1个突变,某些序列应允许比其他序列更多的突变。我不是在寻找有人为我编写循环,而只是给我建议考虑任何字符串。

现在,我有一种产生突变的半自动方法。现在,无需我全部输入即可生成载体,但仅适用于8个核苷酸长的序列。必须有一种更好的方法来为任何长度的任何核苷酸序列生成载体。

这是我的代码:

#My sequence of interest
seq1 <- "GGCGACTG"
lenseq1 <- nchar(seq1)

#A vector of the length of the sequence I wish to create all mutations of
mutsinseq1 <- rep(seq1, 5*lenseq1+4*(lenseq1-1)+1)

#The possible substitutions, insertions, and deletions to the sequence of interest
possnuc <- c("A","T","C","G","")
lenpossnuc <- length(possnuc)

#changing all elements of the vector except for the first
#the first 8 if statements are nucleotide substitutions or deletions
#the other if statements allow for inserts between nucleotides
for(i in 2:length(mutsinseq1)){
if(i<7){
mutsinseq1[i] <- paste(possnuc[i-1],substr(seq1,2,lenseq1),sep = "")
} else if(i<12){
mutsinseq1[i] <- paste(substr(seq1,1,1),possnuc[i-6],substr(seq1,3,lenseq1),sep = "")
} else if(i<17){
mutsinseq1[i] <- paste(substr(seq1,1,2),possnuc[i-11],substr(seq1,4,lenseq1),sep = "")
} else if(i<22){
mutsinseq1[i] <- paste(substr(seq1,1,3),possnuc[i-16],substr(seq1,5,lenseq1),sep = "")
} else if(i<27){
mutsinseq1[i] <- paste(substr(seq1,1,4),possnuc[i-21],substr(seq1,6,lenseq1),sep = "")
} else if(i<32){
mutsinseq1[i] <- paste(substr(seq1,1,5),possnuc[i-26],substr(seq1,7,lenseq1),sep = "")
} else if(i<37){
mutsinseq1[i] <- paste(substr(seq1,1,6),possnuc[i-31],substr(seq1,8,lenseq1),sep = "")
} else if(i<42){
mutsinseq1[i] <- paste(substr(seq1,1,7),possnuc[i-36],sep = "")
} else if(i<46){
mutsinseq1[i] <- paste(substr(seq1,1,1),possnuc[i-41],substr(seq1,2,lenseq1),sep = "")
} else if(i<50){
mutsinseq1[i] <- paste(substr(seq1,1,2),possnuc[i-45],substr(seq1,3,lenseq1),sep = "")
} else if(i<54){
mutsinseq1[i] <- paste(substr(seq1,1,3),possnuc[i-49],substr(seq1,4,lenseq1),sep = "")
} else if(i<58){
mutsinseq1[i] <- paste(substr(seq1,1,4),possnuc[i-53],substr(seq1,5,lenseq1),sep = "")
} else if(i<62){
mutsinseq1[i] <- paste(substr(seq1,1,5),possnuc[i-57],substr(seq1,6,lenseq1),sep = "")
} else if(i<66){
mutsinseq1[i] <- paste(substr(seq1,1,6),possnuc[i-61],substr(seq1,7,lenseq1),sep = "")
} else{
mutsinseq1[i] <- paste(substr(seq1,1,7),possnuc[i-65],substr(seq1,8,lenseq1),sep = "")
}
}

#getting rid of duplicate mutations
mutsinseq1 <- mutsinseq1[-which(duplicated(mutsinseq1))]


以下是我希望产生的内容(由当前代码产生):

mutsinseq1
[1] "GGCGACTG" "AGCGACTG" "TGCGACTG" "CGCGACTG" "GCGACTG" "GACGACTG" "GTCGACTG" "GCCGACTG" "GGAGACTG" "GGTGACTG" "GGGGACTG" "GGGACTG" "GGCAACTG"
[14] "GGCTACTG" "GGCCACTG" "GGCACTG" "GGCGTCTG" "GGCGCCTG" "GGCGGCTG" "GGCGCTG" "GGCGAATG" "GGCGATTG" "GGCGAGTG" "GGCGATG" "GGCGACAG" "GGCGACCG"
[27] "GGCGACGG" "GGCGACG" "GGCGACTA" "GGCGACTT" "GGCGACTC" "GGCGACT" "GAGCGACTG" "GTGCGACTG" "GCGCGACTG" "GGGCGACTG" "GGACGACTG" "GGTCGACTG" "GGCCGACTG"
[40] "GGCAGACTG" "GGCTGACTG" "GGCGGACTG" "GGCGAACTG" "GGCGTACTG" "GGCGCACTG" "GGCGATCTG" "GGCGACCTG" "GGCGAGCTG" "GGCGACATG" "GGCGACTTG" "GGCGACGTG" "GGCGACTAG"
[53] "GGCGACTCG" "GGCGACTGG"


我该如何解决这个问题?

最佳答案

在其他语言中,您可能会使用一系列嵌套循环来执行此操作,但是在R中,有一些不错的组合函数。以下是执行所需功能的总体功能:

library(stringr)
library(purrr)
library(dplyr)

mutate_sequence <- function(string, num = 1, nucleotides = c("A","T","C","G","_")) {
l_str <- str_length(string)

choices <- cross(list(
cols = combn(seq_len(l_str), num, simplify = F),
muts = cross(rerun(num, nucleotides)) %>% map(unlist)
))

choice_matrix <-
map_dfr(choices, as_tibble, .id = "rows") %>%
mutate(rows = as.numeric(rows))

seq_matrix <- str_split(rep(string, max(choice_matrix$rows)), "", simplify = T)

seq_matrix[as.matrix(choice_matrix[,1:2])] <- str_to_lower(choice_matrix$muts)
apply(seq_matrix, 1, paste, collapse = "")
}


我使用了一些程序包使事情变得容易一些,但是都可以翻译成R底。

这是示例输出:

mutate_sequence("ATCG", num = 2)



  [1] "aaCG" "aTaG" "aTCa" "AaaG" "AaCa" "ATaa" "taCG" "tTaG" "tTCa" "AtaG" "AtCa" "ATta" "caCG" "cTaG"
[15] "cTCa" "AcaG" "AcCa" "ATca" "gaCG" "gTaG" "gTCa" "AgaG" "AgCa" "ATga" "_aCG" "_TaG" "_TCa" "A_aG"
[29] "A_Ca" "AT_a" "atCG" "aTtG" "aTCt" "AatG" "AaCt" "ATat" "ttCG" "tTtG" "tTCt" "AttG" "AtCt" "ATtt"
[43] "ctCG" "cTtG" "cTCt" "ActG" "AcCt" "ATct" "gtCG" "gTtG" "gTCt" "AgtG" "AgCt" "ATgt" "_tCG" "_TtG"
[57] "_TCt" "A_tG" "A_Ct" "AT_t" "acCG" "aTcG" "aTCc" "AacG" "AaCc" "ATac" "tcCG" "tTcG" "tTCc" "AtcG"
[71] "AtCc" "ATtc" "ccCG" "cTcG" "cTCc" "AccG" "AcCc" "ATcc" "gcCG" "gTcG" "gTCc" "AgcG" "AgCc" "ATgc"
[85] "_cCG" "_TcG" "_TCc" "A_cG" "A_Cc" "AT_c" "agCG" "aTgG" "aTCg" "AagG" "AaCg" "ATag" "tgCG" "tTgG"
[99] "tTCg" "AtgG" "AtCg" "ATtg" "cgCG" "cTgG" "cTCg" "AcgG" "AcCg" "ATcg" "ggCG" "gTgG" "gTCg" "AggG"
[113] "AgCg" "ATgg" "_gCG" "_TgG" "_TCg" "A_gG" "A_Cg" "AT_g" "a_CG" "aT_G" "aTC_" "Aa_G" "AaC_" "ATa_"
[127] "t_CG" "tT_G" "tTC_" "At_G" "AtC_" "ATt_" "c_CG" "cT_G" "cTC_" "Ac_G" "AcC_" "ATc_" "g_CG" "gT_G"
[141] "gTC_" "Ag_G" "AgC_" "ATg_" "__CG" "_T_G" "_TC_" "A__G" "A_C_" "AT__"



我将这些突变设为小写或“ _”以使其在位置上很明显,但是您可以轻松地对其进行更改以使其恢复为“正常”序列。

因此,每一行都会做一些事情:

l_str <- str_length(string)


获取字符串中的字符数。

combn(seq_len(l_str), num, simplify = F)


1)这是沿着序列(索引)的位置的所有可能组合,一次获取 num的突变数。

rerun(num, nucleotides)


2)这会重复您的核苷酸载体 num次,并使其成为列表。 cross(rerun(num, nucleotides))然后将列表中的每个组合作为列表提供,因此您正在使用核苷酸的所有可能组合,并带有重复序列。 cross(rerun(num, nucleotides)) %>% map(unlist)将列表的最深层折叠为向量。

因此,最后两个区块为您提供了各种可能的位置选择,然后为您提供了每种可能的替换组合。然后,我们需要将所有可能的组合成对!

  choices <- cross(list(
cols = combn(seq_len(l_str), num, simplify = F),
muts = cross(rerun(num, nucleotides)) %>% map(unlist)
))


对于上面的输出,这意味着:


[[1]]
[[1]]$`cols`
[1] 1 2

[[1]]$muts
[1] "A" "A"


[[2]]
[[2]]$`cols`
[1] 1 2

[[2]]$muts
[1] "T" "A"
...



因此,首先对位置1/2给出A / A,T / A,G / A,C / A,_ / A等。然后再对位置1/3再组合每种组合,然后对位置1/4再组合2/3,然后2/4,然后3/4。

因此,现在您有了这个长长的清单,让我们变得更好。首先,我们使用 colsmuts将每个元素放入一个数据帧中,然后将它们全部绑定到一个单独的元素中,每个元素的标识符称为 rows

map_dfr(choices, as_tibble, .id = "rows")



# A tibble: 50 x 3
rows cols muts
<chr> <int> <chr>
1 1 1 A
2 1 2 A
3 2 1 T
4 2 2 A
5 3 1 C
6 3 2 A
7 4 1 G
8 4 2 A
9 5 1 _
10 5 2 A
# ... with 40 more rows



这给了我们很长的数据帧。每个 rows是一个输出字符串,而 cols告诉我们该字符串中的哪个位置将被替换。 muts是将在这些位置使用的字符。为了稍后进行子设置,我们将使用 rowsmutate(...)转换为数字。

seq_matrix <- str_split(rep(string, max(choice_matrix$rows)), "", simplify = T)


现在,我们采用您的原始字符串,并重复 choice_matrix告诉我们,我们将使用突变的序列,并重复多次。然后,我们采用该向量,并沿字符边界分割每个向量:


      [,1] [,2] [,3] [,4]
[1,] "A" "T" "C" "G"
[2,] "A" "T" "C" "G"
[3,] "A" "T" "C" "G"
[4,] "A" "T" "C" "G"
[5,] "A" "T" "C" "G"
[6,] "A" "T" "C" "G"
...



现在我们有了一个大矩阵,R在这些大矩阵上运算很快。我们可以使用矩阵运算来完成所有其他步骤,但是这似乎比使用此列表组合功能还要多。

seq_matrix[as.matrix(choice_matrix[,1:2])] <- str_to_lower(choice_matrix$muts)


这根据 rows中的 colschoice_matrix标识每个位置。然后将 muts中的适当值放入其中。您也可以在这里取出 str_to_lower使其保持小写。您将更改默认参数 nucleotides以使 "_"成为 ""


       [,1] [,2] [,3] [,4]
[1,] "a" "a" "C" "G"
[2,] "a" "T" "a" "G"
[3,] "a" "T" "C" "a"
[4,] "A" "a" "a" "G"
[5,] "A" "a" "C" "a"
[6,] "A" "T" "a" "a"
...



因此,第1行在位置1和2处获得“ A”和“ A”。然后第2行在位置1和3中获得“ A”和“ A”,依此类推。现在,我们只需要在每行中 apply(即 1中的 apply(..., 1, ...)确实有一个功能,可以将每一行合并为一个字符串。那将是 paste(..., collapse = "")

这将迅速为您带来巨大的输出。如果您对原始的8个核苷酸序列进行3个突变,则输出为7000。4个突变为43750。每次变慢得多,大约需要5s才能在我的桌面上运行这4个突变。您可以预先计算输出长度,即 choose(l_str, num) * length(nucleotides)^num



再次更新:

要处理插入和删除操作,我们只需要字符矩阵为每个可能的插入都有一个插槽。这是该版本:

mutate_sequence <- function(string, num = 1, nucleotides = c("A","T","C","G","")) {
if (num < 1) {return(string)}

l_str <- str_length(string)
l_pos <- (num + 1)*(l_str - 1) + 1

choices <- cross(list(
cols = combn(seq_len(l_pos), num, simplify = F),
muts = cross(rerun(num, nucleotides)) %>% map(unlist)
))

choice_matrix <-
map_dfr(choices, as_data_frame, .id = "rows") %>%
mutate(rows = as.numeric(rows))

blanks <- character(l_pos)
orig_pos <- (seq_len(l_str) - 1) * (num+1) + 1
blanks[orig_pos] <- str_split(string, "", simplify = T)

seq_matrix <- matrix(
rep(blanks, max(choice_matrix$rows)),
ncol = l_pos, byrow = T
)

seq_matrix[as.matrix(choice_matrix[,1:2])] <- str_to_lower(choice_matrix$muts)
sequences <- apply(seq_matrix, 1, paste, collapse = "")
sequences[!duplicated(str_to_upper(sequences))]
}


这与上面函数的版本基本相同,但是首先创建一个空白向量,每个插入点都有足够的斑点。对于每个原始核苷酸,除最后一个核苷酸外,还需要在其后插入一个额外的斑点。算出到 l_pos <- (num + 1)*(l_str - 1) + 1位置。 character(l_pos)为您提供空白,然后在 (seq_len(l_str) - 1) * (num+1) + 1处用原始核苷酸填充空白。

例如,具有两个突变的 ATCG变为 "A" "" "" "T" "" "" "C" "" "" "G"。该功能的其余部分工作相同,只是将每个可能的核苷酸(或缺失)置于每个可能的位置。

将所有内容重新组合在一起之前的输出如下所示:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "a" "a" "" "T" "" "" "C" "" "" "G"
[2,] "a" "" "a" "T" "" "" "C" "" "" "G"
[3,] "a" "" "" "a" "" "" "C" "" "" "G"
[4,] "a" "" "" "T" "a" "" "C" "" "" "G"
[5,] "a" "" "" "T" "" "a" "C" "" "" "G"
...


然后,在每行 paste之后,我们可以使用 paste检查重复项并将其排除。我们也可以摆脱小写字母的变异,而使用 duplicated代替。现在输出比以前短得多,即278的前55个:


  [1] "aaTCG"  "aaCG"   "aTaCG"  "aTaG"   "aTCaG"  "aTCa"   "AaaTCG" "AaaCG"  "AaTaCG" "AaTaG"  "AaTCaG"
[12] "AaTCa" "AaaG" "AaCaG" "AaCa" "ATaaCG" "ATaaG" "ATaCaG" "ATaCa" "ATaa" "ATCaaG" "ATCaa"
[23] "taTCG" "taCG" "tTaCG" "tTaG" "tTCaG" "tTCa" "AtaTCG" "AtTaCG" "AtTaG" "AtTCaG" "AtTCa"
[34] "ATta" "ATCtaG" "ATCta" "caTCG" "caCG" "cTaCG" "cTaG" "cTCaG" "cTCa" "AcaTCG" "AcaCG"
[45] "AcTaCG" "AcTaG" "AcTCaG" "AcTCa" "AcaG" "AcCaG" "AcCa" "ATcaCG" "ATcCaG" "ATcCa" "gaTCG"
...

关于r - 列出序列的所有突变(DNA),我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/57226087/

27 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com