gpt4 book ai didi

algorithm - 如何匹配DNA序列模式

转载 作者:塔克拉玛干 更新时间:2023-11-03 02:21:28 25 4
gpt4 key购买 nike

我在寻找解决此问题的方法时遇到了麻烦。

输入输出顺序如下:

 **input1 :** aaagctgctagag 
**output1 :** a3gct2ag2

**input2 :** aaaaaaagctaagctaag
**output2 :** a6agcta2ag

输入 nsequence 可以是 10^6 个字符,并且将考虑最大的连续模式。

例如,对于输入 2“agctaagcta”,输出将不是“agcta2gcta”,而是“agcta2”。

感谢任何帮助。

最佳答案

算法说明:

  • 有一个序列 S,其符号为 s(1)、s(2)、…、s(N)。
  • 设 B(i) 为包含元素 s(1)、s(2)、…、s(i) 的最佳压缩序列。
  • 因此,例如,B(3) 将是 s(1)、s(2)、s(3) 的最佳压缩序列。
  • 我们想知道的是B(N)。

要找到它,我们将通过归纳法进行。我们要计算B(i+1),已知B(i), B(i-1), B(i-2), …, B(1), B(0),其中B(0)为空序列,并且 B(1) = s(1)。同时,这构成了解最优的证明。 ;)

为了计算 B(i+1),我们将在候选序列中选择最佳序列:

  1. 最后一个 block 有一个元素的候选序列:

    B(i)s(i+1)1B(i-1)s(i+1)2 ;仅当 s(i) = s(i+1)B(i-2)s(i+1)3 ;仅当 s(i-1) = s(i) 且 s(i) = s(i+1)……B(1)s(i+1)[i-1] ;仅当 s(2)=s(3) 且 s(3)=s(4) 且……且 s(i) = s(i+1)B(0)s(i+1)i = s(i+1)i ;仅当 s(1)=s(2) and s(2)=s(3) and … and s(i) = s(i+1)

  2. 最后一个 block 有 2 个元素的候选序列:

    B(i-1)s(i)s(i+1)1B(i-3)s(i)s(i+1)2 ;仅当 s(i-2)s(i-1)=s(i)s(i+1)B(i-5)s(i)s(i+1)3 ;仅当 s(i-4)s(i-3)=s(i-2)s(i-1) 和 s(i-2)s(i-1)=s(i)s(i+1) )……

  3. 最后一个 block 有 3 个元素的候选序列:

  4. 最后一个 block 有 4 个元素的候选序列:

  5. 最后一个 block 有 n+1 个元素的候选序列:

    s(1)s(2)s(3)…………s(i+1)

对于每一种可能性,算法在序列 block 不再重复时停止。就是这样。

算法在 psude-c 代码中将是这样的:

B(0) = “”
for (i=1; i<=N; i++) {
// Calculate all the candidates for B(i)
BestCandidate=null
for (j=1; j<=i; j++) {
Calculate all the candidates of length (i)

r=1;
do {
Candidadte = B([i-j]*r-1) s(i-j+1)…s(i-1)s(i) r
If ( (BestCandidate==null)
|| (Candidate is shorter that BestCandidate))
{
BestCandidate=Candidate.
}
r++;
} while ( ([i-j]*r <= i)
&&(s(i-j*r+1) s(i-j*r+2)…s(i-j*r+j) == s(i-j+1) s(i-j+2)…s(i-j+j))

}
B(i)=BestCandidate
}

希望这能对您有所帮助。

下面给出了执行所需任务的完整 C 程序。它在 O(n^2) 中运行。核心部分只有30行代码。

编辑 我对代码进行了一些重组,更改了变量的名称并添加了一些注释以提高可读性。

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>


// This struct represents a compressed segment like atg4, g3, agc1
struct Segment {
char *elements;
int nElements;
int count;
};
// As an example, for the segment agagt3 elements would be:
// {
// elements: "agagt",
// nElements: 5,
// count: 3
// }

struct Sequence {
struct Segment lastSegment;
struct Sequence *prev; // Points to a sequence without the last segment or NULL if it is the first segment
int totalLen; // Total length of the compressed sequence.
};
// as an example, for the sequence agt32ta5, the representation will be:
// {
// lastSegment:{"ta" , 2 , 5},
// prev: @A,
// totalLen: 8
// }
// and A will be
// {
// lastSegment{ "agt", 3, 32},
// prev: NULL,
// totalLen: 5
// }


// This function converts a sequence to a string.
// You have to free the string after using it.
// The strategy is to construct the string from right to left.

char *sequence2string(struct Sequence *S) {
char *Res=malloc(S->totalLen + 1);
char *digits="0123456789";

int p= S->totalLen;
Res[p]=0;

while (S!=NULL) {
// first we insert the count of the last element.
// We do digit by digit starting with the units.
int C = S->lastSegment.count;
while (C) {
p--;
Res[p] = digits[ C % 10 ];
C /= 10;
}

p -= S->lastSegment.nElements;
strncpy(Res + p , S->lastSegment.elements, S->lastSegment.nElements);

S = S ->prev;
}


return Res;
}


// Compresses a dna sequence.
// Returns a string with the in sequence compressed.
// The returned string must be freed after using it.
char *dnaCompress(char *in) {
int i,j;

int N = strlen(in);; // Number of elements of a in sequence.



// B is an array of N+1 sequences where B(i) is the best compressed sequence sequence of the first i characters.
// What we want to return is B[N];
struct Sequence *B;
B = malloc((N+1) * sizeof (struct Sequence));

// We first do an initialization for i=0

B[0].lastSegment.elements="";
B[0].lastSegment.nElements=0;
B[0].lastSegment.count=0;
B[0].prev = NULL;
B[0].totalLen=0;

// and set totalLen of all the sequences to a very HIGH VALUE in this case N*2 will be enougth, We will try different sequences and keep the minimum one.
for (i=1; i<=N; i++) B[i].totalLen = INT_MAX; // A very high value

for (i=1; i<=N; i++) {

// at this point we want to calculate B[i] and we know B[i-1], B[i-2], .... ,B[0]
for (j=1; j<=i; j++) {

// Here we will check all the candidates where the last segment has j elements

int r=1; // number of times the last segment is repeated
int rNDigits=1; // Number of digits of r
int rNDigitsBound=10; // We will increment r, so this value is when r will have an extra digit.
// when r = 0,1,...,9 => rNDigitsBound = 10
// when r = 10,11,...,99 => rNDigitsBound = 100
// when r = 100,101,.,999 => rNDigitsBound = 1000 and so on.

do {

// Here we analitze a candidate B(i).
// where the las segment has j elements repeated r times.

int CandidateLen = B[i-j*r].totalLen + j + rNDigits;
if (CandidateLen < B[i].totalLen) {
B[i].lastSegment.elements = in + i - j*r;
B[i].lastSegment.nElements = j;
B[i].lastSegment.count = r;
B[i].prev = &(B[i-j*r]);
B[i].totalLen = CandidateLen;
}

r++;
if (r == rNDigitsBound ) {
rNDigits++;
rNDigitsBound *= 10;
}
} while ( (i - j*r >= 0)
&& (strncmp(in + i -j, in + i - j*r, j)==0));

}
}

char *Res=sequence2string(&(B[N]));
free(B);

return Res;
}

int main(int argc, char** argv) {
char *compressedDNA=dnaCompress(argv[1]);
puts(compressedDNA);
free(compressedDNA);
return 0;
}

关于algorithm - 如何匹配DNA序列模式,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/16871113/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com