gpt4 book ai didi

python - 搜索基因组中不匹配的序列

转载 作者:行者123 更新时间:2023-12-01 05:45:11 25 4
gpt4 key购买 nike

我有一个 fastq 文件,其中包含超过 1 亿个读数和长度为 10000 的基因组序列

我想从 fastq 文件中取出序列并在基因组序列中搜索,允许 3 个不匹配

我尝试使用 awk 以这种方式从 fastq 文件中获取序列:

1.fq(几行)

@DH1DQQN1:269:C1UKCACXX:1:1101:1207:2171 1:N:0:TTAGGC NATCCCCATCCTCTGCTTGCTTTTCGGGATATGTTGTAGGATTCTCAGC

+

1=ADBDDHD;F>GF@FFEFGGGIAEEI?D9DDHHIGAAF:BG39?BB

@DH1DQQN1:269:C1UKCACXX:1:1101:1095:2217 1:N:0:TTAGGC TAGGATTTCAAATGGGTCGAGGTGGTCCGTTAGGTATAGGGGCAACAGG

+

??AABDD4C:DDDI+C:C3@:C):1?*):?)?################

$ awk 'NR%4==2' 1.fq

NATCCCCATCCTCTGCTTGCTTTTCGGGATATGTTGTAGGATTCTCAGC TAGGATTTCAAATGGGTCGAGGTGGTCCGTTAGGTATAGGGGCAACAGG

我的文件中有所有序列,现在我想获取每一行序列并在基因组序列中搜索,允许 3 个不匹配,如果找到则打印序列

示例:

基因组序列文件:

GGGGAGGAATATGATTTACAGTTTATTTTTCAACTGTGCAAAATAACCTTAACTGCAGACGTTATGACATACATACATTCTATGAATTCCACTATTTTGGAGGACTGGAATTTTGGTCTACAACCTCCCCCAGGAGGCACACTAGAAGATACTTATAGGTTTGTAACCCAGGCAATTGCTTGTCAAAAACATACA

搜索序列文件:

GGGGAGGAATATGAT

GGGGAGGAATATGAA

GGGGAGGAATATGCC

TCAAAAACATAGG

TCAAAAACATGGG

输出文件:

GGGGAGGAATATGAT 0 # 0 mismatch exact sequence

GGGGAGGAATATGAA 1 # 1 mismatch

GGGGAGGAATATGCC 2 # 2 mismatch

TCAAAAACATAGG 2 # 2 mismatch

TCAAAAACATGGG 3 # 3 mismatch

最佳答案

类似的东西?

use 5.012;
use strict;
use warnings;
use String::Approx qw(aslice);
use File::Slurp;
use Data::Dumper;

my $genseq = "gseq.txt"; #the long sequence

$_ = read_file($genseq);

#read small patterns from stdin
while(my $patt = <>) {
chomp $patt;
my $len = length($patt);
my($index, $size, $distance) = aslice($patt, ["3D0S3", "minimal_distance"]);
say "$patt matched approx. at $index with mismatch $distance" if $distance <= 3;
}

为您输入生成:

GGGGAGGAATATGAT matched approx. at 0 with mismatch 0
GGGGAGGAATATGAA matched approx. at 0 with mismatch 1
GGGGAGGAATATGCC matched approx. at 0 with mismatch 2
TCAAAAACATAGG matched approx. at 179 with mismatch 2
TCAAAAACATGGG matched approx. at 179 with mismatch 3

老实说,不知道如何使用 10000 个字符长的 genseq...

关于python - 搜索基因组中不匹配的序列,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/16343985/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com