
regex - Finding all repeated 4-mers in a DNA sequence - Perl

Reposted. Author: 行者123. Updated: 2023-12-05 00:51:04

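Hello,

I am trying to write a program that reads a FASTA-format file containing multiple DNA sequences, identifies all repeated 4-mers in each sequence (that is, every 4-mer that occurs more than once), and prints each repeated 4-mer together with the header of the sequence in which it was found. A k-mer is simply a sequence of k nucleotides (e.g., "aaca", "gacg", and "tttt" are 4-mers).

Here is my code: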

use strict;
use warnings;

my $count = -1;
my $file = "sequences.fa";
my $seq = '';
my @header = ();
my @sequences = ();
my $line = '';
open (READ, $file) || die "Cannot open $file: $!.\n";

while ($line = <READ>){
    chomp $line;
    if ($line =~ /^>/){
        push @header, $line;
        $count++;
        unless ($seq eq ''){
            push @sequences, $seq;
            $seq = '';
        }
    } else {
        $seq .= $line;
    }
} push @sequences, $line;

for (my $i = 0; $i <= $#sequences+1; $i++){
    if ($sequences[$i] =~ /(....)(.)*\g{1}+/g){
        print $header[$i], "\n", $&, "\n";
    }
}

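I have two requests. First, I don't know how to design my regex pattern to get the desired output.
Second, and less important, I'm sure my code is inefficient, so if there is a way to shorten it, please let me know.

Thanks in advance!

Here is a sample of the FASTA file (note that there is an extra blank line between the sequences, which is not the case in the original FASTA file):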

>NC_001422.1 Enterobacteria phage phiX174 sensu lato, complete genome
GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTttttttCGGATATTTCTGATGAGTCGAAAAAT
CCCTTACTTGAGGATAtatataAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCT

>NC_001501.1 Enterobacteria phage phiX184 sensu lato, complete genome
AACGGCTGGTCAGTATTTAAGGTTAGTGCTGAGGTTGACTACATCTGTTTTTAGAGACCCAGACCTTTTA
TCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTA
TATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTgagagagaGGTTTTCTTCATTGCATTCAGATGGA
TCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGC
CTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTG

>NC_001622.5 Enterobacteria phage phiX199 sensu lato, complete genome
TTCGCTGAATCAGGTTATTAAAGAGTTGCCGAGATATTTATGTTGGTTTCATGCGGATTGGTCGTTTAAA
TTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATAATGACCAAATCAAAGAACTCGTGATTAT
CTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGG
TTGACGCCGGATTTGAGAATCAAAAATGTGAGAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGA
GATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGAC
CAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTA
TGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCA
AACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGAC
TTAGATGAGTGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAG

Best answer

I would probably approach your problem more like this:

#!/usr/bin/env perl

use strict;
use warnings;

use Data::Dumper;

# Set paragraph mode: each read returns one blank-line-separated record.
local $/ = '';

# Read from STDIN or a file specified on the command line,
# e.g. cat filename_here | myscript.pl
# or myscript.pl filename_here
while ( <> ) {
    # Capture the header line, then remove it from the data block.
    # /m lets $ match at the end of the header line, not only at the
    # end of the whole record.
    my ($header) = m/>(.*)/;
    s/>.*$//m;

    # Remove linefeeds and surrounding whitespace.
    s/\s*\n\s*//g;
    # Use a lookahead pattern, so the data isn't 'consumed' by the regex.
    my @sequences = m/(?=([atcg]{4}))/gi;

    # Increment a count for each sequence found.
    my %count_of;
    $count_of{$_}++ for @sequences;

    # Print output. (Modify according to specific needs.)
    print $header, "\n";

    print "Found sequences:\n";
    print Dumper \@sequences;
    print "Count:\n";
    print Dumper \%count_of;

    # Note: ordered, but includes duplicates.
    # You could just use keys %count_of, but that would be unordered.
    foreach my $sequence ( grep { $count_of{$_} > 1 } @sequences ) {
        print $sequence, " => ", $count_of{$sequence}, "\n";
    }
    print "\n";
}

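We iterate record by record, capture and remove the header line, and then join the remaining lines into a single string. We then capture every (overlapping) run of 4 nucleotides and count each one.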
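To see why the lookahead yields overlapping matches, here is a minimal sketch on a short made-up string. The `kmers` helper and the `GATTACA` input are illustrative assumptions, not part of the answer's script.

```perl
use strict;
use warnings;

# Return every overlapping k-mer of $seq, in order of appearance.
# kmers() is a hypothetical helper used only for this illustration.
sub kmers {
    my ($seq, $k) = @_;
    # The lookahead (?=...) matches at a position without consuming any
    # characters, so the next attempt starts just one character later.
    return $seq =~ /(?=([acgt]{$k}))/gi;
}

my @found = kmers('GATTACA', 4);
print "@found\n";    # GATT ATTA TTAC TACA
```

Without the lookahead, a plain `/([acgt]{4})/g` would consume four characters per match and return only the non-overlapping 4-mers.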

For your sample data, this gives (first record only, for brevity):
NC_001422.1 Enterobacteria phage phiX174 sensu lato, complete genome 
Found sequences:
GAGT => 2
AGTT => 2
TTAT => 2
CATG => 2
ATGA => 3
TGAC => 2
CGCA => 2
AGTT => 2
ACTT => 2
tttt => 3
tttt => 3
tttt => 3
GGAT => 2
GATA => 2
ATAT => 2
TATT => 2
ATGA => 3
TGAG => 2
GAGT => 2
AAAA => 2
AAAA => 2
ACTT => 2
TGAG => 2
GGAT => 2
GATA => 2
tata => 2
tata => 2
TTAT => 2
TATG => 2
ATAT => 2
TATT => 2
GCCG => 2
TATG => 2
GCCG => 2
CGCA => 2
CATG => 2
ATGA => 3
TGAC => 2

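Note: because the loop walks the original list of matches, the output follows the order of the data, and you'll see TGAC in there twice because... it is in there twice.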
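If you want to keep the order of first appearance but list each repeated 4-mer only once, one option is a `%seen` hash inside the `grep`. This is a minimal sketch on a made-up sequence. The `%seen` and `@repeated` names are assumptions for illustration.

```perl
use strict;
use warnings;

# Stand-in data: overlapping 4-mers of a short, made-up sequence.
my @sequences = 'ACGTACGTTACGT' =~ /(?=([acgt]{4}))/gi;

my %count_of;
$count_of{$_}++ for @sequences;

# %seen remembers which 4-mers were already reported, so each repeated
# 4-mer is kept once, at its first position in the sequence.
my %seen;
my @repeated = grep { $count_of{$_} > 1 && !$seen{$_}++ } @sequences;

print "$_ => $count_of{$_}\n" for @repeated;
```

For this input the loop prints `ACGT => 3` followed by `TACG => 2`.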

Alternatively, you could do this instead:
foreach my $sequence ( sort { $count_of{$b} <=> $count_of{$a} }
                       grep { $count_of{$_} > 1 }
                       keys %count_of ) {
    print $sequence, " => ", $count_of{$sequence}, "\n";
}
print "\n";

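This discards any 4-mer with fewer than 2 occurrences, lists each remaining 4-mer once, and sorts by frequency, most frequent first.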
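The question mentions that the real file has no blank lines between records. In that case paragraph mode would likely read the whole file as a single record. One possible workaround, sketched here under that assumption, is to split records on the `>` header lines instead. The `read_fasta` helper and the two-record input are made up for illustration.

```perl
use strict;
use warnings;

# Parse FASTA text line by line, so no blank line between records is needed.
# read_fasta() is a hypothetical helper; it returns [header, sequence] pairs.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^>(.*)/ ) {
            push @records, [ $1, '' ];    # a header starts a new record
        }
        elsif (@records) {
            $line =~ s/\s+//g;            # drop stray whitespace
            $records[-1][1] .= $line;     # extend the current sequence
        }
    }
    return @records;
}

# Made-up two-record input, read through an in-memory filehandle.
my $fasta = ">seq1 demo\nACGTAC\nGTAA\n>seq2 demo\nTTTTT\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";

for my $record ( read_fasta($fh) ) {
    my ( $header, $seq ) = @$record;
    my %count_of;
    $count_of{$_}++ for $seq =~ /(?=([acgt]{4}))/gi;
    print "$header\n";
    print "$_ => $count_of{$_}\n"
        for grep { $count_of{$_} > 1 } sort keys %count_of;
}
close $fh;
```

To use it on a real file, open that file instead of the in-memory string. The counting stays case-sensitive, as in the script above, so `tttt` and `TTTT` are counted separately. Apply `lc` to each match if that is not what you want.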

Regarding "regex - Finding all repeated 4-mers in a DNA sequence - Perl", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44796788/
