
arrays - Group hash keys that fall within a numeric range

Reposted. Author: 行者123. Updated: 2023-12-01 12:20:33

I have a data set that combines the results of several different attempts (attempt1 to attempt3) to identify positions in a genome:

source  chromosome1 bp1 chromosome2 bp2
attempt1 2L 5890205 2L 5890720
attempt2 2L 5890205 2L 5890721
attempt1 2L 22220720 2L 22255744
attempt1 3L 15568694 3L 15568866
attempt3 3R 14006279 3R 14008254
attempt1 3R 14006281 3R 14008253
attempt2 3R 14006282 3R 14008254
attempt3 3R 14006286 3R 14008254
attempt1 3R 32060908 3R 32061196
attempt1 3R 32066206 3R 32068392
attempt3 3R 32066206 3R 32068392
attempt2 3R 32066207 3R 32068393
attempt2 X 4574312 X 4576608
attempt1 X 4574313 X 4576607
attempt3 X 4574313 X 4576608

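I want to find the positions that the different attempts have identified and group them together, allowing some margin of error. For example, I want to classify the first two lines...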

source  chromosome1 bp1 chromosome2 bp2
attempt1 2L 5890205 2L 5890720
attempt2 2L 5890205 2L 5890721

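...as a single event (event 1) that was identified by two different attempts (attempt1 and attempt2). I want to classify such instances as a single event only when the different attempts:

  • Agree on the position of bp1 +/- 5 (i.e. within the window 5890200..5890210)
  • Identify the same chromosome1 and chromosome2 (2L)
  • Agree on the position of bp2 +/- 5 (i.e. within the window 5890715..5890725)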

I tried to achieve this by using each chromosome and bp as a separate key in a hash:

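my %SVs;
my $header;

# Make hash ($in is assumed to be an already-opened input filehandle)
while(<$in>){
    chomp;
    # Keep the header line aside and skip it
    if ($. == 1){
        $header = $_;
        next;
    }
    my ($source, $chromosome1, $bp1, $chromosome2, $bp2) = split;

    # Store each full line under chromosome1 -> bp1 -> chromosome2 -> bp2
    push @{$SVs{$chromosome1}{$bp1}{$chromosome2}{$bp2}}, $_;

}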

...and then used a sliding-window approach around the bp1 and bp2 values of each line:

use Data::Dumper;

my %events;
for my $chr1 ( sort keys %SVs ){
    # keys() needs a real hash: calling it on a hash reference was
    # experimental and was removed in Perl 5.24, so dereference explicitly
    for my $bp1 ( sort { $a <=> $b } keys %{ $SVs{$chr1} } ){
        my $w1_start = ( $bp1 - 5 );
        my $w1_end = ( $bp1 + 5 );
        my $window1 = "$w1_start-$w1_end";

        for my $chr2 ( sort keys %{ $SVs{$chr1}{$bp1} } ){
            for my $bp2 ( sort { $a <=> $b } keys %{ $SVs{$chr1}{$bp1}{$chr2} } ){

                my $w2_start = ( $bp2 - 5 );
                my $w2_end = ( $bp2 + 5 );
                my $window2 = "$w2_start-$w2_end";

                # $bp1 always lies inside its own window, so this matches exactly once
                for ( $w1_start .. $w1_end ){
                    if ($bp1 == $_){
                        push @{$events{$chr1}{$window1}}, @{$SVs{$chr1}{$bp1}{$chr2}{$bp2}};
                    }
                }

                # Likewise for $bp2 and its window
                for ( $w2_start .. $w2_end ){
                    if ($bp2 == $_){
                        push @{$events{$chr2}{$window2}}, @{$SVs{$chr1}{$bp1}{$chr2}{$bp2}};
                    }
                }

            }
        }
    }
}

print Dumper \%events;

This achieves part of what I want, but I can't figure out how to get my desired output:

event   source  chromosome1 bp1 chromosome2 bp2
1 attempt1 2L 5890205 2L 5890720
1 attempt2 2L 5890205 2L 5890721
2 attempt1 2L 22220720 2L 22255744
3 attempt1 3L 15568694 3L 15568866
4 attempt3 3R 14006279 3R 14008254
4 attempt1 3R 14006281 3R 14008253
4 attempt2 3R 14006282 3R 14008254
4 attempt3 3R 14006286 3R 14008254
5 attempt1 3R 32060908 3R 32061196
6 attempt1 3R 32066206 3R 32068392
6 attempt3 3R 32066206 3R 32068392
6 attempt2 3R 32066207 3R 32068393
7 attempt2 X 4574312 X 4576608
7 attempt1 X 4574313 X 4576607
7 attempt3 X 4574313 X 4576608

Best answer

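The following defines each equivalence class by the last entry added to it (based on my understanding of your comments above). It reads the records in order, so it relies on the input already being sorted by chromosome and position, as the sample data is: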

#!/usr/bin/env perl

use strict;
use warnings;

run(\*DATA);

sub run {
    my $fh = shift;
    my @header = split ' ', scalar <$fh>;

    # Each element of @events is one equivalence class (array ref of records)
    my @events = ([ get_next_event($fh, \@header)]);

    while (my $event = get_next_event($fh, \@header)) {
        # change the -1 in the second subscript to 0
        # if you want to always compare to the first
        # event added to the equivalence class
        if (same_event($events[-1][-1], $event, 5)) {
            push @{ $events[-1] }, $event;
            next;
        }

        push @events, [ $event ];
    }

    print join("\t", event => @header), "\n";
    for my $i (1 .. @events) {
        for my $ev (@{ $events[$i - 1] }) {
            print join("\t", $i, @{$ev}{@header}), "\n";
        }
    }
}

# Read one line and return it as a hash ref keyed by the header names
sub get_next_event {
    my $fh = shift;
    my $header = shift;
    return unless defined(my $line = <$fh>);
    return unless $line =~ /\S/;

    my %event;
    @event{ @$header } = split ' ', $line;
    return \%event;
}

# True when both records share chromosomes and both breakpoints
# differ by at most $threshold
sub same_event {
    my ($x, $y, $threshold) = @_;

    return if $x->{chromosome1} ne $y->{chromosome1};
    return if $x->{chromosome2} ne $y->{chromosome2};
    return if abs($x->{bp1} - $y->{bp1}) > $threshold;
    return if abs($x->{bp2} - $y->{bp2}) > $threshold;
    return 1;
}
__DATA__
source chromosome1 bp1 chromosome2 bp2
attempt1 2L 5890205 2L 5890720
attempt2 2L 5890205 2L 5890721
attempt1 2L 22220720 2L 22255744
attempt1 3L 15568694 3L 15568866
attempt3 3R 14006279 3R 14008254
attempt1 3R 14006281 3R 14008253
attempt2 3R 14006282 3R 14008254
attempt3 3R 14006286 3R 14008254
attempt1 3R 32060908 3R 32061196
attempt1 3R 32066206 3R 32068392
attempt3 3R 32066206 3R 32068392
attempt2 3R 32066207 3R 32068393
attempt2 X 4574312 X 4576608
attempt1 X 4574313 X 4576607
attempt3 X 4574313 X 4576608

Output:

event   source  chromosome1 bp1 chromosome2 bp2
1 attempt1 2L 5890205 2L 5890720
1 attempt2 2L 5890205 2L 5890721
2 attempt1 2L 22220720 2L 22255744
3 attempt1 3L 15568694 3L 15568866
4 attempt3 3R 14006279 3R 14008254
4 attempt1 3R 14006281 3R 14008253
4 attempt2 3R 14006282 3R 14008254
4 attempt3 3R 14006286 3R 14008254
5 attempt1 3R 32060908 3R 32061196
6 attempt1 3R 32066206 3R 32068392
6 attempt3 3R 32066206 3R 32068392
6 attempt2 3R 32066207 3R 32068393
7 attempt2 X 4574312 X 4576608
7 attempt1 X 4574313 X 4576607
7 attempt3 X 4574313 X 4576608
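
The comment in the answer's loop mentions comparing against the first member of an event instead of the last, and the single pass depends on the input order. A minimal sketch of both points follows, assuming the same record layout. The names group_events and close_enough and the $anchor parameter are hypothetical, introduced only for this illustration, and the explicit sort is an addition that is not part of the original answer.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: sort the records, then group them in a single pass.
#   $records   - array ref of hash refs (chromosome1, bp1, chromosome2, bp2)
#   $threshold - maximum allowed difference in bp1 and in bp2
#   $anchor    - -1 compares to the last member of the current event,
#                 0 compares to its first member
sub group_events {
    my ($records, $threshold, $anchor) = @_;

    # Sorting first means the caller does not have to supply ordered input
    my @sorted = sort {
           $a->{chromosome1} cmp $b->{chromosome1}
        or $a->{chromosome2} cmp $b->{chromosome2}
        or $a->{bp1}         <=> $b->{bp1}
        or $a->{bp2}         <=> $b->{bp2}
    } @$records;

    my @events;
    for my $rec (@sorted) {
        if (@events and close_enough($events[-1][$anchor], $rec, $threshold)) {
            push @{ $events[-1] }, $rec;   # extend the current event
        }
        else {
            push @events, [$rec];          # start a new event
        }
    }
    return \@events;
}

# True when chromosomes match and both breakpoints are within $threshold
sub close_enough {
    my ($x, $y, $threshold) = @_;
    return 0 if $x->{chromosome1} ne $y->{chromosome1};
    return 0 if $x->{chromosome2} ne $y->{chromosome2};
    return 0 if abs($x->{bp1} - $y->{bp1}) > $threshold;
    return 0 if abs($x->{bp2} - $y->{bp2}) > $threshold;
    return 1;
}

# The four 3R rows of event 4 from the sample data, deliberately shuffled
my @records = map {
    my %r;
    @r{qw(source chromosome1 bp1 chromosome2 bp2)} = split ' ', $_;
    \%r;
} (
    'attempt3 3R 14006286 3R 14008254',
    'attempt1 3R 14006281 3R 14008253',
    'attempt3 3R 14006279 3R 14008254',
    'attempt2 3R 14006282 3R 14008254',
);

for my $anchor (-1, 0) {
    my $events = group_events(\@records, 5, $anchor);
    printf "anchor %d: %d event(s)\n", $anchor, scalar @$events;
}
```

With anchor -1 the four rows chain into one event, because each bp1 is within 5 of the previous one. With anchor 0 the row at 14006286 is 7 away from the first member (14006279), so it starts a second event. This is still a greedy single pass: records with nearby bp1 but distant bp2 that interleave after sorting could split an event, so treat it as a starting point rather than a complete clustering method.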

Regarding "arrays - Group hash keys that fall within a numeric range", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44570474/
