performance - Why is my program so slow when using Tie::File?

Reposted by 行者123. Last updated: 2023-12-04 03:13:27

#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;
use Data::Dumper;
use Benchmark;

my $t0 = Benchmark->new;

# all files in the current folder with $ext will be input.
# Default $ext is "pileup"
# if entered, second user entered input will be set to $ext
my $ext = "pileup";
if(exists $ARGV[1]) {
    $ext = $ARGV[1];
}

# open current directory & store filenames with $ext into @pileupfiles
opendir (DIR, ".");
my @pileupfiles = grep {-f && /\.$ext$/} readdir DIR;

my $dnasegment;
my $pos;
my $total;
my $g_total;
my @index; #hold current index for each tied file
my @totalfiles; #hold total files in each sub-index

# $filenum is the iterator that cycles through all pileup files whose names are stored in @pileupfiles
my $filenum = 0;
# @tied is an array holding all arrays of tied files
my @tied;
# array of the current line number for each @file, 
my @linenum;
# tie each file to an array that is an element of the @tied array
while($filenum < scalar @pileupfiles) {
    my @file;
    tie @file, 'Tie::File', $pileupfiles[$filenum], recsep => "\n"  or die;
    push(@tied, [@file]);
    # set each line's value of $linenum to 0
    push(@linenum, 0);
    $filenum++;
}

# open user list of dnasegments
open(LIST, $ARGV[0]);
# open file for output
open(OUT, ">>tempfile.tab");

while(<LIST>) {
    $dnasegment = $_;
    chomp $dnasegment;

    my $exit = 0;
    $pos = 1;
    my %flag;

    while(scalar(keys %flag) < scalar @tied) {
        $total = 0;
        $filenum = 0;
        while($filenum < scalar @tied) {
            if(exists $tied[$filenum][$linenum[$filenum]]) {
                my @line = split(/\t/, $tied[$filenum][$linenum[$filenum]]);
                #print $line[0], "\t", $line[1], "\t", $line[3], "\n\n";
                if($line[0] eq $dnasegment) {
                    if($line[1] == $pos) {
                        $total += $line[3];
                        $linenum[$filenum]++;
                        $g_total += $line[3];
                        print OUT "$dnasegment\t$filenum\t$pos\t$line[3]\n";
                    }
                } else {
                    $flag{$filenum} = 1;
                }
            } else {
                #print $flag, "\n";
                $flag{$filenum} = 1;
            }
            $filenum++;
        }
        if($total > 0) {
            print OUT "$dnasegment\t$total\n";
        }
        $pos++;
    }
}

close (LIST);
close(OUT);

my $t1 = Benchmark->new;
my $td = timediff($t1, $t0);
print timestr($td), "\n";

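The code above takes every file in the current directory that has the default or user-supplied extension and tallies, for specific entries (column 1 of the input files, where column 1 matches a name listed in the file given on the command line), the total occurrences at each position (column 2 of the input files). The input files are laid out as follows:

File 1: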

    Gm02    11896804    G   2   .,  \'
    Gm02    11896805    G   7   ......, U`
    Gm02    11896806    G   3   .,. Sa
    Gm02    11896807    T   2   .,  U\
    Gm02    11896808    T   2   .,  ZZ
    Gm02    11896809    T   2   .,  ZZ
    Gm02    11896810    T   2   .,  B\
    Gm02    11896811    G   3   .,^!,   B]E
    Gm02    11896812    A   3   T,, BaR
    Gm02    11896822    G   3   .,, B`D

File 2:

    Gm02    11896804    G   3   .,, \'
    Gm02    11896805    G   7   ......, U`
    Gm02    11896806    G   3   .,. Sa
    Gm02    11896807    T   2   .,  U\
    Gm02    11896808    T   2   .,  ZZ
    Gm02    11896809    T   2   .,  ZZ
    Gm02    11896810    T   2   .,  B\
    Gm02    11896811    G   3   .,^!,   B]E
    Gm02    11896812    A   3   T,, BaR
    Gm02    11896813    G   3   .,, B`D

File 3:

    Gm02    11896804    G   3   .,, \'
    Gm02    11896805    G   7   ......, U`
    Gm02    11896806    G   3   .,. Sa
    Gm02    11896807    T   2   .,  U\
    Gm02    11896808    T   2   .,  ZZ
    Gm02    11896809    T   2   .,  ZZ
    Gm02    11896810    T   2   .,  B\
    Gm02    11896811    G   3   .,^!,   B]E
    Gm02    11896812    A   3   T,, BaR
    Gm02    11896833    G   3   .,, B`D

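In this case, the only command-line argument passed to the program would be a text file containing "Gm02".

The hash is used to keep track of positions that have already been processed. In the example files above, all three files are checked for counts at every position from 1 through 11896803 before the first value is encountered at position 11896804. This ensures that every position in every file is checked and summed before the position is incremented.

My question is about performance. I chose Tie::File because I thought it would improve performance, since the files would not all be read into memory. The real data the program has to process is hundreds of thousands of lines long, across dozens of files. At the moment, running on example file 1 alone takes 42 seconds (41.96 usr + 0.00 sys = 41.96 CPU), and running on all 3 example files takes 110 seconds (109.76 usr + 0.00 sys = 109.76 CPU). Any information on why this program runs so slowly, or any suggestion on how to speed it up, would be greatly appreciated.

Edit, 10:17 PM EST: the output of the program is as follows: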

Gm02    0   11896804    2
Gm02    1   11896804    3
Gm02    2   11896804    3
Gm02    8
Gm02    0   11896805    7
Gm02    1   11896805    7
Gm02    2   11896805    7
Gm02    21
Gm02    0   11896806    3
Gm02    1   11896806    3
Gm02    2   11896806    3
Gm02    9
Gm02    0   11896807    2
Gm02    1   11896807    2
Gm02    2   11896807    2
Gm02    6
Gm02    0   11896808    2
Gm02    1   11896808    2
Gm02    2   11896808    2
Gm02    6
Gm02    0   11896809    2
Gm02    1   11896809    2
Gm02    2   11896809    2
Gm02    6
Gm02    0   11896810    2
Gm02    1   11896810    2
Gm02    2   11896810    2
Gm02    6
Gm02    0   11896811    3
Gm02    1   11896811    3
Gm02    2   11896811    3
Gm02    9
Gm02    0   11896812    3
Gm02    1   11896812    3
Gm02    2   11896812    3
Gm02    9
Gm02    1   11896813    3
Gm02    3
Gm02    0   11896822    3
Gm02    3
Gm02    2   11896833    3
Gm02    3
Gm02    0   11896804    2
Gm02    1   11896804    3
Gm02    5
Gm02    0   11896805    7
Gm02    1   11896805    7
Gm02    14
Gm02    0   11896806    3
Gm02    1   11896806    3
Gm02    6
Gm02    0   11896807    2
Gm02    1   11896807    2
Gm02    4
Gm02    0   11896808    2
Gm02    1   11896808    2
Gm02    4
Gm02    0   11896809    2
Gm02    1   11896809    2
Gm02    4
Gm02    0   11896810    2
Gm02    1   11896810    2
Gm02    4
Gm02    0   11896811    3
Gm02    1   11896811    3
Gm02    6
Gm02    0   11896812    3
Gm02    1   11896812    3
Gm02    6
Gm02    1   11896813    3
Gm02    3
Gm02    0   11896822    3
Gm02    3
Gm02    0   11896804    2
Gm02    2
Gm02    0   11896805    7
Gm02    7
Gm02    0   11896806    3
Gm02    3
Gm02    0   11896807    2
Gm02    2
Gm02    0   11896808    2
Gm02    2
Gm02    0   11896809    2
Gm02    2
Gm02    0   11896810    2
Gm02    2
Gm02    0   11896811    3
Gm02    3
Gm02    0   11896812    3
Gm02    3
Gm02    0   11896822    3
Gm02    3

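Best answer

I would say "because you are using Tie::File", except that you are not actually using it beyond the following lines of code. The expression `[@file]` copies every line of the tied array into a new, ordinary anonymous array, so the whole file is read into memory and the tie is never consulted again: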

my @file;
tie @file, 'Tie::File', $pileupfiles[$filenum], recsep => "\n"  or die;
push(@tied, [@file]);

You might as well have written

open(my $fh, '<', $pileupfiles[$filenum]) or die $!;
push(@tied, [ <$fh> ]);

Perhaps you meant

tie my @file, 'Tie::File', $pileupfiles[$filenum], recsep => "\n"  or die;
push(@tied, \@file);
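
The difference between the two `push` forms can be checked with Perl's built-in `tied`, which returns the underlying object for a tied variable and undef otherwise. The following is a minimal sketch, not the asker's code: the temporary file and its three lines are made up for illustration, and the variable names `$copy`, `$ref`, `$copy_is_tied` and `$ref_is_tied` are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Create a small throwaway file to tie (made-up sample data).
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "line $_\n" for 1 .. 3;
close $fh or die "close failed: $!";

tie my @file, 'Tie::File', $path, recsep => "\n"
    or die "Cannot tie $path: $!";

my $copy = [@file];   # reads every record into a new, ordinary array
my $ref  = \@file;    # a reference to the tied array itself

# tied() returns the Tie::File object for a tied array, undef otherwise.
my $copy_is_tied = tied(@$copy) ? 1 : 0;
my $ref_is_tied  = tied(@$ref)  ? 1 : 0;

print "copy is tied: ", ($copy_is_tied ? "yes" : "no"), "\n";
print "ref is tied:  ", ($ref_is_tied  ? "yes" : "no"), "\n";

untie @file;
```

Only the reference keeps going through Tie::File on each element access; the copy is a plain in-memory array that happens to have been filled from the file.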

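Then we would be back to my original answer. Tie::File may reduce development time in some situations, but it will not be the fastest solution available, and it is likely to use more memory than necessary.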
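
As one possible alternative to Tie::File, each file can be read once with an ordinary filehandle while the counts are accumulated in a hash keyed by segment and position. The sketch below assumes the tab-separated layout the question's code splits on (name, position, base, count) and assumes the per-position totals fit in memory. `sum_pileups` is a hypothetical helper name, the two inputs are tiny made-up files, and the per-file output lines of the original program are not reproduced.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Sum column 4 per (segment, position) across files, reading each file once
# with an ordinary filehandle. sum_pileups is a hypothetical helper name.
sub sum_pileups {
    my ($wanted, @paths) = @_;   # $wanted: hashref of segment names to keep
    my %total;                   # $total{$segment}{$pos} = summed count
    for my $path (@paths) {
        open my $in, '<', $path or die "Cannot open $path: $!";
        while (my $line = <$in>) {
            chomp $line;
            my ($segment, $pos, undef, $count) = split /\t/, $line;
            # Skip short or blank lines and segments that were not requested.
            next unless defined $count && $wanted->{$segment};
            $total{$segment}{$pos} += $count;
        }
        close $in;
    }
    return \%total;
}

# Tiny made-up inputs in the same tab-separated layout as the samples.
my @paths;
for my $rows (["Gm02\t11896804\tG\t2", "Gm02\t11896822\tG\t3"],
              ["Gm02\t11896804\tG\t3", "Gm02\t11896813\tG\t3"]) {
    my ($fh, $path) = tempfile(UNLINK => 1);
    print {$fh} "$_\n" for @$rows;
    close $fh or die "close failed: $!";
    push @paths, $path;
}

my $total = sum_pileups({ Gm02 => 1 }, @paths);
for my $segment (sort keys %$total) {
    for my $pos (sort { $a <=> $b } keys %{ $total->{$segment} }) {
        print "$segment\t$pos\t$total->{$segment}{$pos}\n";
    }
}
```

This visits only the positions that actually occur in the files, rather than counting up from position 1. Whether it is fast enough for the real data would need to be measured.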

By the way, `exists` makes no sense on array elements.

if (exists $tied[$filenum][$linenum[$filenum]])

is a poor way of writing

if (defined $tied[$filenum][$linenum[$filenum]])

or

if ($linenum[$filenum] < @{ $tied[$filenum] })
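
The three tests can give different answers on an ordinary array, which is a small illustration of why `exists` is a poor fit here. In this sketch the array `@lines` and the result variables are hypothetical names, and the array is left deliberately sparse.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @lines;
$lines[2] = 'third';    # elements 0 and 1 are never assigned

my $idx = 0;

# exists reports whether the element was ever initialised, not whether
# the index falls inside the array.
my $by_exists  = (exists  $lines[$idx]) ? 1 : 0;
# defined reports whether the element currently holds a defined value.
my $by_defined = (defined $lines[$idx]) ? 1 : 0;
# Comparing the index with the array length answers "is there a line $idx?"
my $in_range   = ($idx < @lines)        ? 1 : 0;

print "exists:   $by_exists\n";
print "defined:  $by_defined\n";
print "in range: $in_range\n";
```

Here index 0 lies inside the array, yet `exists` is false because the element was never assigned; the length comparison expresses the intended bounds check directly.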

Regarding "performance - Why is my program so slow when using Tie::File?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/14820666/
