
performance - Fast loading of a large hash table in Perl


I have about 30 text files with the following structure:

wordleft1|wordright1
wordleft2|wordright2
wordleft3|wordright3
...

The total file size is about 1 GB, containing roughly 32 million lines of word combinations.

I have tried a few approaches to load the files as fast as possible and store the combinations in a hash:

$hash{$wordleft} = $wordright
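
For reference, the straightforward line-by-line load looks roughly like this (a minimal sketch; the words*.txt glob and file layout are assumptions, not the actual paths):

#!/usr/bin/env perl
use strict;
use warnings;

my %hash;

# Assumed layout: ~30 files matching words*.txt, one "left|right" pair per line
for my $file (glob 'words*.txt') {
    open my $fh, '<', $file or die "Can't open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($wordleft, $wordright) = split /\|/, $line, 2;
        $hash{$wordleft} = $wordright;
    }
    close $fh;
}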

Opening the files one by one and reading them line by line takes about 42 seconds. I then stored the hash with the Storable module:

store \%hash, $filename

and loaded the data back with

$hashref = retrieve $filename

which brings the time down to about 28 seconds. I am using a fast SSD and a fast CPU, and there is enough RAM to hold all the data (it takes roughly 7 GB).
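For completeness, the Storable round trip described above fits together roughly like this (a minimal sketch; the wordpairs.storable cache file name and the single sample pair are assumptions standing in for the full data set):

use strict;
use warnings;
use Storable qw(store retrieve);

my $filename = 'wordpairs.storable';            # assumed cache file name
my %hash     = ( wordleft1 => 'wordright1' );   # stands in for the full 32M-pair hash

# One-time: serialize the populated hash to disk
store \%hash, $filename or die "store to $filename failed";

# Later runs: load the whole structure back in one call
my $hashref = retrieve $filename;
print $hashref->{wordleft1}, "\n";   # prints "wordright1"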

I am looking for a faster way to load this data into RAM (for various reasons I cannot keep it resident there).

Best Answer

You could try Dan Bernstein's CDB file format with a tied hash, which requires minimal code changes. You may need to install CDB_File. On my laptop, the cdb file opens very quickly and it can do roughly 200-250k lookups per second. Here is an example script to create, use, and benchmark a cdb:

test_cdb.pl

#!/usr/bin/env perl

use warnings;
use strict;

use Benchmark qw(:all) ;
use CDB_File 'create';
use Time::HiRes qw( gettimeofday tv_interval );

scalar @ARGV or die "usage: $0 number_of_keys seconds_to_benchmark\n";
my ($size) = $ARGV[0] || 1000;
my ($seconds) = $ARGV[1] || 10;

my $t0;
tic();

# Create CDB
my ($file, %data);

%data = map { $_ => 'something' } (1..$size);
print "Created $size element hash in memory\n";
toc();

$file = 'data.cdb';
create %data, $file, "$file.$$";
my $bytes = -s $file;
print "Created data.cdb [ $size keys and values, $bytes bytes]\n";
toc();

# Read from CDB
my $c = tie my %h, 'CDB_File', 'data.cdb' or die "tie failed: $!\n";
print "Opened data.cdb as a tied hash.\n";
toc();

timethese( -1 * $seconds, {
    'Pick Random Key'    => sub { int rand $size },
    'Fetch Random Value' => sub { $h{ int rand $size }; },
});

tic();
print "Fetching Every Value\n";
for (0..$size) {
    no warnings; # Useless use of hash element
    $h{ $_ };
}
toc();

sub tic {
    $t0 = [gettimeofday];
}

sub toc {
    my $t1 = [gettimeofday];
    my $elapsed = tv_interval( $t0, $t1 );
    $t0 = $t1;
    print "==> took $elapsed seconds\n";
}

Output (1,000,000 keys, tested over 10 seconds)

./test_cdb.pl 1000000 10

Created 1000000 element hash in memory
==> took 2.882813 seconds
Created data.cdb [ 1000000 keys and values, 38890944 bytes]
==> took 2.333624 seconds
Opened data.cdb as a tied hash.
==> took 0.00015 seconds
Benchmark: running Fetch Random Value, Pick Random Key for at least 10 CPU seconds...
Fetch Random Value: 10 wallclock secs (10.46 usr + 0.01 sys = 10.47 CPU) @ 236984.72/s (n=2481230)
Pick Random Key: 9 wallclock secs (10.11 usr + 0.02 sys = 10.13 CPU) @ 3117208.98/s (n=31577327)
Fetching Every Value
==> took 3.514183 seconds

Output (10,000,000 keys, tested over 10 seconds)

./test_cdb.pl 10000000 10

Created 10000000 element hash in memory
==> took 44.72331 seconds
Created data.cdb [ 10000000 keys and values, 398890945 bytes]
==> took 25.729652 seconds
Opened data.cdb as a tied hash.
==> took 0.000222 seconds
Benchmark: running Fetch Random Value, Pick Random Key for at least 10 CPU seconds...
Fetch Random Value: 14 wallclock secs ( 9.65 usr + 0.35 sys = 10.00 CPU) @ 209811.20/s (n=2098112)
Pick Random Key: 12 wallclock secs (10.40 usr + 0.02 sys = 10.42 CPU) @ 2865335.22/s (n=29856793)
Fetching Every Value
==> took 38.274356 seconds
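
Applied to the pipe-delimited word files from the question, the one-time conversion and the later lookups might look roughly like this (a sketch under assumptions: the words*.txt glob, the wordpairs.cdb file name, and the example key are illustrative only):

use strict;
use warnings;
use CDB_File;

my $cdb = 'wordpairs.cdb';   # assumed output file name

# One-time conversion: stream the pipe-delimited files into a cdb
my $maker = CDB_File->new($cdb, "$cdb.$$") or die "CDB_File->new failed: $!";
for my $file (glob 'words*.txt') {          # assumed input layout
    open my $fh, '<', $file or die "Can't open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($wordleft, $wordright) = split /\|/, $line, 2;
        $maker->insert($wordleft, $wordright);
    }
    close $fh;
}
$maker->finish;

# Every later run: tie the cdb (nearly instant) and look up keys as needed
tie my %pairs, 'CDB_File', $cdb or die "tie failed: $!";
print $pairs{'wordleft1'}, "\n";   # hypothetical key

The conversion cost is paid once; every later run only needs the near-instant tie plus the per-lookup cost measured in the benchmark above.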

A similar question on fast loading of a large hash table in Perl can be found on Stack Overflow: https://stackoverflow.com/questions/43125309/
