
Bulk flushing of Java arrays to disk


I have two arrays (int and long) which contain millions of entries. Until now I have been writing them with DataOutputStream and a large buffer, so the disk I/O cost stays low (nio performs more or less the same for me, since I use a huge buffer, so the I/O access cost is low). Specifically, I use:

DataOutputStream dos = new DataOutputStream(
        new BufferedOutputStream(new FileOutputStream("abc.txt"), 1024 * 1024 * 100));

for (int i = 0; i < 220000000; i++) {
    long l = longarray[i];
    dos.writeLong(l);
}
dos.close(); // flushes the buffer; without this the tail of the data never reaches disk

But it takes quite a while (more than 5 minutes) to do that. Actually, what I want is to flush in bulk (some sort of main-memory-to-disk memory mapping). For that, I found a nice approach here and here. However, I can't understand how to use it in my code. Can anybody help me with this, or suggest any other way to do it well?

Best Answer

On my machine, a 3.8 GHz i7 with an SSD:

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;

DataOutputStream dos = new DataOutputStream(
        new BufferedOutputStream(new FileOutputStream("abc.txt"), 32 * 1024));

long start = System.nanoTime();
final int count = 220000000;
for (int i = 0; i < count; i++) {
    long l = i;
    dos.writeLong(l);
}
dos.close();
long time = System.nanoTime() - start;
System.out.printf("Took %.3f seconds to write %,d longs%n",
        time / 1e9, count);

prints

Took 11.706 seconds to write 220,000,000 longs

Using a memory mapped file

import java.io.RandomAccessFile;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import sun.nio.ch.DirectBuffer;

final int count = 220000000;

final FileChannel channel = new RandomAccessFile("abc.txt", "rw").getChannel();
// Map the whole file into memory; 8 bytes per long.
MappedByteBuffer mbb = channel.map(FileChannel.MapMode.READ_WRITE, 0, count * 8L);
mbb.order(ByteOrder.nativeOrder());

long start = System.nanoTime();
for (int i = 0; i < count; i++) {
    long l = i;
    mbb.putLong(l);
}
channel.close();
long time = System.nanoTime() - start;
System.out.printf("Took %.3f seconds to write %,d longs%n",
        time / 1e9, count);

// Only works on Sun/HotSpot/OpenJDK to deallocate buffer.
((DirectBuffer) mbb).cleaner().clean();

final FileChannel channel2 = new RandomAccessFile("abc.txt", "r").getChannel();
MappedByteBuffer mbb2 = channel2.map(FileChannel.MapMode.READ_ONLY, 0, channel2.size());
mbb2.order(ByteOrder.nativeOrder());
assert mbb2.remaining() == count * 8L;
long start2 = System.nanoTime();
for (int i = 0; i < count; i++) {
    long l = mbb2.getLong();
    if (i != l)
        throw new AssertionError("Expected " + i + " but got " + l);
}
channel2.close();
long time2 = System.nanoTime() - start2;
System.out.printf("Took %.3f seconds to read %,d longs%n",
        time2 / 1e9, count);

// Only works on Sun/HotSpot/OpenJDK to deallocate buffer.
((DirectBuffer) mbb2).cleaner().clean();

On my 3.8 GHz i7 this prints

Took 0.568 seconds to write 220,000,000 longs

and on a slower machine prints

Took 1.180 seconds to write 220,000,000 longs
Took 0.990 seconds to read 220,000,000 longs

Is there any other way to avoid creating that? Because I have that array already in main memory and I can't allocate more than 500 MB to do that.

This uses less than 1 KB of heap. If you look at how much memory is used before and after this call, you will usually see no increase at all.
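
A minimal sketch of how to check that claim yourself (my addition, not from the original answer; the file name and size are placeholders matching the example above):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedHeapCheck {
    public static void main(String[] args) throws Exception {
        Runtime rt = Runtime.getRuntime();
        long heapBefore = rt.totalMemory() - rt.freeMemory();

        FileChannel ch = new RandomAccessFile("abc.txt", "rw").getChannel();
        // The ~1.76 GB mapping lives in the OS page cache, not on the Java heap.
        MappedByteBuffer mbb = ch.map(FileChannel.MapMode.READ_WRITE, 0, 220000000L * 8);

        long heapAfter = rt.totalMemory() - rt.freeMemory();
        System.out.printf("Heap before: %,d bytes, after: %,d bytes%n", heapBefore, heapAfter);
        ch.close();
    }
}

The numbers are GC-dependent, but the mapping itself should add nothing to the heap.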

Another thing: does MappedByteBuffer also give efficient loading?

In my experience, using a memory mapped file is by far the fastest approach, because you reduce the number of system calls and copies into memory.
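
Since the question already has the long[] in memory, here is a sketch of flushing it in one bulk copy through a LongBuffer view (my addition; the flush method and longarray name are illustrative):

import java.io.RandomAccessFile;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class BulkFlush {
    static void flush(long[] longarray, String file) throws Exception {
        FileChannel ch = new RandomAccessFile(file, "rw").getChannel();
        MappedByteBuffer mbb = ch.map(FileChannel.MapMode.READ_WRITE, 0, (long) longarray.length * 8);
        mbb.order(ByteOrder.nativeOrder());
        mbb.asLongBuffer().put(longarray); // one bulk copy instead of a per-element loop
        mbb.force();                       // push dirty pages to disk
        ch.close();
    }
}

The asLongBuffer() view inherits the buffer's byte order at creation, so the bulk put stays in native order.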

Because in some article I found that read(buffer) gives better loading performance. (I checked that one; it really is faster: reading a 220 million int/float array took 5 seconds.)

I would like to read that article, because I have never seen that.
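
For reference, loading via FileChannel.read(ByteBuffer) typically looks like the following sketch (my addition, assuming the file was written in native byte order as above; the chunk size is arbitrary):

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.LongBuffer;
import java.nio.channels.FileChannel;

public class BulkRead {
    static long[] load(String file, int count) throws Exception {
        long[] longarray = new long[count];
        FileChannel ch = new RandomAccessFile(file, "r").getChannel();
        // Read in 64 MB chunks so off-heap usage stays bounded.
        ByteBuffer bb = ByteBuffer.allocateDirect(64 * 1024 * 1024).order(ByteOrder.nativeOrder());
        int offset = 0;
        while (offset < count) {
            bb.clear();
            long bytesLeft = (long) (count - offset) * 8;
            if (bytesLeft < bb.capacity())
                bb.limit((int) bytesLeft);
            while (bb.hasRemaining() && ch.read(bb) >= 0) { } // fill the chunk completely
            bb.flip();
            LongBuffer lb = bb.asLongBuffer();
            int n = lb.remaining();
            if (n == 0)
                break;                        // short or truncated file
            lb.get(longarray, offset, n);     // bulk copy out of the direct buffer
            offset += n;
        }
        ch.close();
        return longarray;
    }
}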

Another issue: readLong gives an error when reading from your code's output file.

Part of the performance gain comes from storing values in the native byte order. writeLong/readLong always uses big-endian format, which is much slower on Intel/AMD systems, which are natively little-endian.

You can set the byte order to big-endian, which will slow it down, or you can use the native ordering (DataInput/OutputStream only supports big-endian).
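
A minimal sketch of the mismatch (my addition): the same eight bytes written in native order and then reinterpreted in big-endian order, which is what readLong does.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    public static void main(String[] args) {
        // writeLong always produces big-endian bytes: 00 00 00 00 00 00 00 01 for 1L.
        ByteBuffer nat = ByteBuffer.allocate(8).order(ByteOrder.nativeOrder());
        nat.putLong(1L); // on Intel/AMD this stores 01 00 00 00 00 00 00 00

        // Re-read those bytes as big-endian (a new ByteBuffer defaults to BIG_ENDIAN):
        long misread = ByteBuffer.wrap(nat.array()).getLong(0);
        System.out.println(Long.toHexString(misread)); // 100000000000000 on little-endian CPUs, not 1
    }
}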

Regarding bulk flushing of Java arrays to disk, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/10127455/
