JavaIO : Memory/Performance issue when reading “last lines” of files using bufferedReader

Reposted by: 太空宇宙. Last updated: 2023-11-04 12:23:14

I have two simple sample files with the following data structure:
people.csv

0|John
1|Maria
2|Anne

and

items.csv

0|car|blue
0|bycicle|red
1|phone|gold
2|purse|black
2|book|black

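I need to collect all related lines across all files (lines with the same identifier, in this case the integers 0, 1 or 2) and, once they are collected, do something with them (not relevant to this question). The first group of related lines (a list of strings) should look like this: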

0|John
0|car|blue
0|bycicle|red

The second group of related lines:

1|Maria
1|phone|gold

and so on.

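The real files are about 5 to 10 GB each. The files are sorted on the first column, and the file with the smallest id is opened for reading first. Memory is a constraint (an entire file cannot be read into memory). With this in mind, I wrote the following code, which seems to work well for most of the lines and groups them the way I want. However, the last part (in my code I log once every 250,000 groups) takes significantly longer, and memory usage spikes.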

Main

import java.text.NumberFormat;
import java.util.List;
import java.util.Locale;

public class Main {

    private static int groupCount = 0;
    private static int totalGroupCount = 0;
    private static long start = 0;
    private static int lineCount;

    public static void main(String[] args) {
        GroupedReader groupedReader = new GroupedReader();
        groupedReader.orderReadersOnSmallestId();
        long fullStart = System.currentTimeMillis();
        start = System.currentTimeMillis();
        lineCount = 0;
        while (groupedReader.hasNext()) {
            groupCount++;
            List<String> relatedLines = groupedReader.readNextGroup();
            for (String line : relatedLines) {
                lineCount++;
            }
            totalGroupCount++;
            if (groupCount == 250_000) {
                System.out.println("Building " + NumberFormat.getNumberInstance(Locale.US).format(groupCount) + " groups took " + (System.currentTimeMillis() - start) / 1e3 + " sec");
                groupCount = 0;
                start = System.currentTimeMillis();
            }
        }
        System.out.println("Building " + NumberFormat.getNumberInstance(Locale.US).format(groupCount) + " groups took " + (System.currentTimeMillis() - start) / 1e3 + " sec");
        System.out.println(String.format("Building [ %s ] groups from [ %s ] lines took %s seconds", NumberFormat.getNumberInstance(Locale.US).format(totalGroupCount), NumberFormat.getNumberInstance(Locale.US).format(lineCount), (System.currentTimeMillis() - fullStart) / 1e3));
        System.out.println("all done!");
    }
}

GroupedReader (some methods omitted)

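import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Assumption: StringUtils comes from Apache Commons Lang 3; the original post does not show its imports.
import org.apache.commons.lang3.StringUtils;

public class GroupedReader {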

    private static final String DELIMITER = "|";
    private static final String INPUT_DIR = "src/main/resources/";

    private boolean EndOfFile = true;
    private List<BufferedReader> sortedReaders;
    private TreeMap<Integer, List<String>> cachedLines;
    private List<String> relatedLines;
    private int previousIdentifier;

    public boolean hasNext() {
        return !sortedReaders.isEmpty();
    }

    public List<String> readNextGroup() {
        updateCache();
        EndOfFile = true;
        for (int i = 0; i < sortedReaders.size(); i++) {
            List<String> currentLines = new ArrayList<>();
            try {
                BufferedReader br = sortedReaders.get(i);
                for (String line; (line = br.readLine()) != null;) {
                    int firstDelimiterIndex = StringUtils.ordinalIndexOf(line, DELIMITER, 1);
                    int currentIdentifier = Integer.parseInt(line.substring(0, firstDelimiterIndex));
                    if (previousIdentifier == -1) {
                        // first iteration
                        previousIdentifier = currentIdentifier;
                        relatedLines.add(i + DELIMITER + line);
                        continue;
                    } else if (currentIdentifier > previousIdentifier) {
                        // next identifier, so put the lines in the cache
                        currentLines.add(i + DELIMITER + line);
                        if (cachedLines.get(currentIdentifier) != null) {
                            List<String> local = cachedLines.get(currentIdentifier);
                            local.add(i + DELIMITER + line);
                        } else {
                            cachedLines.put(currentIdentifier, currentLines);
                        }
                        EndOfFile = false;
                        break;
                    } else {
                        // same identifier
                        relatedLines.add(i + DELIMITER + line);
                    }
                }
                if (EndOfFile) {
                    // is this close needed?
                    br.close();
                    sortedReaders.remove(br);
                }
            } catch (NumberFormatException | IOException e) {
                e.printStackTrace();
            }
        }
        if (cachedLines.isEmpty()) cachedLines = null;
        return relatedLines;
    }

    private void updateCache() {
        if (cachedLines != null) {
            previousIdentifier = cachedLines.firstKey();
            relatedLines = cachedLines.get(cachedLines.firstKey());
            cachedLines.remove(cachedLines.firstKey());
        } else {
            previousIdentifier = -1;
            relatedLines = new ArrayList<>();
            cachedLines = new TreeMap<>();
            // root of all evil...?
            System.gc();
        }
    }
}

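I have tried "playing around" with explicitly closing the readers and invoking the garbage collector, but I cannot spot the actual flaw in the code I wrote.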

Question:
What causes the reading to slow down near the end of the files?

Simple system log:

Building 250,000 groups took 0.394 sec
Building 250,000 groups took 0.261 sec
Building 250,000 groups took 0.289 sec
...
Building 250,000 groups took 0.281 sec
Building 250,000 groups took 0.314 sec
Building 211,661 groups took 10.829 sec
Building [ 9,961,661 ] groups from [ 31,991,125 ] lines took 21.016 seconds
all done!

Best answer

System.gc() is only a request; there is no guarantee that a garbage collection will actually happen.
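One way to check what System.gc() actually does on a given JVM is to read the collector counters before and after the call. The following is a minimal sketch using the standard java.lang.management API. The class name GcProbe and its helper methods are made up for illustration and are not part of the original post.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcProbe {

    /** Sums the collection counts of all collectors; collectors that report -1 (undefined) are skipped. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Sums the approximate accumulated collection time in milliseconds, again skipping -1. */
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long countBefore = totalCollections();
        long millisBefore = totalCollectionMillis();

        System.gc(); // a hint to the JVM, which may or may not act on it

        System.out.println("collections since the call: " + (totalCollections() - countBefore));
        System.out.println("GC time since the call: " + (totalCollectionMillis() - millisBefore) + " ms");
    }
}
```

Reading the same two counters around the slow final batch of groups would show whether the extra seconds are spent in garbage collection or somewhere else.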

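If you want a quick way to see where the time is going, add more logging at more points in the code, and reduce groupCount to a smaller number (10,000?) to get a finer breakdown of the timings.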
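The finer-grained logging suggested above could look roughly like the sketch below. It reports elapsed time and heap in use every N groups. IntervalLogger is a hypothetical helper and the interval of 10,000 is just the value floated in the answer; neither comes from the original code.

```java
import java.util.ArrayList;
import java.util.List;

class IntervalLogger {

    private final int interval;
    private final List<String> messages = new ArrayList<>();
    private long intervalStart = System.nanoTime();
    private int sinceLastLog = 0;

    IntervalLogger(int interval) {
        this.interval = interval;
    }

    /** Call once per completed group; prints one line every `interval` groups. */
    void groupDone() {
        sinceLastLog++;
        if (sinceLastLog == interval) {
            long elapsedMs = (System.nanoTime() - intervalStart) / 1_000_000;
            Runtime rt = Runtime.getRuntime();
            long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
            String message = sinceLastLog + " groups took " + elapsedMs
                    + " ms, heap in use " + usedMb + " MB";
            messages.add(message);
            System.out.println(message);
            sinceLastLog = 0;
            intervalStart = System.nanoTime();
        }
    }

    /** Everything logged so far, kept so that the output can be inspected afterwards. */
    List<String> messages() {
        return messages;
    }

    public static void main(String[] args) {
        IntervalLogger logger = new IntervalLogger(10_000);
        for (int i = 0; i < 25_000; i++) {
            logger.groupDone(); // in the real loop this would follow readNextGroup()
        }
    }
}
```

Logging heap usage next to the timing makes it easier to see whether the slow tail coincides with the memory spike described in the question.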

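If you want to profile properly and get a better understanding, use the tools that ship with the JDK: either the older VisualVM or the newer Mission Control.

Both can be found in the bin folder of the JDK installation (this holds for Oracle JDK 8 and earlier; later JDK releases no longer bundle them, and both are available as separate downloads).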
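For comparison while profiling, the grouping described in the question can also be written with a single look-ahead line per reader, so that no cache map and no System.gc() call are involved. This is only a sketch under stated assumptions, not the asker's method and not a confirmed fix: it assumes every line is non-empty and starts with an integer id followed by `|`, and that each input is sorted by that id. The class LookAheadGroupReader and its members are invented for illustration. Unlike the original code, it does not prefix lines with the file index and does not close the readers.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LookAheadGroupReader {

    /** One input plus the single line that has been read but not yet handed out. */
    private static final class Source {
        final BufferedReader reader;
        String pending; // null once this input is exhausted

        Source(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.pending = reader.readLine();
        }

        int pendingId() {
            return Integer.parseInt(pending.substring(0, pending.indexOf('|')));
        }

        void advance() throws IOException {
            pending = reader.readLine();
        }
    }

    private final List<Source> sources = new ArrayList<>();

    LookAheadGroupReader(List<BufferedReader> readers) throws IOException {
        for (BufferedReader reader : readers) {
            sources.add(new Source(reader));
        }
    }

    boolean hasNext() {
        for (Source source : sources) {
            if (source.pending != null) {
                return true;
            }
        }
        return false;
    }

    /** Returns all lines carrying the smallest pending id; an empty list if nothing is left. */
    List<String> readNextGroup() throws IOException {
        int smallest = Integer.MAX_VALUE;
        for (Source source : sources) {
            if (source.pending != null) {
                smallest = Math.min(smallest, source.pendingId());
            }
        }
        List<String> group = new ArrayList<>();
        for (Source source : sources) {
            while (source.pending != null && source.pendingId() == smallest) {
                group.add(source.pending);
                source.advance();
            }
        }
        return group;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-ins for the two sample files from the question.
        BufferedReader people = new BufferedReader(new StringReader("0|John\n1|Maria\n2|Anne\n"));
        BufferedReader items = new BufferedReader(new StringReader(
                "0|car|blue\n0|bycicle|red\n1|phone|gold\n2|purse|black\n2|book|black\n"));
        LookAheadGroupReader grouped = new LookAheadGroupReader(Arrays.asList(people, items));
        while (grouped.hasNext()) {
            System.out.println(grouped.readNextGroup());
        }
    }
}
```

Because each reader holds at most one pending line, the memory needed beyond the current group stays constant regardless of file size, which gives a simple baseline to compare the original implementation against.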

The original question, "JavaIO : Memory/Performance issue when reading “last lines” of files using bufferedReader", can be found on Stack Overflow: https://stackoverflow.com/questions/38618793/
