python - Why does readlines() read much more than sizehint?

Reposted by: 行者123 Updated: 2023-12-01 05:02:34

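Background

I am parsing very large text files (30GB+) in Python 2.7.6. To speed the process up a bit, I split the files into chunks and farm them out to subprocesses using the multiprocessing library. To do this, I iterate through the file in the main process, record the byte positions where I want to split the input file, and pass those byte positions to the subprocesses. Each subprocess then opens the input file and reads in its chunk using file.readlines(chunk_size). However, I'm finding that the chunk read in seems to be much bigger (4x) than the sizehint argument.

Question

Why is the sizehint not being heeded?

Example code

The following code demonstrates my problem: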

import sys

# set test chunk size to 2KB
chunk_size = 1024 * 2

count = 0
chunk_start = 0
chunk_list = []

fi = open('test.txt', 'r')
while True:
    # increment chunk counter
    count += 1

    # calculate new chunk end, advance file pointer
    chunk_end = chunk_start + chunk_size
    fi.seek(chunk_end)

    # advance file pointer to end of current line so chunks don't have broken 
    # lines
    fi.readline() 
    chunk_end = fi.tell()

    # record chunk start and stop positions, chunk number
    chunk_list.append((chunk_start, chunk_end, count))

    # advance start to current end
    chunk_start = chunk_end

    # read a line to confirm we're not past the end of the file
    line = fi.readline()
    if not line:
        break

    # reset file pointer from last line read
    fi.seek(chunk_end, 0)

fi.close()

# This code represents the action taken by subprocesses, but each subprocess
# receives one chunk instead of iterating the list of chunks itself.
with open('test.txt', 'r', 0) as fi:
    # iterate over chunks
    for chunk in chunk_list:
        chunk_start, chunk_end, chunk_num = chunk

        # advance file pointer to chunk start
        fi.seek(chunk_start, 0)

        # print some notes and read in the chunk
        sys.stdout.write("Chunk #{0}: Size: {1} Start {2} Real Start: {3} Stop {4} "
              .format(chunk_num, chunk_end-chunk_start, chunk_start, fi.tell(), chunk_end))
        chunk = fi.readlines(chunk_end - chunk_start)
        print("Real Stop: {0}".format(fi.tell()))

        # write the chunk out to a file for examination
        with open('test_chunk{0}'.format(chunk_num), 'w') as fo:
            fo.writelines(chunk)

Results

I ran this code with an input file (test.txt) of about 23.3KB, and it produced the following output:

Chunk #1: Size: 2052 Start 0 Real Start: 0 Stop 2052 Real Stop: 8193
Chunk #2: Size: 2051 Start 2052 Real Start: 2052 Stop 4103 Real Stop: 10248
Chunk #3: Size: 2050 Start 4103 Real Start: 4103 Stop 6153 Real Stop: 12298
Chunk #4: Size: 2050 Start 6153 Real Start: 6153 Stop 8203 Real Stop: 14348
Chunk #5: Size: 2050 Start 8203 Real Start: 8203 Stop 10253 Real Stop: 16398
Chunk #6: Size: 2050 Start 10253 Real Start: 10253 Stop 12303 Real Stop: 18448
Chunk #7: Size: 2050 Start 12303 Real Start: 12303 Stop 14353 Real Stop: 20498
Chunk #8: Size: 2050 Start 14353 Real Start: 14353 Stop 16403 Real Stop: 22548
Chunk #9: Size: 2050 Start 16403 Real Start: 16403 Stop 18453 Real Stop: 23893
Chunk #10: Size: 2050 Start 18453 Real Start: 18453 Stop 20503 Real Stop: 23893
Chunk #11: Size: 2050 Start 20503 Real Start: 20503 Stop 22553 Real Stop: 23893
Chunk #12: Size: 2048 Start 22553 Real Start: 22553 Stop 24601 Real Stop: 23893

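The reported chunk sizes are each about 2KB, all the start/stop positions line up as they should, and the real file positions reported by fi.tell() appear correct, so I'm fairly sure my chunking algorithm is good. However, the real stop positions show that readlines() is reading far more than the size hint. Also, output files #1 - #8 are 8.0KB each, much larger than the size hint.

Even if my attempt to break chunks only at line endings were wrong, readlines() still should not have to read more than 2KB + one line. Files #9 - #12 get progressively smaller, which makes sense because the chunk start points get closer and closer to the end of the file, and readlines() won't read past the end of the file.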
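
The size of each read can be checked directly from the output above by subtracting each chunk's start from its real stop. A minimal sketch, with the (start, real stop) pairs copied from output lines #1 to #4 and #9; the variable names are only illustrative:

```python
# (start, real_stop) pairs copied from the output above
reported = [
    (0, 8193),       # chunk #1
    (2052, 10248),   # chunk #2
    (4103, 12298),   # chunk #3
    (6153, 14348),   # chunk #4
    (16403, 23893),  # chunk #9, which runs into the end of the file
]

# Bytes actually consumed by each readlines() call
bytes_read = [real_stop - start for start, real_stop in reported]

for (start, _), n in zip(reported, bytes_read):
    print("start {0}: read {1} bytes".format(start, n))
```

Every read that does not run into the end of the file consumes slightly more than 8192 bytes, regardless of the roughly 2050-byte hint.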

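Notes

My test input file just has "<line number>\n" printed on each line, 1-5000.
I tried again with different chunk and input file sizes and got similar results.
The readlines documentation says that the read size may be rounded up to the size of an internal buffer, so I tried opening the file without buffering (as shown), but it made no difference.
I am using this algorithm to split the file because I need to be able to support *.bz2 and *.gz compressed files, and *.gz files do not let me determine the size of the uncompressed file without decompressing it. *.bz2 files don't either, but I can seek 0 bytes from the end of those files and use fi.tell() to get the file size. See my related question.
Before the requirement to support compressed files was added, a previous version of the script used os.path.getsize() as the stop condition on the chunking loop, and readlines seemed to work fine with that method.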

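Best answer

The buffer that the readlines documentation mentions is unrelated to the buffering controlled by the third argument to the open call. The buffer in question is this one in file_readlines: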

static PyObject *
file_readlines(PyFileObject *f, PyObject *args)
{
    long sizehint = 0;
    PyObject *list = NULL;
    PyObject *line;
    char small_buffer[SMALLCHUNK];

where SMALLCHUNK is defined earlier:

#if BUFSIZ < 8192
#define SMALLCHUNK 8192
#else
#define SMALLCHUNK BUFSIZ
#endif

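I don't know where BUFSIZ comes from, but it looks like you are getting the #define SMALLCHUNK 8192 case. In any case, readlines will never use a buffer smaller than 8 KiB, so you should probably make your chunks bigger than that.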
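
If the chunks have to stay small, one possible workaround is to skip readlines(sizehint) and read exactly chunk_end - chunk_start bytes, since the chunk boundaries already fall on line ends. The following is a minimal sketch under that assumption and is not part of the original answer; the helper name read_chunk and the demo file are hypothetical, and the file is opened in binary mode so that the byte offsets mean the same thing on Python 2 and Python 3:

```python
import os
import tempfile


def read_chunk(path, chunk_start, chunk_end):
    """Return the lines stored between two byte offsets of a file.

    Assumes chunk_start and chunk_end both fall on line boundaries.
    """
    with open(path, 'rb') as fi:
        fi.seek(chunk_start)
        data = fi.read(chunk_end - chunk_start)
    # Passing True (keepends) keeps the trailing newline on each line
    return data.splitlines(True)


# Demo: a small file with one number per line, like the question's test input
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(demo_path, 'wb') as fo:
    for i in range(1, 5001):
        fo.write('{0}\n'.format(i).encode('ascii'))

# Bytes 0..9 hold "1\n" through "5\n"
lines = read_chunk(demo_path, 0, 10)
total = sum(len(line) for line in lines)
print(len(lines), total)
```

Because read(n) returns at most n bytes, the amount consumed is bounded by the chunk size instead of by the 8 KiB buffer. One caveat: splitlines also splits on a bare \r, so input containing lone carriage returns would be divided differently than readlines divides it.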

Regarding "python - Why does readlines() read much more than sizehint?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/25755987/
