Python vs. Perl: performance reading a gzipped file

Reposted by: 太空狗 · Updated: 2023-10-29 20:23:58

I have a gzipped data file containing one million lines:

$ zcat million_lines.txt.gz | head
1
2
3
4
5
6
7
8
9
10
...
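
To reproduce the measurements, a file like this can be generated with a few lines of Python. This is only a sketch: the helper name write_numbered_lines and the 10000-line chunk size are assumptions made for illustration, not part of the original question.

```python
import gzip


def write_numbered_lines(path, count):
    """Write the integers 1..count, one per line, to a gzip file at path."""
    with gzip.open(path, "wb") as out:
        # Write in chunks so a million tiny write() calls are avoided
        for start in range(1, count + 1, 10000):
            stop = min(start + 10000, count + 1)
            chunk = "".join("%d\n" % n for n in range(start, stop))
            out.write(chunk.encode("ascii"))
```

Calling write_numbered_lines("million_lines.txt.gz", 1000000) should produce a file equivalent to the one shown above.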

My Perl script that processes this file is as follows:

# read_million.pl
use strict; 

my $file = "million_lines.txt.gz" ;

open MILLION, "gzip -cdfq $file |";

while ( <MILLION> ) {
    chomp $_; 
    if ($_ eq "1000000" ) {
        print "This is the millionth line: Perl\n"; 
        last; 
    }
}

And in Python:

# read_million.py
import gzip

filename = 'million_lines.txt.gz'

fh = gzip.open(filename)

for line in fh:
    line = line.strip()
    if line == '1000000':
        print "This is the millionth line: Python"
        break

For whatever reason, the Python script takes roughly 8 times longer:

$ time perl read_million.pl ; time python read_million.py
This is the millionth line: Perl

real    0m0.329s
user    0m0.165s
sys     0m0.019s
This is the millionth line: Python

real    0m2.663s
user    0m2.154s
sys     0m0.074s

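I tried profiling both scripts, but there really isn't much code to profile. The Python script spends most of its time on for line in fh; the Perl script spends most of its time on if ($_ eq "1000000").
Now, I know that Perl and Python have some expected differences. For example, in Perl I open the filehandle using a subprocess that runs the UNIX gzip command; in Python, I use the gzip library.
What can I do to speed up the Python implementation of this script (even if I never reach the Perl performance)? Perhaps the gzip module in Python is slow (or perhaps I'm using it badly); is there a better solution?
EDIT #1
Here is what the line-by-line profile of read_million.py looks like.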

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     2                                           @profile
     3                                           def main():
     4
     5         1            1      1.0      0.0         filename = 'million_lines.txt.gz'
     6         1          472    472.0      0.0         fh = gzip.open(filename)
     7   1000000      5507042      5.5     84.3         for line in fh:
     8   1000000       582653      0.6      8.9                 line = line.strip()
     9   1000000       443565      0.4      6.8                 if line == '1000000':
    10         1           25     25.0      0.0                         print "This is the millionth line: Python"
    11         1            0      0.0      0.0                         break

EDIT #2:
I have now also tried the subprocess Python module, as suggested by @Kirk Strauser and others. It is faster:
Python "subproc" solution:

# read_million_subproc.py 
import subprocess

filename = 'million_lines.txt.gz'
gzip = subprocess.Popen(['gzip', '-cdfq', filename], stdout=subprocess.PIPE)
for line in gzip.stdout: 
    line = line.strip()
    if line == '1000000':
        print "This is the millionth line: Python"
        break
gzip.wait()

Here is a comparison table of everything I have tried so far:

method                    average_running_time (s)
--------------------------------------------------
read_million.py           2.708
read_million_subproc.py   0.850
read_million.pl           0.393

Best answer

After testing a number of possibilities, it looks like the main culprits here are:

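Comparing apples to oranges: in your original test case, Perl wasn't doing the file I/O or the decompression work; the gzip program was (and it's written in C, so it runs very fast). In that version of the code, you are comparing parallel computation to serial computation.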

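Interpreter startup time: on the vast majority of systems, Python takes longer to begin running (I believe because more files are loaded at startup). On my machine, interpreter startup accounts for about half of the total wall clock time, 30% of the user time, and most of the system time. The actual work done in Python is swamped by startup time, so your benchmark compares startup time as much as it compares the time required to do the work. Later addition: you can further reduce Python's startup overhead by invoking python with the -E switch (disables checking of the PYTHON* environment variables at startup) and the -S switch (disables the automatic import site, which avoids a lot of dynamic sys.path setup and manipulation involving disk I/O, at the cost of cutting off access to any non-builtin libraries).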

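Python's subprocess module is a bit higher level than Perl's open call, and it is implemented in Python (on top of lower-level primitives). The generalized subprocess code takes longer to load (exacerbating the startup time issue) and adds overhead to the process launch itself.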

Python 2's subprocess defaults to unbuffered I/O, so you perform many more system calls unless you pass an explicit bufsize argument (4096 to 8192 seems to work fine).

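The line.strip() call involves more overhead than you might think; function and method calls in Python are more expensive than they really should be, and line.strip() does not alter the str in place the way Perl's chomp does (because Python's str is immutable, while Perl strings are mutable).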
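
The cost of the strip() call can be checked in isolation with timeit. The following is a minimal sketch, not a measurement from the original answer: the function names and the repeat count are assumptions, and the actual numbers will vary by machine and Python version.

```python
import timeit

LINE = "1000000\n"  # one line as the read loop sees it, newline included


def match_with_strip(line):
    # strip() builds a new str here, because there is a newline to remove
    return line.strip() == "1000000"


def match_without_strip(line):
    # Compare against a literal that already carries the newline
    return line == "1000000\n"


def time_both(repeats=200000):
    """Return (seconds_with_strip, seconds_without_strip) for `repeats` calls each."""
    with_strip = timeit.timeit(lambda: match_with_strip(LINE), number=repeats)
    without_strip = timeit.timeit(lambda: match_without_strip(LINE), number=repeats)
    return with_strip, without_strip
```

Note the trade-off: comparing against "1000000\n" only matches when the line ends in exactly one "\n", so it would miss a final line with no trailing newline or a file with "\r\n" line endings.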

A couple of versions of the code will bypass most of these problems. First, optimized subprocess:

#!/usr/bin/env python

import subprocess

# Launch with subprocess in list mode (no shell involved) and
# use a meaningful buffer size to minimize system calls
proc = subprocess.Popen(['gzip', '-cdfq', 'million_lines.txt.gz'], stdout=subprocess.PIPE, bufsize=4096)
# Iterate stdout directly
for line in proc.stdout:
    if line == b'1000000\n':  # Avoid stripping; the pipe yields bytes on Python 3 (b'' is plain str on Python 2)
        print("This is the millionth line: Python")
        break
# Prevent deadlocks by terminating, not waiting, child process
proc.terminate()

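Second, pure Python code based mostly on built-in (C level) APIs (which eliminates most of the extraneous startup overhead, and shows that Python's gzip module is not meaningfully different from the gzip program), ridiculously micro-optimized at the expense of readability, maintainability, brevity, and portability: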

#!/usr/bin/env python

import os

rpipe, wpipe = os.pipe()

def reader():
    import gzip
    FILE = "million_lines.txt.gz"
    os.close(rpipe)
    with gzip.open(FILE) as inf, os.fdopen(wpipe, 'wb') as outf:
        buf = bytearray(16384)  # Reusable buffer to minimize allocator overhead
        while 1:
            cnt = inf.readinto(buf)
            if not cnt: break
            outf.write(buf[:cnt] if cnt != 16384 else buf)

pid = os.fork()
if not pid:
    try:
        reader()
    finally:
        os._exit(0)  # os._exit() requires an exit status; without it the child would raise TypeError and fall through

try:
    os.close(wpipe)
    with os.fdopen(rpipe, 'rb') as f:
        for line in f:
            if line == b'1000000\n':
                print("This is the millionth line: Python")
                break
finally:
    os.kill(pid, 9)
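
For comparison, a single-process baseline that uses only the gzip module can be written as below. It is not from the original answer; it is a sketch, assuming Python 3 and a hypothetical helper named find_line, and is meant only as a simpler reference point against which to time the fork-based version above.

```python
import gzip


def find_line(path, target):
    """Return the 1-based number of the first line equal to target, or None."""
    # Binary mode yields bytes, so encode the target once and skip strip()
    wanted = target.encode("ascii") + b"\n"
    with gzip.open(path, "rb") as fh:
        for number, line in enumerate(fh, 1):
            if line == wanted:
                return number
    return None
```

Unlike the two versions above, this one decompresses and scans in the same process, so it gains nothing from running the two steps in parallel.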

On my local system, in the best of half a dozen runs, the subprocess code takes:

0.173s/0.157s/0.031s wall/user/sys time.

The primitives-based Python code with no external utility programs gets that down to a best time of:

0.147s/0.103s/0.013s

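(though that was an outlier; a good wall clock time was usually more like 0.165). Adding -E -S to the invocation shaves off another 0.01-0.015 seconds of wall clock and user time by eliminating the overhead of setting up the import machinery to handle non-builtins. In other comments, you mention that your Python takes nearly 0.6 seconds to launch while doing absolutely nothing (but otherwise seems to perform similarly to mine), which may indicate that you have considerably more non-default packages or environment customization in place, so -E -S may save you even more.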
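
The startup overhead on a given machine can be estimated by timing an interpreter that does nothing. The sketch below assumes Python 3 (for time.perf_counter); the function names and the default run count are illustrative assumptions, and it reports the best of several runs in the same spirit as the timings above.

```python
import subprocess
import sys
import time


def startup_seconds(extra_flags, runs=5):
    """Best wall-clock time to launch the interpreter and exit immediately."""
    cmd = [sys.executable] + list(extra_flags) + ["-c", "pass"]
    best = None
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.check_call(cmd)
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


def compare_startup(runs=5):
    """Return best startup times with default flags and with -E -S."""
    return {
        "default": startup_seconds([], runs),
        "-E -S": startup_seconds(["-E", "-S"], runs),
    }
```

The measured time includes the cost of spawning the child process from Python, so it slightly overstates the bare interpreter startup; the difference between the two entries is the more meaningful figure.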

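The Perl code, unmodified from what you gave me (other than using 3+ arg open to remove string parsing, and storing the pid returned from open so it can be explicitly killed before exiting), had a best time of: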