python - PyPy is 17x faster than Python. Can Python be sped up? - 6ren

Reposted by: 太空宇宙 | Updated: 2023-11-03 12:54:36

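Solving a recent Advent of Code problem, I found that my default Python was about 40x slower than PyPy. I was able to get that down to about 17x with this code, by limiting calls to len and by limiting global lookups by running the loop inside a function.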
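One way to see why running the loop inside a function helps is to compare the bytecode CPython generates for a module-level name with the bytecode for a function local. The sketch below is only an illustration: `bump_global`, `bump_local`, and `opnames` are hypothetical names that do not appear in the question, and the exact opcode names differ between CPython versions.

```python
import dis

counter = 0  # module-level name, looked up in the globals dict at run time


def bump_global():
    global counter
    counter += 1  # compiles to LOAD_GLOBAL ... STORE_GLOBAL


def bump_local():
    counter = 0
    counter += 1  # uses the function's fast local slots instead
    return counter


def opnames(func):
    """Return the set of opcode names used by func."""
    return {ins.opname for ins in dis.get_instructions(func)}


if __name__ == "__main__":
    print(sorted(opnames(bump_global)))
    print(sorted(opnames(bump_local)))
```

The global version goes through a dictionary lookup on every access, while the local version indexes into an array of slots, which is the saving the question is relying on.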

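Right now, e.py runs in 5.162 seconds on Python 3.6.3 and in 0.297 seconds on PyPy on my machine.

My question is: is this the irreducible speedup of a JIT, or is there some way to speed up the CPython answer (short of extreme means: I could go to Cython/Numba or something)? How do I convince myself that there is nothing more I can do?

See the gist for the input file, a list of numbers.

As described in the problem statement, the numbers represent jump offsets: position += offsets[current], and the offset you just used is incremented by 1. You are done when a jump takes you outside the list.

Here is the example given (the full input that takes 5 seconds is much longer and has larger numbers):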

(0) 3  0  1  -3  - before we have taken any steps.
(1) 3  0  1  -3  - jump with offset 0 (that is, don't jump at all). Fortunately, the instruction is then incremented to 1.
 2 (3) 0  1  -3  - step forward because of the instruction we just modified. The first instruction is incremented again, now to 2.
 2  4  0  1 (-3) - jump all the way to the end; leave a 4 behind.
 2 (4) 0  1  -2  - go back to where we just were; increment -3 to -2.
 2  5  0  1  -2  - jump 4 steps forward, escaping the maze.

The code:

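def run(cmds):
    location = 0   # current index into the offset list
    counter = 0    # number of jumps taken so far
    while 1:
        try:
            cmd = cmds[location]
            # Offsets of 3 or more shrink by one; all others grow by one.
            if cmd >= 3:
                cmds[location] -= 1
            else:
                cmds[location] += 1
            location += cmd
            counter += 1   # count the jump before checking for an exit
            if location < 0:
                # A negative index would silently wrap around in Python,
                # so leaving through the front needs an explicit check.
                print(counter)
                break
        except IndexError:
            # The previous jump went past the end of the list: done.
            print(counter)
            break

if __name__=="__main__":
    with open("input.txt") as f:
        text = f.read().strip().split("\n")
    cmds = [int(cmd) for cmd in text]
    run(cmds)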
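
The jump rule can be checked against the five-element example from the problem statement. The following is a minimal sketch that favors clarity over speed (it calls len on every iteration, which the question deliberately avoids); `count_steps` and its `shrink_large` flag are hypothetical names, with the flag selecting the `>= 3` rule used in the code above.

```python
def count_steps(offsets, shrink_large=False):
    """Return how many jumps it takes to leave the list (in either direction)."""
    offsets = list(offsets)  # work on a copy so the caller's list is untouched
    location = 0
    steps = 0
    while 0 <= location < len(offsets):
        jump = offsets[location]
        if shrink_large and jump >= 3:
            offsets[location] = jump - 1
        else:
            offsets[location] = jump + 1
        location += jump
        steps += 1
    return steps


if __name__ == "__main__":
    example = [0, 3, 0, 1, -3]
    print(count_steps(example))                     # 5, as in the walkthrough
    print(count_steps(example, shrink_large=True))  # 10 with the >= 3 rule
```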

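Edit: I compiled and ran the code with Cython, which brought the runtime down to 2.53 seconds, but that is still almost an order of magnitude slower than PyPy.

Edit: Numba gets me to within 2x.

Edit: The best Cython I could write got down to 1.32 seconds, a little over 4x slower than PyPy.

Edit: Moving the cmd variable into a cdef, as suggested by @viraptor, got the Cython down to 0.157 seconds! That is nearly a full order of magnitude faster, and not that far from regular Python code. Still, I come away from this extremely impressed with the PyPy JIT, which did all this for free!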

Best answer

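As a baseline for Python, I wrote this in C (and then decided to use C++ for more convenient array I/O). It compiles efficiently for x86-64 with clang++. This runs 82x faster than CPython 3.6.2 with the code in the question, on a Skylake x86, so even your faster Python versions are still a couple of factors away from keeping up with near-optimal machine code. (Yes, I looked at the compiler's asm output to check that it did a good job.)

Letting a good JIT or ahead-of-time compiler see the loop logic is the key to performance here. The problem logic is inherently serial, so there is no scope for getting Python to run already-compiled C that does something over the whole array (like NumPy), because there won't be compiled C for this specific problem unless you use Cython or something. Having each step of the problem go back to the CPython interpreter is death for performance, because its slowness isn't hidden by memory bottlenecks or anything else.

Update: transforming the array of offsets into an array of pointers speeds it up by another factor of 1.5x (simple addressing mode + removing an add from the critical-path loop-carried dependency chain, bringing it down to just the 4 cycle L1D load-use latency for a simple addressing mode (when the pointer comes from another load), instead of 6c = 5c + 1c for an indexed addressing mode + add latency).

But I think we can be generous to Python and not expect it to keep up with algorithm transformations to suit modern CPUs :P (Especially because I used 32-bit pointers even in 64-bit mode to make sure the 4585-element array was still only 18kiB, so it fits in the 32kiB L1D cache. Like the Linux x32 ABI does, or the AArch64 ILP32 ABI.)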
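The cache-footprint argument is simple arithmetic, sketched here with the element count and cache size quoted in the answer. The 8-byte case is the usual pointer size on x86-64; `footprint_kib` is a hypothetical helper name.

```python
ELEMENTS = 4585            # array length quoted in the answer
L1D_BYTES = 32 * 1024      # 32 KiB L1D cache quoted in the answer


def footprint_kib(element_bytes, count=ELEMENTS):
    """Size in KiB of an array of `count` elements of the given width."""
    return element_bytes * count / 1024


if __name__ == "__main__":
    for label, size in (("32-bit", 4), ("64-bit", 8)):
        fits = size * ELEMENTS <= L1D_BYTES
        print(f"{label}: {footprint_kib(size):.1f} KiB, fits in L1D: {fits}")
```

With 4-byte elements the array stays just under 18 KiB, while full 8-byte pointers would need almost 36 KiB and no longer fit in a 32 KiB L1D.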

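Also, an alternate expression for the update gets gcc to compile it branchless, like clang does. (It is commented out in the source, and the original perf stat output is left in this answer, because it is interesting to see the effect of branchless vs. branchy with mispredicts.)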

unsigned jumps(int offset[], unsigned size) {
    unsigned location = 0;
    unsigned counter = 0;

    do {
          //location += offset[location]++;            // simple version
          // >=3 conditional version below

        int off = offset[location];

        offset[location] += (off>=3) ? -1 : 1;       // branchy with gcc
        // offset[location] = (off>=3) ? off-1 : off+1;  // branchless with gcc and clang.  

        location += off;

        counter++;
    } while (location < size);

    return counter;
}

#include <iostream>
#include <iterator>
#include <vector>

int main()
{
    std::ios::sync_with_stdio(false);     // makes cin faster
    std::istream_iterator<int> begin(std::cin), dummy;
    std::vector<int> values(begin, dummy);   // construct a dynamic array from reading stdin

    unsigned count = jumps(values.data(), values.size());
    std::cout << count << '\n';
}

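With clang4.0.1 -O3 -march=skylake, the inner loop is branchless; it uses a conditional move for the >=3 condition. I used ? : because that is what I was hoping the compiler would do. Source + asm on the Godbolt compiler explorer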

.LBB1_4:                                # =>This Inner Loop Header: Depth=1
    mov     ebx, edi               ; silly compiler: extra work inside the loop to save code outside
    mov     esi, dword ptr [rax + 4*rbx]  ; off = offset[location]
    cmp     esi, 2
    mov     ecx, 1
    cmovg   ecx, r8d               ; ecx = (off>=3) ? -1 : 1;  // r8d = -1 (set outside the loop)
    add     ecx, esi               ; off += -1 or 1
    mov     dword ptr [rax + 4*rbx], ecx  ; store back the updated off
    add     edi, esi               ; location += off  (original value)
    add     edx, 1                 ; counter++
    cmp     edi, r9d
    jb      .LBB1_4                ; unsigned compare against array size

Here is the output of perf stat ./a.out < input.txt (for the clang version), on my i7-6700k Skylake CPU:

21841249        # correct total, matches Python

 Performance counter stats for './a.out':

         36.843436      task-clock (msec)         #    0.997 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               119      page-faults               #    0.003 M/sec                  
       143,680,934      cycles                    #    3.900 GHz                    
       245,059,492      instructions              #    1.71  insn per cycle         
        22,654,670      branches                  #  614.890 M/sec                  
            20,171      branch-misses             #    0.09% of all branches        

       0.036953258 seconds time elapsed

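The average instructions per clock is much lower than 4 because of the data dependency in the loop. The load address for the next iteration depends on the load + add for this iteration, and out-of-order execution can't hide that. It can overlap all the work of updating the value of the current location, though.

Changing from int to short has no performance downside (as expected; movsx has the same latency as mov on Skylake), but it halves the memory consumption, so a larger array could fit in L1D cache if needed.

I tried compiling the array into the program (as int offsets[] = { file contents with commas added };) so it doesn't have to be read and parsed. That also made the size a compile-time constant. This reduced the run time to ~36.2 +/- 0.1 milliseconds, down from ~36.8, so the version that reads from a file was still spending most of its time on the actual problem, not on parsing the input. (Unlike Python, C++ has negligible startup overhead, and my Skylake CPU ramps up to max clock speed in well under a millisecond thanks to hardware P-state management in Skylake.)

As described earlier, pointer-chasing with a simple addressing mode like [rdi] instead of [rdi + rdx*4] has 1c lower latency, and avoids the add (index += offset becomes current = target). Intel since IvyBridge has zero-latency integer mov between registers, so that doesn't lengthen the critical path. Here is the source (with comments) + asm for this hacky optimization. A typical run (with text parsing into a std::vector): 23.26 +- 0.05 ms, 90.725 M cycles (3.900 GHz), 288.724 M instructions (3.18 IPC). Interestingly, it is actually more total instructions, but it runs much faster because of the lower latency of the loop-carried dependency chain.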
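The index-to-target transformation can be illustrated in Python, although only as a statement of the algorithm: the answer's linked source is not reproduced here, so this is a reconstruction of the idea rather than that code, and it is not expected to run faster under CPython. `count_steps_targets` is a hypothetical name. Each slot stores the absolute destination instead of a relative offset, so advancing is a plain load (`current = target`) and the offset is recomputed off the critical path.

```python
def count_steps_targets(offsets):
    """Count jumps using absolute targets instead of relative offsets."""
    n = len(offsets)
    # targets[i] is where a jump from slot i lands: i + offsets[i].
    targets = [i + off for i, off in enumerate(offsets)]
    current = 0
    steps = 0
    while 0 <= current < n:
        nxt = targets[current]
        off = nxt - current  # recover the relative offset for the >= 3 test
        # Adjusting the offset by +/-1 moves the target by the same amount.
        targets[current] = nxt - 1 if off >= 3 else nxt + 1
        current = nxt        # the only step the next iteration depends on
        steps += 1
    return steps


if __name__ == "__main__":
    print(count_steps_targets([0, 3, 0, 1, -3]))  # 10, same as the offset form
```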

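gcc uses a branch, and it is about 2x slower. (14% of branches are mispredicted, according to perf stat on the whole program. It only branches as part of updating the values, but branch misses stall the pipeline, so they affect the critical path too, in a way that data dependencies don't here. This seems like a poor decision by the optimizer.)

Rewriting the conditional as offset[location] = (off>=3) ? off-1 : off+1; convinces gcc to go branchless, with asm that looks good.

gcc7.1.1 -O3 -march=skylake (for the original source, which compiles with a branch for (off>=3) ? -1 : 1).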

Performance counter stats for './ec-gcc':

     70.032162      task-clock (msec)         #    0.998 CPUs utilized          
             0      context-switches          #    0.000 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
           118      page-faults               #    0.002 M/sec                  
   273,115,485      cycles                    #    3.900 GHz                    
   255,088,412      instructions              #    0.93  insn per cycle         
    44,382,466      branches                  #  633.744 M/sec                  
     6,230,137      branch-misses             #   14.04% of all branches        

   0.070181924 seconds time elapsed

Compare with CPython (Python 3.6.2 on Arch Linux):

perf stat python ./orig-v2.e.py
21841249

 Performance counter stats for 'python ./orig-v2.e.py':

       3046.703831      task-clock (msec)         #    1.000 CPUs utilized          
                10      context-switches          #    0.003 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               923      page-faults               #    0.303 K/sec                  
    11,880,130,860      cycles                    #    3.899 GHz                    
    38,731,286,195      instructions              #    3.26  insn per cycle         
     8,489,399,768      branches                  # 2786.421 M/sec                  
        18,666,459      branch-misses             #    0.22% of all branches        

       3.046819579 seconds time elapsed
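
Dividing each cycle count by the number of jumps gives a rough cost per step for every run quoted above. This sketch uses only figures from the perf stat outputs and the pointer-chasing run; the totals include startup and input parsing, so the results slightly overstate the cost of the loop itself. `cycles_per_step` and the labels in `runs` are hypothetical names.

```python
STEPS = 21_841_249  # step count printed by every run above

# Total cycles reported for each run.
runs = {
    "clang, indexed": 143_680_934,
    "clang, pointer-chasing": 90_725_000,
    "gcc, branchy": 273_115_485,
    "CPython 3.6.2": 11_880_130_860,
}


def cycles_per_step(cycles, steps=STEPS):
    """Average CPU cycles spent per jump."""
    return cycles / steps


if __name__ == "__main__":
    for name, cycles in runs.items():
        print(f"{name}: {cycles_per_step(cycles):.2f} cycles per step")
    # Wall-clock ratio between CPython and the clang build (task-clock, msec).
    print(f"speedup: {3046.703831 / 36.843436:.1f}x")
```

The two clang builds come out at roughly 6.6 and 4.2 cycles per step, close to the 6-cycle and 4-cycle dependency-chain latencies described in the answer, while CPython spends several hundred cycles on each step.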

I don't have PyPy or other Python implementations installed, sorry.

Regarding "python - PyPy is 17x faster than Python. Can Python be sped up?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/47666565/
