python - 为什么a = [0]的list(x for a中的x)比a = []更快？-6ren

python - 为什么a = [0]的list(x for a中的x)比a = []更快？

转载作者：行者123 更新时间：2023-12-03 07:28:22

我使用三种不同的CPython版本测试了list(x for x in a)。在a = [0]上，它比在a = []上快得多:

 3.9.0 64-bit       3.9.0 32-bit       3.7.8 64-bit
a = []  a = [0]    a = []  a = [0]    a = []  a = [0]

465 ns  412 ns     543 ns  515 ns     513 ns  457 ns   
450 ns  406 ns     544 ns  515 ns     506 ns  491 ns   
456 ns  408 ns     551 ns  513 ns     515 ns  487 ns   
455 ns  413 ns     548 ns  516 ns     513 ns  491 ns   
452 ns  404 ns     549 ns  511 ns     508 ns  486 ns

使用 tuple而不是 list，这是预期的其他方法:

 3.9.0 64-bit       3.9.0 32-bit       3.7.8 64-bit
a = []  a = [0]    a = []  a = [0]    a = []  a = [0]

354 ns  405 ns     467 ns  514 ns     421 ns  465 ns   
364 ns  407 ns     467 ns  527 ns     425 ns  464 ns   
353 ns  399 ns     490 ns  549 ns     419 ns  465 ns   
352 ns  400 ns     500 ns  556 ns     414 ns  474 ns   
354 ns  405 ns     494 ns  560 ns     420 ns  474 ns

那么，为什么 list(和底层的生成器迭代器)必须执行更多操作，为什么它更快呢？
在Windows 10 Pro 2004 64位上进行了测试。
基准代码:

from timeit import repeat

setups = 'a = []', 'a = [0]'
number = 10**6

print(*setups, sep='   ')
for _ in range(5):
    for setup in setups:
        t = min(repeat('list(x for x in a)', setup, number=number)) / number
        print('%d ns' % (t * 1e9), end='   ')
    print()

字节大小，表明它不会为输入 []整体分配，但为输入 [0]整体分配:

>>> [].__sizeof__()
40
>>> list(x for x in []).__sizeof__()
40

>>> [0].__sizeof__()
48
>>> list(x for x in [0]).__sizeof__()
72

最佳答案

您观察到的是pymalloc(Python memory manager)比C运行时提供的内存管理器快。
在事件探查器中很容易看出，这两个版本之间的主要区别在于list_resize和_PyObjectRealloc需要更多的时间来处理a=[] -case。但为什么？
从迭代器创建新列表时，该列表会尝试to get a hint迭代器中有多少个元素:

n = PyObject_LengthHint(iterable, 8);

但是，此 doesn't work for generators以及提示是默认值 8。
迭代器用尽后，列表将尝试 to shrink，因为只有0或1个元素(由于大小提示过大而没有分配原始容量)。对于1个元素，这将导致(由于分配过多)4个元素的容量。但是，对于0个元素，有一个特殊的处理方法: not be over-allocated:

// ...
if (newsize == 0)
        new_allocated = 0;
num_allocated_bytes = new_allocated * sizeof(PyObject *);
items = (PyObject **)PyMem_Realloc(self->ob_item, num_allocated_bytes);
// ...

因此，在“空”情况下，将要求 PyMem_Realloc 0字节。该调用将通过 _PyObject_Malloc 传递到 pymalloc_alloc ，如果字节为0，则返回 NULL:

if (UNLIKELY(nbytes == 0)) {
   return NULL;
}

但是，如果 _PyObject_Malloc返回 pymalloc，则将 NULL falls back分配给“原始” malloc:

static void *
_PyObject_Malloc(void *ctx, size_t nbytes)
{
    void* ptr = pymalloc_alloc(ctx, nbytes);
    if (LIKELY(ptr != NULL)) {
        return ptr;
    }

    ptr = PyMem_RawMalloc(nbytes);
    if (ptr != NULL) {
        raw_allocated_blocks++;
    }
    return ptr;
}

从 definition of _PyMem_RawMalloc 可以很容易地看到:

static void *
_PyMem_RawMalloc(void *ctx, size_t size)
{
    /* PyMem_RawMalloc(0) means malloc(1). Some systems would return NULL
       for malloc(0), which would be treated as an error. Some platforms would
       return a pointer with no memory behind it, which would break pymalloc.
       To solve these problems, allocate an extra byte. */
    if (size == 0)
        size = 1;
    return malloc(size);
}

因此，案例 a=[0]将使用 pymalloc，而 a=[]将使用基础c运行时的内存管理器，这将说明观察到的差异。

现在，这一切都可以看作是错过的优化，因为对于 newsize=0，我们可以将 ob_item设置为 NULL，调整其他成员并返回。
让我们尝试一下:

static int
list_resize(PyListObject *self, Py_ssize_t newsize)
{
    // ...
    if (newsize == 0) {
        PyMem_Del(self->ob_item);
        self->ob_item = NULL;
        Py_SIZE(self) = 0;
        self->allocated = 0;
        return 0;
    }
    // ...
}

借助此修复程序，空大小写比预期的 a=[0]大小写要快一些(大约10％)。

我声称 pymalloc在较小的大小上比C运行时内存管理器更快，可以使用 bytes轻松测试:如果需要分配超过512个字节， pymalloc将回退为简单的 malloc:

print(bytes(479).__sizeof__())   #  512
%timeit bytes(479)               # 189 ns ± 20.4 ns
print(bytes(480).__sizeof__())   #  513
%timeit bytes(480)               # 296 ns ± 24.8 ns

实际差异超过了所显示的50％(无法通过仅将一个字节更改大小来解释这一跳跃)，因为至少有一部分时间用于初始化字节对象，依此类推。
这是在cython的帮助下进行的更直接的比较:

%%cython
from libc.stdlib cimport malloc, free
from cpython cimport PyMem_Malloc, PyMem_Del

def with_pymalloc(int size):
    cdef int i
    for i in range(1000):
        PyMem_Del(PyMem_Malloc(size))
        
def with_cmalloc(int size):
    cdef int i
    for i in range(1000):
        free(malloc(size))

现在

%timeit with_pymalloc(1)   #  15.8 µs ± 566 ns
%timeit with_cmalloc(1)    #  51.9 µs ± 2.17 µs

即 pymalloc快大约3倍(或每个分配大约35ns)。注意: some compilers would optimize free(malloc(size))输出，但是 MSVC doesn't。
再举一个例子:前段时间，我已经通过pymalloc替换了默认的分配器，将其分配给导致 a speed up of factor 4的C++的 std::map。

为了分析，使用了以下脚本:

a=[0] # or a=[]
for _ in range(10000000):
    list(x for x in a)

以及在 Release模式下VisualStudio的内置性能分析器。 a=[0] -version需要6.6秒(在分析器中)，而 a=[]版本需要6.9秒(即慢5％)。在“修复”之后， a=[]仅需要5.8秒。
在 list_resize和 _PyObject_Realloc中花费的时间份额:

                     a=[0]          a=[]       a=[], fixed        
list_resize           3.5%          10.2%          3%
_PyObject_Realloc     3.2%           9.3%          1%

显然，每次运行之间存在差异，但是运行时间的差异很明显，可以解释观察到的时间差的大部分。
注意:每个 0.3分配的 10^7秒差异约为30ns-这个数字类似于我们为pymalloc和c-runtime分配之间的差异而获得的数值。

使用调试器验证以上内容时，必须意识到，在 Debug模式下，Python使用了pymalloc的调试版本，该版本将其他数据附加到所需的内存中，因此在调试版本中，永远不会要求pymalloc分配0字节，但是 0 bytes + debug-overhead，不会退回到 malloc。因此，要么在发布版本的 Debug模式下调试，要么在debug-build中切换到realease-pymalloc(可能有一个选项-我只是不知道，代码中的相关部分是 here和 here)。

关于python - 为什么a = [0]的list(x for a中的x)比a = []更快？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/64428208/