Vectorizing taking longer than loop

Reposted by: bug小助手 | Updated: 2023-10-25 22:00:47

I have a function that computes a Lorentzian given freq, fwhm, and amp. I want to vectorize it so that it does the computation for lists of freqs, fwhms, and amps:

import numpy as np


def lorz1(freq_series, freq, fwhm, amp):
    numerator   = fwhm
    denominator = (2*np.pi) * ((freq_series[:,None] - freq)**2 + fwhm**2/4)
    lor         = numerator / denominator
    main_peak   = amp*(lor/np.linalg.norm(lor, axis=0))
    return np.sum(main_peak, axis=1)


def lorz2(freq_series, freq, fwhm, amp):
    numerator   = fwhm[:,None]
    denominator = (2*np.pi) * ((freq_series - freq[:,None])**2 + fwhm[:,None]**2/4)
    lor         = numerator / denominator
    main_peak   = amp[:,None]*(lor/np.linalg.norm(lor, axis=1)[:,None])
    return np.sum(main_peak, axis=0)


def lorz3(freq_series, freq, fwhm, amp):
    numerator   = fwhm
    denominator = (2*np.pi) * ((freq_series - freq)**2 + fwhm**2/4)
    lor         = numerator / denominator
    main_peak   = amp*(lor/np.linalg.norm(lor))
    return main_peak


series = np.linspace(0,100,50000)
freq   = np.random.uniform(5,50,50)
fwhm   = np.random.uniform(0.01,0.05,50)
amps   = np.random.uniform(5,500,50)

Timing:

%timeit lorz1(series, freq, fwhm, amps)

38.4 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit lorz2(series, freq, fwhm, amps)

29.8 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.sum(np.array([lorz3(series, item1, item2, item3)
                         for (item1,item2,item3) in zip(freq, fwhm, amps)]), axis=0)

24.1 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Where am I going wrong with the vectorization in lorz1 and lorz2? Aren't they supposed to be faster than lorz3?

Comments

The result for np.sum(np.array([lorz3(... is different from the other two. Is that by design?

My bad, the results for the first two are supposed to be the same as the third. I did not realize they were different. Is that where the problem is? Assume the third is "correct" and I want to vectorize that operation instead of passing the arguments one by one in a for loop.
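A quick way to confirm that the three variants agree is an np.allclose comparison. The sketch below repeats the three functions from the question as posted; the fixed random seed and the names r1, r2, r3 are assumptions added for this example only.

```python
import numpy as np


def lorz1(freq_series, freq, fwhm, amp):
    numerator   = fwhm
    denominator = (2*np.pi) * ((freq_series[:,None] - freq)**2 + fwhm**2/4)
    lor         = numerator / denominator
    main_peak   = amp*(lor/np.linalg.norm(lor, axis=0))
    return np.sum(main_peak, axis=1)


def lorz2(freq_series, freq, fwhm, amp):
    numerator   = fwhm[:,None]
    denominator = (2*np.pi) * ((freq_series - freq[:,None])**2 + fwhm[:,None]**2/4)
    lor         = numerator / denominator
    main_peak   = amp[:,None]*(lor/np.linalg.norm(lor, axis=1)[:,None])
    return np.sum(main_peak, axis=0)


def lorz3(freq_series, freq, fwhm, amp):
    numerator   = fwhm
    denominator = (2*np.pi) * ((freq_series - freq)**2 + fwhm**2/4)
    lor         = numerator / denominator
    return amp*(lor/np.linalg.norm(lor))


rng    = np.random.default_rng(0)   # fixed seed: an assumption, for reproducibility
series = np.linspace(0, 100, 50000)
freq   = rng.uniform(5, 50, 50)
fwhm   = rng.uniform(0.01, 0.05, 50)
amps   = rng.uniform(5, 500, 50)

r1 = lorz1(series, freq, fwhm, amps)
r2 = lorz2(series, freq, fwhm, amps)
r3 = np.sum(np.array([lorz3(series, f, w, a)
                      for f, w, a in zip(freq, fwhm, amps)]), axis=0)

# All three should describe the same sum of normalized peaks.
print(np.allclose(r1, r2), np.allclose(r2, r3))
```

If either comparison fails, the timings are not comparing like with like.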

@Homer512 Fixed that problem.

My working hypothesis (untested): Something like freq_series - freq[:,None] generates an array of shape (50, 50000). That's about 19 MiB of memory. Your partially vectorized lorz3 probably fits into your CPU cache much better than the fully vectorized lorz2.
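The size estimate is easy to check. This small sketch uses the array sizes from the question and reads the footprint from nbytes; the variable names are illustrative only.

```python
import numpy as np

series = np.linspace(0, 100, 50000)
freq   = np.random.uniform(5, 50, 50)

diff = series - freq[:, None]          # broadcasts to shape (50, 50000)

size_mib    = diff.nbytes / 2**20      # float64: 8 bytes per element
per_row_kib = series.nbytes / 2**10    # one temporary in the lorz3-style loop

print(diff.shape, f"{size_mib:.1f} MiB for the full temporary")
print(f"{per_row_kib:.0f} KiB for a single row")
```

Each intermediate in lorz2 is a full temporary of this size, while each intermediate in lorz3 is only one row. Whether a row fits in L2 or L3 depends on the CPU.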

@RomanPerekhrest That's just folding the outer loop (in the last timeit statement) into the function. The total number of operations should be the same unless I'm missing something

Recommended answer

I did some further profiling using two versions:

Version 1:

#!/usr/bin/env python3

import numpy as np


def lorz2(freq_series, freq, fwhm, amp):
    numerator   = fwhm[:,None]
    denominator = (2*np.pi) * ((freq_series - freq[:,None])**2 + fwhm[:,None]**2/4)
    lor         = numerator / denominator
    main_peak   = amp[:,None]*(lor/np.linalg.norm(lor, axis=1)[:,None])
    return np.sum(main_peak, axis=0)


series = np.linspace(0,100,50000)
freq   = np.random.uniform(5,50,50)
fwhm   = np.random.uniform(0.01,0.05,50)
amps   = np.random.uniform(5,500,50)

for _ in range(100):
    lorz2(series, freq, fwhm, amps)

And version 2:

#!/usr/bin/env python3


import numpy as np


def lorz3(freq_series, freq, fwhm, amp):
    numerator   = fwhm
    denominator = (2*np.pi) * ((freq_series - freq)**2 + fwhm**2/4)
    lor         = numerator / denominator
    main_peak   = amp*(lor/np.linalg.norm(lor))
    return main_peak


series = np.linspace(0,100,50000)
freq   = np.random.uniform(5,50,50)
fwhm   = np.random.uniform(0.01,0.05,50)
amps   = np.random.uniform(5,500,50)


for _ in range(100):
    sum(lorz3(series, item1, item2, item3)
        for (item1,item2,item3) in zip(freq, fwhm, amps))

Notice how I changed the summation for lorz3 into a plain old Python sum. This is faster in my tests since it avoids constructing the temporary array.
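To make the difference between the summation styles concrete, here is a sketch that computes the same total three ways. The third variant, which accumulates in place into a preallocated array, is not from the answer. It is a hypothetical further step, and its speed is untested here.

```python
import numpy as np


def lorz3(freq_series, freq, fwhm, amp):
    numerator   = fwhm
    denominator = (2*np.pi) * ((freq_series - freq)**2 + fwhm**2/4)
    lor         = numerator / denominator
    return amp*(lor/np.linalg.norm(lor))


rng    = np.random.default_rng(0)   # fixed seed: an assumption, for reproducibility
series = np.linspace(0, 100, 50000)
freq   = rng.uniform(5, 50, 50)
fwhm   = rng.uniform(0.01, 0.05, 50)
amps   = rng.uniform(5, 500, 50)

# 1) Stack all peaks into a (50, 50000) array, then reduce.
stacked = np.sum(np.array([lorz3(series, f, w, a)
                           for f, w, a in zip(freq, fwhm, amps)]), axis=0)

# 2) Built-in sum: adds one (50000,) array at a time, no stacked array.
builtin = sum(lorz3(series, f, w, a) for f, w, a in zip(freq, fwhm, amps))

# 3) Hypothetical in-place accumulation into one preallocated result.
total = np.zeros_like(series)
for f, w, a in zip(freq, fwhm, amps):
    total += lorz3(series, f, w, a)

print(np.allclose(stacked, builtin), np.allclose(builtin, total))
```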

Here are the results of some profiling I did:

perf stat -ddd ./lorz2.py

 Performance counter stats for './lorz2.py':

           2729.16 msec task-clock:u                     #    1.000 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
            217114      page-faults:u                    #   79.554 K/sec                     
        8192141440      cycles:u                         #    3.002 GHz                         (38.43%)
        3178961202      instructions:u                   #    0.39  insn per cycle              (46.12%)
         426575242      branches:u                       #  156.303 M/sec                       (53.81%)
           2177628      branch-misses:u                  #    0.51% of all branches             (61.51%)
       42020185035      slots:u                          #   15.397 G/sec                       (69.20%)
         323473974      topdown-retiring:u               #      0.6% Retiring                   (69.20%)
       33616148028      topdown-bad-spec:u               #     67.1% Bad Speculation            (69.20%)
         371211166      topdown-fe-bound:u               #      0.7% Frontend Bound             (69.20%)
       15767347418      topdown-be-bound:u               #     31.5% Backend Bound              (69.20%)
         813550722      L1-dcache-loads:u                #  298.096 M/sec                       (69.19%)
         546814255      L1-dcache-load-misses:u          #   67.21% of all L1-dcache accesses   (69.21%)
          82889242      LLC-loads:u                      #   30.372 M/sec                       (69.22%)
          67633317      LLC-load-misses:u                #   81.59% of all LL-cache accesses    (69.24%)
   <not supported>      L1-icache-loads:u                                                     
           9705348      L1-icache-load-misses:u          #    0.00% of all L1-icache accesses   (30.81%)
         864895659      dTLB-loads:u                     #  316.909 M/sec                       (30.79%)
            117310      dTLB-load-misses:u               #    0.01% of all dTLB cache accesses  (30.78%)
   <not supported>      iTLB-loads:u                                                          
             85530      iTLB-load-misses:u               #    0.00% of all iTLB cache accesses  (30.76%)
   <not supported>      L1-dcache-prefetches:u                                                
   <not supported>      L1-dcache-prefetch-misses:u                                           

       2.729696014 seconds time elapsed

       1.932708000 seconds user
       0.796504000 seconds sys

And here is the faster version:

perf stat -ddd ./lorz3.py

 Performance counter stats for './lorz3.py':

            878.49 msec task-clock:u                     #    0.999 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
             52869      page-faults:u                    #   60.182 K/sec                     
        3704170220      cycles:u                         #    4.217 GHz                         (38.22%)
        3735225800      instructions:u                   #    1.01  insn per cycle              (45.96%)
         568575253      branches:u                       #  647.221 M/sec                       (53.70%)
           2580294      branch-misses:u                  #    0.45% of all branches             (61.43%)
       18355798588      slots:u                          #   20.895 G/sec                       (69.17%)
        3328525030      topdown-retiring:u               #     17.5% Retiring                   (69.17%)
        6982401815      topdown-bad-spec:u               #     36.6% Bad Speculation            (69.17%)
        1297505291      topdown-fe-bound:u               #      6.8% Frontend Bound             (69.17%)
        7459691283      topdown-be-bound:u               #     39.1% Backend Bound              (69.17%)
         858082535      L1-dcache-loads:u                #  976.773 M/sec                       (69.28%)
         430569310      L1-dcache-load-misses:u          #   50.18% of all L1-dcache accesses   (69.40%)
          15723297      LLC-loads:u                      #   17.898 M/sec                       (69.49%)
             73709      LLC-load-misses:u                #    0.47% of all LL-cache accesses    (69.50%)
   <not supported>      L1-icache-loads:u                                                     
          38705486      L1-icache-load-misses:u          #    0.00% of all L1-icache accesses   (30.72%)
         860276161      dTLB-loads:u                     #  979.270 M/sec                       (30.60%)
             86213      dTLB-load-misses:u               #    0.01% of all dTLB cache accesses  (30.51%)
   <not supported>      iTLB-loads:u                                                          
             91069      iTLB-load-misses:u               #    0.00% of all iTLB cache accesses  (30.50%)
   <not supported>      L1-dcache-prefetches:u                                                
   <not supported>      L1-dcache-prefetch-misses:u                                           

       0.878946776 seconds time elapsed

       0.852205000 seconds user
       0.026744000 seconds sys

Notice how the number of instructions is actually slightly higher in the faster code (about 3.7 billion vs. 3.2 billion), which makes sense since it is less vectorized, but the much higher instructions per cycle (1.01 vs. 0.39) make it faster overall. The slower version issues about five times as many LLC loads (82.9 million vs. 15.7 million), and most of them miss (81.59%), while in the faster version almost all of them hit (0.47% misses). I'm not sure how to interpret the topdown-bad-spec counter. Maybe someone else can comment on that.
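The derived figures can be recomputed from the raw counters. The sketch below copies the values from the two perf stat -ddd outputs above; the dictionary layout and the helper name derived are illustrative only. The counters were multiplexed (see the percentages in parentheses), so these are scaled estimates rather than exact counts.

```python
# Raw counter values copied from the perf stat outputs for lorz2.py and lorz3.py.
counters = {
    "lorz2": {"cycles": 8192141440, "instructions": 3178961202,
              "llc_loads": 82889242, "llc_load_misses": 67633317},
    "lorz3": {"cycles": 3704170220, "instructions": 3735225800,
              "llc_loads": 15723297, "llc_load_misses": 73709},
}


def derived(c):
    """Return instructions per cycle and the last-level-cache miss rate."""
    return {
        "ipc": c["instructions"] / c["cycles"],
        "llc_miss_rate": c["llc_load_misses"] / c["llc_loads"],
    }


for name, c in counters.items():
    d = derived(c)
    print(f"{name}: IPC = {d['ipc']:.2f}, LLC miss rate = {d['llc_miss_rate']:.2%}")

llc_ratio = counters["lorz2"]["llc_loads"] / counters["lorz3"]["llc_loads"]
print(f"LLC loads, lorz2 / lorz3: {llc_ratio:.1f}x")
```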

The CPU even clocks down (about 3.0 GHz vs. 4.2 GHz; this is reproducible), which supports the idea that it is simply waiting on memory.

Further, notice the sys time at the end of each output. lorz2 spends roughly 29% of its runtime (0.80 s of 2.73 s) in kernel space, compared with about 3% for lorz3. Since it doesn't do anything I/O-related, that is all memory allocation and deallocation overhead.
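One way to reduce the number of large temporaries is to reuse a single buffer through the out= arguments of the ufuncs. The function lorz2_inplace and its buf parameter below are hypothetical and not part of the original answer. This sketch only shows that the result matches lorz2. Whether it actually runs faster is untested and likely depends on the machine.

```python
import numpy as np


def lorz2(freq_series, freq, fwhm, amp):
    numerator   = fwhm[:,None]
    denominator = (2*np.pi) * ((freq_series - freq[:,None])**2 + fwhm[:,None]**2/4)
    lor         = numerator / denominator
    main_peak   = amp[:,None]*(lor/np.linalg.norm(lor, axis=1)[:,None])
    return np.sum(main_peak, axis=0)


def lorz2_inplace(freq_series, freq, fwhm, amp, buf=None):
    """Same result as lorz2, but with one reusable (peaks, samples) buffer."""
    if buf is None:
        buf = np.empty((freq.size, freq_series.size))
    np.subtract(freq_series, freq[:, None], out=buf)   # buf = x - f
    np.square(buf, out=buf)                            # buf = (x - f)**2
    buf += fwhm[:, None]**2 / 4                        # add fwhm**2 / 4 per row
    buf *= 2 * np.pi                                   # buf = denominator
    np.divide(fwhm[:, None], buf, out=buf)             # buf = fwhm / denominator
    # Row norms via einsum, which reduces without a full-size temporary.
    norms = np.sqrt(np.einsum("ij,ij->i", buf, buf))
    return (amp / norms) @ buf                         # weighted sum over peaks


rng    = np.random.default_rng(0)   # fixed seed: an assumption, for reproducibility
series = np.linspace(0, 100, 50000)
freq   = rng.uniform(5, 50, 50)
fwhm   = rng.uniform(0.01, 0.05, 50)
amps   = rng.uniform(5, 500, 50)

work = np.empty((freq.size, series.size))   # allocated once, reused across calls
ref  = lorz2(series, freq, fwhm, amps)
res  = lorz2_inplace(series, freq, fwhm, amps, buf=work)
print(np.allclose(ref, res))
```

Passing the same work buffer to every call keeps the allocation out of the loop, which is the overhead the sys time points at.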

We can look a bit further at the stall reasons:

perf stat -e cycles,l1d_pend_miss.l2_stall,cycle_activity.stalls_l3_miss ./lorz2.py

 Performance counter stats for './lorz2.py':

        8446540078      cycles:u                                                              
        1953955881      l1d_pend_miss.l2_stall:u                                              
        3050292324      cycle_activity.stalls_l3_miss:u                                       

       2.748141433 seconds time elapsed

       1.959570000 seconds user
       0.788443000 seconds sys

perf stat -e cycles,l1d_pend_miss.l2_stall,cycle_activity.stalls_l3_miss ./lorz3.py

 Performance counter stats for './lorz3.py':

        3674547216      cycles:u                                                              
         303870088      l1d_pend_miss.l2_stall:u                                              
          16939496      cycle_activity.stalls_l3_miss:u                                       

       0.869909182 seconds time elapsed

       0.848122000 seconds user
       0.021752000 seconds sys

So, the lorz2 version just stalls constantly on level 2 or 3 cache misses.

We can further look at a simple perf report:

perf record ./lorz2.py
perf report


# Overhead  Command  Shared Object                                      Symbol                                         
# ........  .......  .................................................  ...............................................
#
    32.44%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_multiply
    27.03%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_divide
    17.34%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_add
     6.93%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_subtract
     6.12%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_pairwise_sum
     5.32%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_square
     0.51%  python3  [unknown]                                          [k] 0xffffffffb2800fe7
     0.46%  python3  libpython3.11.so.1.0                               [.] _PyEval_EvalFrameDefault
     0.19%  python3  libpython3.11.so.1.0                               [.] unicodekeys_lookup_unicode
     0.18%  python3  libpython3.11.so.1.0                               [.] gc_collect_main
...

perf record ./lorz3.py
perf report

# Overhead  Command  Shared Object                                      Symbol                                       
# ........  .......  .................................................  .............................................
#
    27.56%  python3  libblas.so.3.11.0                                  [.] ddot_
    27.47%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_divide
     8.64%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_subtract
     8.34%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_add
     5.84%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_multiply
     3.70%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_square
     1.47%  python3  libpython3.11.so.1.0                               [.] _PyEval_EvalFrameDefault
     1.38%  python3  libgcc_s.so.1                                      [.] execute_cfa_program
     0.89%  python3  libgcc_s.so.1                                      [.] uw_update_context_1
     0.88%  python3  libgcc_s.so.1                                      [.] _Unwind_Find_FDE
     0.64%  python3  libgcc_s.so.1                                      [.] uw_frame_state_for
     0.61%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] ufunc_generic_fastcall
...

Huh, that's interesting. Where does the dot product come from? I assume this is how np.linalg.norm is implemented for simple vectors.
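For a real 1-D vector, the Euclidean norm equals the square root of the vector's dot product with itself, so a ddot_ entry in the profile is at least consistent with that assumption. The sketch below only checks the numerical identity. It does not show how NumPy implements norm internally.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed: an assumption, for reproducibility
x   = rng.uniform(0.0, 1.0, 50000)

norm_direct  = np.linalg.norm(x)        # default: 2-norm of the flattened input
norm_via_dot = np.sqrt(np.dot(x, x))    # the same quantity written as a dot product

print(np.isclose(norm_direct, norm_via_dot))
```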

Incidentally, we can speed up the lorz2 and lorz3 versions slightly via three measures:

1. Fold multiplication and summation into one matrix multiplication.
2. Reorder some operations to execute them on smaller arrays (or scalars).
3. Replace divisions with multiplications by the inverse.

def lorz2a(freq_series, freq, fwhm, amp):
    numerator   = fwhm[:,None] * (0.5 / np.pi)
    denominator = (freq_series - freq[:,None])**2 + 0.25 * fwhm[:,None]**2
    lor         = numerator / denominator
    return (amp / np.linalg.norm(lor, axis=1)) @ lor

    
def lorz3a(freq_series, freq, fwhm, amp):
    numerator   = fwhm * (0.5 / np.pi)
    denominator = (freq_series - freq)**2 + 0.25 * fwhm**2
    lor         = numerator / denominator
    main_peak   = amp / np.linalg.norm(lor) * lor
    return main_peak

This does not change the overall trend, however.
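The first measure is worth spelling out. Scaling each row by a weight and then summing over rows is a vector-matrix product, so the scaled (peaks, samples) temporary never has to be built. A minimal sketch with stand-in data follows; the array lor here is random and is not the actual Lorentzian.

```python
import numpy as np

rng = np.random.default_rng(0)              # fixed seed: an assumption
lor = rng.uniform(0.1, 1.0, (50, 1000))     # stand-in for the (peaks, samples) array
amp = rng.uniform(5, 500, 50)

# Measure 2: do the division on 50 values instead of on 50 * 1000 values.
weights = amp / np.linalg.norm(lor, axis=1)

explicit = np.sum(weights[:, None] * lor, axis=0)   # builds a scaled (50, 1000) temporary
folded   = weights @ lor                            # same sum as one vector-matrix product

print(np.allclose(explicit, folded))
```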

In conclusion

NumPy vectorization primarily helps by reducing per-call overhead. Once the arrays are large enough, we don't get much additional benefit from it, since the remaining interpreter overhead is small compared to the computation itself. At the same time, larger arrays reduce memory efficiency because the intermediate results no longer fit in cache. Typically there is a sweet spot somewhere around the L2 or L3 cache size. The lorz3 implementation hits this spot better than the others.
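One way to search for that sweet spot is to process the peaks in blocks: a block size of 1 behaves like the lorz3 loop, and a block size equal to the number of peaks behaves like lorz2a. The function lorz_blocked and its block parameter are hypothetical, and the best block size is machine-dependent and has to be measured.

```python
import numpy as np


def lorz_blocked(freq_series, freq, fwhm, amp, block=8):
    """Sum of normalized Lorentzians, computed `block` peaks at a time."""
    out = np.zeros_like(freq_series)
    for start in range(0, freq.size, block):
        f = freq[start:start + block, None]
        w = fwhm[start:start + block, None]
        a = amp[start:start + block]
        # Temporary of shape (<= block, samples) instead of (peaks, samples).
        lor = w * (0.5 / np.pi) / ((freq_series - f)**2 + 0.25 * w**2)
        out += (a / np.linalg.norm(lor, axis=1)) @ lor
    return out


rng    = np.random.default_rng(0)   # fixed seed: an assumption, for reproducibility
series = np.linspace(0, 100, 50000)
freq   = rng.uniform(5, 50, 50)
fwhm   = rng.uniform(0.01, 0.05, 50)
amps   = rng.uniform(5, 500, 50)

one_by_one = lorz_blocked(series, freq, fwhm, amps, block=1)    # lorz3-like
all_peaks  = lorz_blocked(series, freq, fwhm, amps, block=50)   # lorz2a-like
blocked    = lorz_blocked(series, freq, fwhm, amps, block=8)

print(np.allclose(one_by_one, all_peaks), np.allclose(all_peaks, blocked))
```

Timing a few block sizes with %timeit on the target machine would show where the temporaries stop fitting in cache.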

For a smaller series size and a larger size of the other arrays, we can expect lorz2 to perform better. For example this data set makes my lorz2a faster than my lorz3a:

series = np.linspace(0,100,1000)
freq   = np.random.uniform(5,50,2000)
fwhm   = np.random.uniform(0.01,0.05,2000)
amps   = np.random.uniform(5,500,2000)

NumPy's simple, eager evaluation scheme puts the onus of tuning for this on the user. Other libraries like NumExpr try to avoid this.

Comments

Yes, I've often noted that a few iterations (e.g. 5) on a complex task can be faster than the equivalent "whole array" code. I attribute it to the increased complexity of handling larger arrays, though others can explain it better in terms of memory layout, paging, and caching.

What a detailed answer! I highly appreciate it. Learnt quite a few things today. Thank you.
