I wanted to check the LLVM IR that Numba generates for a vector addition and noticed it emits a lot of IR even for a simple add. I was hoping for a simple "add" instruction, but it generates about 2000 lines of LLVM IR. Is there a way to get minimal code?
from numba import jit
import numpy as np

@jit(nopython=True, nogil=True)
def mysum(a, b):
    return a + b

a, b = 1.3 * np.ones(5), 2.2 * np.ones(5)
mysum(a, b)

# Get the LLVM IR
llvm_ir = list(mysum.inspect_llvm().values())[0]
print(llvm_ir)
with open("llvm_ir.ll", "w") as file:
    file.write(llvm_ir)

# Get the assembly code
asm = list(mysum.inspect_asm().values())[0]
print(asm)
with open("llvm_ir.asm", "w") as file:
    file.write(asm)
Can you please share the IR? I imagine it has to do with handling different types of objects, as it cannot guarantee the inputs are integers.
Best answer
Numba generates 3 functions. The first one does the actual computation. The second one is a wrapping function meant to be called from CPython: it converts CPython dynamic objects to native types for the input values and does the opposite operation for the returned values. The last one is meant to be called from other Numba functions (if any).
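As a side note (not from the original answer), here is a minimal sketch that lists those functions in the IR dumped by the question's code; the wrapper names are Numba-internal and the module may also contain a few extra helpers depending on the Numba version:

# A minimal sketch: list the functions defined in the IR string from the question.
# Wrapper names are Numba-internal and vary between versions.
llvm_ir = list(mysum.inspect_llvm().values())[0]
defines = [line for line in llvm_ir.splitlines() if line.startswith("define")]
print(len(defines), "function definitions in the module")
for line in defines:
    print(line)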
Converting Numpy arrays is not a trivial task (Numpy arrays are dynamic objects containing a bunch of information like a memory buffer, the number of dimensions, the stride and size along each dimension, the dynamic Numpy type, etc.). This is why the code is significantly bigger with Numpy arrays than with simpler data types like floating-point values. Indeed, the whole LLVM IR is about 20 times smaller in that case and the wrapping function is very simple.
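To make that comparison concrete, here is a minimal sketch (the mysum_scalar name is made up for illustration) that compiles a scalar-only version of the same addition and compares IR line counts; the exact ratio depends on the Numba/LLVM version:

# A minimal sketch: compare the IR size of the array version from the question
# with a scalar-only version of the same addition (mysum_scalar is a made-up name).
from numba import jit

@jit(nopython=True, nogil=True)
def mysum_scalar(a, b):
    return a + b

mysum_scalar(1.3, 2.2)  # trigger compilation for (float64, float64)

scalar_ir = list(mysum_scalar.inspect_llvm().values())[0]
array_ir = list(mysum.inspect_llvm().values())[0]
print(len(array_ir.splitlines()), "IR lines (arrays) vs", len(scalar_ir.splitlines()), "IR lines (scalars)")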
Still, the main issue is not so much the wrapping function as the first one doing the actual computation (75% of the LLVM IR code). One reason is that a + b creates a new temporary Numpy array that must be allocated and filled using an implicit loop. This implicit operation generates more code than writing the loop manually, certainly because Numba needs to handle many possible cases that may never happen in practice. For example, the LLVM IR of the following Numba function is about half as large:
@jit('float64[::1](float64[::1], float64[::1])', nopython=True, nogil=True)
def mysum(a, b):
    out = np.empty(a.size, dtype=np.float64)
    for i in range(a.size):
        out[i] = a[i] + b[i]
    return out
If we remove the loop, the IR is again half the size. This shows that the Numpy array creation/initialization takes a significant fraction of the code space. The loop also takes a significant amount of space because Numba needs to support the wrap-around (negative indexing) feature of Numpy arrays, and also because Numpy arrays do not have a typed data buffer. In C, arrays and pointers are much simpler and there is no wrap-around.
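As a rough way to check that, here is a hedged sketch of the loop-free variant mentioned above (the myalloc name is made up): it keeps only the array allocation, so comparing its IR size with the previous function isolates the cost of the array creation/initialization:

# A minimal sketch: same signature, but only the output array allocation is kept
# (myalloc is a made-up name; exact IR sizes depend on the Numba/LLVM version).
from numba import jit
import numpy as np

@jit('float64[::1](float64[::1], float64[::1])', nopython=True, nogil=True)
def myalloc(a, b):
    return np.empty(a.size, dtype=np.float64)

alloc_ir = list(myalloc.inspect_llvm().values())[0]
print(len(alloc_ir.splitlines()), "IR lines for allocation only")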
Generating pretty huge IR/ASM code is common for high-level languages. The code is often big due to advanced features and weak code-size optimizations. Reducing the size of the generated code is significant work and it sometimes conflicts with performance. Indeed, to get high-performance code, compilers often need to unroll loops and split the code into different variants to mitigate the cost of higher-level features (e.g. pointer aliasing checks, vectorization, removal of wrap-around), resulting in significantly bigger IR/ASM code.
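If you want to see one of those variants directly, a hedged sketch (assuming an x86-64 target and the mysum dispatcher from the question): scanning the generated assembly for packed double-precision adds hints at an auto-vectorized version of the loop living alongside the scalar one.

# A minimal sketch, assuming an x86-64 target: packed double-precision adds
# (addpd/vaddpd) in the assembly hint at an auto-vectorized loop variant.
asm = list(mysum.inspect_asm().values())[0]
vector_adds = [line.strip() for line in asm.splitlines()
               if "addpd" in line or "vaddpd" in line]
print(len(vector_adds), "packed double-precision add instructions found")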