Context
I am attempting to cythonise a function to replace a filtering operation on an extremely large pandas.DataFrame (100,000,000 rows). Currently it compares values in the df columns (Z, T, M) against single values (Z1, Z2, T1, T2, M1, M2), but eventually the comparison will be performed row-wise against 1D arrays of those values:
# Python
# df: pandas.DataFrame
# Z, T, M: columns of df, float64 values
# I: column of df, uint32 values
# Z1, Z2, T1, T2, M1, M2: float64 values
res = df.query(
    f"{Z1} < Z < {Z2} and {T1} < T < {T2} and {M1} < M < {M2}",
    engine="numexpr",
)[["T", "I", "M"]]
df.query(..., engine="numexpr") is already an improvement in speed over df.loc[(df['Z'] > Z1) & (df['Z'] < Z2) & ...]. I have even improved it further using the decorator @jit(nopython=False, cache=True, parallel=True).
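For reference, the two filtering forms compared above select the same rows. The following is a minimal sketch on a small synthetic frame; the column values and bounds are made up for illustration, and the default query engine is used so that it also runs without numexpr installed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000  # tiny stand-in for the 100,000,000-row frame
df = pd.DataFrame({
    "Z": rng.random(n) * 10,
    "T": rng.random(n) * 10,
    "M": rng.random(n) * 10,
    "I": rng.integers(0, 10, n).astype(np.uint32),
})
Z1, Z2, T1, T2, M1, M2 = 1.0, 9.0, 1.0, 9.0, 1.0, 9.0  # illustrative bounds

# Expression-string form (the bounds are formatted into the expression).
res_query = df.query(
    f"{Z1} < Z < {Z2} and {T1} < T < {T2} and {M1} < M < {M2}"
)[["T", "I", "M"]]

# Boolean-mask form.
mask = (
    (df["Z"] > Z1) & (df["Z"] < Z2)
    & (df["T"] > T1) & (df["T"] < T2)
    & (df["M"] > M1) & (df["M"] < M2)
)
res_loc = df.loc[mask, ["T", "I", "M"]]

print(res_query.equals(res_loc))
```

Any speed difference between the two forms depends on the engine and the data size, so it is worth timing both on the real frame.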
Error
I am now attempting to refactor this for use with Cython. However, after many rounds of bug fixing and refinement, I cannot get it to build using the following command (executed in the terminal in the same directory as module.pyx, setup.py and pyproject.toml):
python setup.py build_ext --inplace
I am now stuck with the following error which I can't resolve:
Error compiling Cython file: Invalid types for '>' (float64_t[::1], float64_t)
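The types in this message can be read in NumPy terms. The memoryviews in module.pyx below are declared as 2-D (`[:, ::1]`), so a single index such as `Z[i]` yields a 1-D slice (the `float64_t[::1]` in the message) rather than a scalar, and Cython has no `>` between a slice and a `float64_t`. The same indexing rule means `df[4]` on a 2-D array selects row 4, not column 4. A small NumPy-only sketch of that rule, with a made-up array standing in for `df.to_numpy()`:

```python
import numpy as np

table = np.zeros((5, 8), dtype=np.float64)  # hypothetical stand-in for df.to_numpy()

row = table[4]        # one index on a 2-D array gives a 1-D row, not a scalar
column = table[:, 4]  # a full column is also 1-D
value = column[0]     # one index on a 1-D array gives a scalar

print(row.ndim, column.ndim, np.ndim(value))  # prints: 1 1 0
```

Only the last form, a single index on a 1-D view, produces something that can be compared with a scalar bound.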
This is my first time cythonising code. I have read the following useful documentation:
Files
setup.py
#!/usr/bin/env python
import numpy
from Cython.Build import build_ext, cythonize
from setuptools import Extension, setup
extensions = [Extension("module", ["module.pyx"])]
setup(
    ext_modules=cythonize(extensions),
    include_dirs=[numpy.get_include()],
    cmdclass={"build_ext": build_ext},
)
pyproject.toml
[build-system]
requires = ["setuptools", "wheel", "Cython"]
module.pyx
#!/usr/bin/env python
# cython: infer_types=True
import numpy as np
DTYPE = np.dtype(
    [('T', np.float64), ('I', np.uint32), ('M', np.float64)]
)
cimport cython
cimport numpy as np
ctypedef packed struct result_struct:
    np.float64_t T
    np.uint32_t I
    np.float64_t M
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef filter_function(
    np.ndarray[(
        np.uint32_t,
        np.uint32_t,
        np.uint32_t,
        np.uint32_t,
        np.float64_t,
        np.float64_t,
        np.float64_t,
        np.float64_t
    ), ndim=2] df,
    np.float64_t Z1,
    np.float64_t Z2,
    np.float64_t T1,
    np.float64_t T2,
    np.float64_t M1,
    np.float64_t M2
):
    # Types
    cdef np.uint32_t[:, ::1] I = df[3]
    cdef np.float64_t[:, ::1] Z = df[4]
    cdef np.float64_t[:, ::1] M = df[5]
    cdef np.float64_t[:, ::1] T = df[7]
    cdef Py_ssize_t i, n
    # Output array
    n = df.shape[0]
    cdef np.ndarray[result_struct, ndim=2] result = np.zeros((n, 3), dtype=DTYPE)
    cdef np.float64_t[:, ::1] result_T = result['T']
    cdef np.uint32_t[:, ::1] result_I = result['I']
    cdef np.float64_t[:, ::1] result_M = result['M']
    # Filter
    for i in range(n):
        if (Z[i] > Z1) and (Z[i] < Z2) and (T[i] > T1) and (T[i] < T2) and (M[i] > M1) and (M[i] < M2):
            result_T[i] = T[i]
            result_I[i] = I[i].astype(np.uint32)
            result_M[i] = M[i]
    # Remove rows with 0
    result = result[(result != 0).all(axis=1)]
    return result
I appreciate any help offered.
If there is a better way to achieve the task of filtering on values/1D arrays of values, I welcome suggestions.
Also, I'd like to check that my use of memoryviews and contiguous data is correct. Would I need to create a subset of df with the columns 'I', 'Z', 'M', 'T' so that 'T' is contiguous, or does it not matter if it is a memoryview of df? (df.to_numpy() is used before supplying df as an argument to filter_function().)
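Whether a column is contiguous depends on the memory layout of the array it is sliced from, and this can be checked at run time instead of assumed. A minimal NumPy-only sketch with made-up arrays: a column taken from a row-major 2-D array is a strided view, while the same column taken from a column-major array is contiguous. A memoryview declared with `::1` requires a contiguous buffer, so a strided column would be rejected when it is assigned.

```python
import numpy as np

c_order = np.arange(12, dtype=np.float64).reshape(4, 3)  # row-major (C order)
f_order = np.asfortranarray(c_order)                     # column-major copy

col_c = c_order[:, 1]  # strided view: consecutive elements are 3 doubles apart
col_f = f_order[:, 1]  # contiguous view

print(col_c.flags["C_CONTIGUOUS"], col_f.flags["C_CONTIGUOUS"])  # prints: False True

# np.ascontiguousarray copies only when the input is not already contiguous.
packed = np.ascontiguousarray(col_c)
same = np.ascontiguousarray(col_f)
```

The layout that `df.to_numpy()` returns is a pandas implementation detail, so checking `.flags` on the actual array (or passing each column through `np.ascontiguousarray`) is safer than relying on the column order.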
Answer
I wasn't able to get the structured array approach to compile. However, I didn't spend a lot of time on that, as my understanding is that converting from a structured array to a DataFrame requires copying the array, as Pandas doesn't support NumPy's structured arrays. Instead, I treated each column as a separate array.
This code manipulates the DataFrame a little, so it does interact with Python a bit, but I kept that out of the loop. It ended up being about 3x faster than the naive Pandas approach.
I'm able to get this to work with the following changes:
module.pyx:
import numpy as np
import pandas as pd
cimport numpy as cnp
cimport cython
cnp.import_array()
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef filter_df_cy_no_pd(df, double Z1, double Z2, double T1, double T2, double M1, double M2):
    # Get NumPy array for each column
    cdef cnp.float64_t[::1] Z = df['Z'].values
    cdef cnp.float64_t[::1] T = df['T'].values
    cdef cnp.float64_t[::1] M = df['M'].values
    cdef cnp.int32_t[::1] I = df['I'].values
    cdef Py_ssize_t n = Z.shape[0]
    # Output buffers have one spare slot (index n) that absorbs rejected rows
    T_out_arr = np.empty(n + 1, dtype='float64')
    cdef cnp.float64_t[::1] T_out = T_out_arr
    I_out_arr = np.empty(n + 1, dtype='int32')
    cdef cnp.int32_t[::1] I_out = I_out_arr
    M_out_arr = np.empty(n + 1, dtype='float64')
    cdef cnp.float64_t[::1] M_out = M_out_arr
    cdef cython.bint in_range = 0
    cdef Py_ssize_t out_idx = 0
    cdef Py_ssize_t burn_idx = n
    # Declare the loop variables as C integers
    cdef Py_ssize_t i, write_idx
    for i in range(n):
        in_range = (
            (Z1 < Z[i]) and (Z[i] < Z2) and
            (T1 < T[i]) and (T[i] < T2) and
            (M1 < M[i]) and (M[i] < M2)
        )
        # Accepted rows go to the next free slot, rejected rows to the spare slot
        write_idx = out_idx if in_range else burn_idx
        T_out[write_idx] = T[i]
        I_out[write_idx] = I[i]
        M_out[write_idx] = M[i]
        out_idx += 1 if in_range else 0
    # Trim the buffers to the rows that were kept
    T_out_arr = T_out_arr[:out_idx]
    I_out_arr = I_out_arr[:out_idx]
    M_out_arr = M_out_arr[:out_idx]
    return pd.DataFrame({
        'T': T_out_arr,
        'I': I_out_arr,
        'M': M_out_arr,
    }, copy=False)
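The loop above writes on every iteration: accepted rows go to the next free output slot, and rejected rows go to a spare slot at index n that is discarded afterwards. The following pure-Python sketch shows the same bookkeeping on a single column so the indexing is easier to follow; `filter_with_burn_slot` and the sample values are hypothetical, and this version is for illustration only, not for speed.

```python
import numpy as np

def filter_with_burn_slot(values, lo, hi):
    """Keep values strictly between lo and hi, using one spare output slot."""
    n = values.shape[0]
    out = np.empty(n + 1, dtype=values.dtype)  # slot n is the discard slot
    out_idx = 0
    for i in range(n):
        in_range = lo < values[i] < hi
        # Always write; rejected values land in the discard slot at index n.
        write_idx = out_idx if in_range else n
        out[write_idx] = values[i]
        out_idx += 1 if in_range else 0
    return out[:out_idx]

values = np.array([0.5, 2.0, 9.5, 3.0, 1.0, 8.9])
kept = filter_with_burn_slot(values, 1.0, 9.0)
print(kept)  # the values 2.0, 3.0 and 8.9, in their original order
```

The result matches the boolean-mask expression `values[(values > lo) & (values < hi)]`, which is a convenient way to test the compiled function on a small input.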
setup.py
#!/usr/bin/env python
import numpy
from Cython.Build import build_ext, cythonize
from setuptools import Extension, setup
extensions = [
    Extension(
        "module",
        ["module.pyx"],
        define_macros=[("NPY_NO_DEPRECATED_API", "NPY_1_7_API_VERSION")],
        extra_compile_args=["-O3"],
    )
]
setup(
    ext_modules=cythonize(extensions, language_level=3),
    include_dirs=[numpy.get_include()],
    cmdclass={"build_ext": build_ext},
)
Example test code:
import module
import numpy as np
import pandas as pd
import time
N = 1000000*100
df = pd.DataFrame({
    "Z": (np.random.rand(N) * 10).astype('float64'),
    "T": (np.random.rand(N) * 10).astype('float64'),
    "M": (np.random.rand(N) * 10).astype('float64'),
    "I": (np.random.rand(N) * 10).astype('int32'),
})
t0 = time.time()
df_filt = module.filter_df_cy_no_pd(df, 1, 9, 1, 9, 1, 9)
t = time.time()
print(f"Duration: {t - t0:.3f}")
print(df_filt)
Comments
Thank you for your help. I am able to compile it. However, there were 2 warnings: ignoring unknown compiler argument -O3; warning C4005: '__pyx_nonatomic_int_type': macro redefinition. I think the first is because my build environment is Windows, so the MSVC compiler does not accept -O3 as an argument; from the documentation, I think the equivalent is -O2. Also, after editing, the compilation fails if the module is currently loaded. Is there a way around it other than restarting the IPython kernel/VS Code?
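One way to avoid the unknown-argument warning is to choose the optimisation flag in setup.py according to the platform. This is a sketch under the assumption that a Windows build uses MSVC (a MinGW build on Windows would want the GCC-style flag instead); `optimisation_flags` is a hypothetical helper, not part of setuptools or Cython.

```python
import sys

def optimisation_flags(platform=sys.platform):
    """Return compiler optimisation flags for the given platform string."""
    if platform == "win32":  # assumed to mean an MSVC build
        return ["/O2"]
    return ["-O3"]           # GCC/Clang-style flag

# Value to pass as Extension(..., extra_compile_args=extra_compile_args)
extra_compile_args = optimisation_flags()
```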
I don't specifically need to have the pandas df as the arg; I could use df['Z'].to_numpy() for each column and supply them separately as args if it would be quicker. I also don't mind if the output is a numpy.ndarray if that makes it quicker. If I am to supply arrays instead of individual values to replace Z1, Z2, T1, T2, M1, M2, how would this modify the for loop?
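One possible reading of this question, sketched in plain NumPy: assume each of Z1, Z2, T1, T2, M1, M2 becomes a 1-D array of the same length, where index k holds one set of bounds, and one filtered result is wanted per set. Under that assumption the existing filter is wrapped in an outer loop over k. `filter_many` and the sample data are hypothetical; in the Cython version the equivalent change would be an outer `for k in range(...)` around the loop over rows, with the bounds read as `Z1[k]`, `Z2[k]` and so on.

```python
import numpy as np

def filter_many(Z, T, M, I, Z1, Z2, T1, T2, M1, M2):
    """Return one (T, I, M) tuple of filtered arrays per set of bounds."""
    results = []
    for k in range(len(Z1)):  # outer loop over the sets of bounds
        mask = (
            (Z > Z1[k]) & (Z < Z2[k])
            & (T > T1[k]) & (T < T2[k])
            & (M > M1[k]) & (M < M2[k])
        )
        results.append((T[mask], I[mask], M[mask]))
    return results

# Made-up data: four rows and two sets of bounds.
Z = np.array([1.0, 2.0, 3.0, 4.0])
T = np.array([1.0, 2.0, 3.0, 4.0])
M = np.array([1.0, 2.0, 3.0, 4.0])
I = np.array([10, 20, 30, 40], dtype=np.uint32)
out = filter_many(
    Z, T, M, I,
    Z1=np.array([0.0, 2.5]), Z2=np.array([2.5, 5.0]),
    T1=np.array([0.0, 0.0]), T2=np.array([5.0, 5.0]),
    M1=np.array([0.0, 0.0]), M2=np.array([5.0, 5.0]),
)
print([r[1].tolist() for r in out])  # prints: [[10, 20], [30, 40]]
```

If the intended meaning is instead one pair of bounds per row of df, the bounds arrays would have the same length as the columns and the comparison would use index i rather than k.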
Sorry, the method I'm using ends up restarting Python every time I want to compile/load the module, so I've never run into that.
Yes, that seems correct. I think -O2 is the equivalent on Windows.
The issue is that you need to convert the DataFrame into a NumPy representation at some point. You can do it inside the function or outside the function, but neither is faster overall. Doing it outside makes the function itself faster, but only because you're pushing some of the work onto the caller. I did it inside the function because it makes the API simpler.