
Error compiling Cython file: Invalid types for `'>'` (`float64_t[::1]`, `float64_t`)




Context


I am attempting to cythonise a function to replace a filtering operation on an extremely large pandas.DataFrame (100,000,000 rows). Currently it compares values in the df columns (Z, T, M) against single values (Z1, Z2, T1, T2, M1, M2), but it will eventually be performed row-wise against 1D arrays of those values:



# Python
# df: pandas.DataFrame
# Z, T, M: columns of df, float64 values
# I: column of df, uint32 values
# Z1, Z2, T1, T2, M1, M2: float64 values

res = df.query(
    f"{Z1} < Z < {Z2} and {T1} < T < {T2} and {M1} < M < {M2}",
    engine="numexpr",
)[["T", "I", "M"]]

df.query(..., engine="numexpr") is already an improvement in speed over df.loc[(df['Z'] > Z1) & (df['Z'] < Z2) & ...]. I have even improved it further using the decorator @jit(nopython=False, cache=True, parallel=True).



Error


I am now attempting to refactor this for use with Cython. However, after many rounds of bugfixing and refinement I cannot get it to build using the following command (executed in a terminal in the same directory as module.pyx, setup.py and pyproject.toml):



python setup.py build_ext --inplace

I am now stuck with the following error which I can't resolve:



Error compiling Cython file: Invalid types for '>' (float64_t[::1], float64_t)

This is my first time cythonising code; I have read the relevant Cython documentation.




Files


setup.py


#!/usr/bin/env python

import numpy
from Cython.Build import build_ext, cythonize
from setuptools import Extension, setup

extensions = [Extension("module", ["module.pyx"])]

setup(
    ext_modules=cythonize(extensions),
    include_dirs=[numpy.get_include()],
    cmdclass={"build_ext": build_ext},
)

pyproject.toml


[build-system]
requires = ["setuptools", "wheel", "Cython"]

module.pyx


#!/usr/bin/env python
# cython: infer_types=True

import numpy as np

DTYPE = np.dtype(
    [('T', np.float64), ('I', np.uint32), ('M', np.float64)]
)

cimport cython
cimport numpy as np

ctypedef packed struct result_struct:
    np.float64_t T
    np.uint32_t I
    np.float64_t M

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef filter_function(
    np.ndarray[(
        np.uint32_t,
        np.uint32_t,
        np.uint32_t,
        np.uint32_t,
        np.float64_t,
        np.float64_t,
        np.float64_t,
        np.float64_t
    ), ndim=2] df,
    np.float64_t Z1,
    np.float64_t Z2,
    np.float64_t T1,
    np.float64_t T2,
    np.float64_t M1,
    np.float64_t M2
):
    # Types
    cdef np.uint32_t[:, ::1] I = df[3]
    cdef np.float64_t[:, ::1] Z = df[4]
    cdef np.float64_t[:, ::1] M = df[5]
    cdef np.float64_t[:, ::1] T = df[7]

    cdef Py_ssize_t i, n

    # Output array
    n = df.shape[0]
    cdef np.ndarray[result_struct, ndim=2] result = np.zeros((n, 3), dtype=DTYPE)

    cdef np.float64_t[:, ::1] result_T = result['T']
    cdef np.uint32_t[:, ::1] result_I = result['I']
    cdef np.float64_t[:, ::1] result_M = result['M']

    # Filter
    for i in range(n):
        if (Z[i] > Z1) and (Z[i] < Z2) and (T[i] > T1) and (T[i] < T2) and (M[i] > M1) and (M[i] < M2):
            result_T[i] = T[i]
            result_I[i] = I[i].astype(np.uint32)
            result_M[i] = M[i]

    # Remove rows with 0
    result = result[(result != 0).all(axis=1)]

    return result

I appreciate any help offered.



If there is a better way to achieve the task of filtering on values/1D arrays of values, I welcome suggestions.



Also, I'd like to check that my use of memoryviews and contiguous data is correct. Would I need to create a subset of df with the columns 'I', 'Z', 'M', 'T' so that 'T' is contiguous, or, if it is a memoryview of df, does it not matter? (df.to_numpy() is called before the result is supplied as an argument to filter_function().)

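For reference, a quick way to check this directly is NumPy's contiguity flags (a minimal sketch, independent of the Cython code):

import numpy as np
import pandas as pd

df = pd.DataFrame({"Z": [1.0, 2.0], "T": [3.0, 4.0]})

# A single column pulled out on its own is a 1D array and is contiguous.
z = df["Z"].to_numpy()
print(z.flags["C_CONTIGUOUS"])  # True

# A column sliced out of the 2D df.to_numpy() block is strided, so with
# more than one column it is NOT contiguous and won't bind to a [::1] view.
block = df.to_numpy()
t = block[:, 1]
print(t.flags["C_CONTIGUOUS"])  # False here (two columns)
t = np.ascontiguousarray(t)     # forces a contiguous copy if needed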


Answer

I wasn't able to get the structured array approach to compile. (The reported error itself points at the comparisons in the loop: df is typed as two-dimensional, so Z[i] is a 1D slice of type float64_t[::1], which cannot be compared with the scalar float64_t Z1.) However, I didn't spend a lot of time on that, as my understanding is that converting from a structured array to a DataFrame requires copying the array, because Pandas doesn't support NumPy's structured arrays. Instead, I treated each column as a separate array.



This code manipulates the DataFrame a little, so it does interact with Python a bit, but I kept that out of the loop. It ended up being about 3x faster than the naive Pandas approach.

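For reference, the "naive Pandas approach" compared against here is presumably the plain boolean-mask filter from the question; a sketch of that baseline:

# Boolean-mask baseline, as in the question's df.loc pattern.
df_naive = df.loc[
    (df["Z"] > Z1) & (df["Z"] < Z2)
    & (df["T"] > T1) & (df["T"] < T2)
    & (df["M"] > M1) & (df["M"] < M2),
    ["T", "I", "M"],
]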


I'm able to get this to work with the following changes:



module.pyx:



import numpy as np
import pandas as pd
cimport numpy as cnp
cimport cython
cnp.import_array()

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef filter_df_cy_no_pd(df, double Z1, double Z2, double T1, double T2, double M1, double M2):
    # Get a NumPy array for each column
    cdef cnp.float64_t[::1] Z = df['Z'].values
    cdef cnp.float64_t[::1] T = df['T'].values
    cdef cnp.float64_t[::1] M = df['M'].values
    cdef cnp.int32_t[::1] I = df['I'].values

    # Output arrays are one element longer than the input: the extra slot at
    # index n is a "burn" slot that rejected rows are written into, which
    # keeps the loop body branch-free.
    cdef Py_ssize_t n = Z.shape[0]
    T_out_arr = np.empty(n + 1, dtype='float64')
    cdef cnp.float64_t[::1] T_out = T_out_arr
    I_out_arr = np.empty(n + 1, dtype='int32')
    cdef cnp.int32_t[::1] I_out = I_out_arr
    M_out_arr = np.empty(n + 1, dtype='float64')
    cdef cnp.float64_t[::1] M_out = M_out_arr

    cdef cython.bint in_range = 0
    cdef Py_ssize_t out_idx = 0
    cdef Py_ssize_t burn_idx = n
    for i in range(n):
        in_range = (
            (Z1 < Z[i]) and (Z[i] < Z2) and
            (T1 < T[i]) and (T[i] < T2) and
            (M1 < M[i]) and (M[i] < M2)
        )
        # Accepted rows advance out_idx; rejected rows overwrite the burn slot.
        write_idx = out_idx if in_range else burn_idx

        T_out[write_idx] = T[i]
        I_out[write_idx] = I[i]
        M_out[write_idx] = M[i]

        out_idx += 1 if in_range else 0

    # Trim the outputs to the number of accepted rows (drops the burn slot).
    T_out_arr = T_out_arr[:out_idx]
    I_out_arr = I_out_arr[:out_idx]
    M_out_arr = M_out_arr[:out_idx]

    return pd.DataFrame({
        'T': T_out_arr,
        'I': I_out_arr,
        'M': M_out_arr,
    }, copy=False)

setup.py



#!/usr/bin/env python

import numpy
from Cython.Build import build_ext, cythonize
from setuptools import Extension, setup

extensions = [
    Extension(
        "module",
        ["module.pyx"],
        define_macros=[("NPY_NO_DEPRECATED_API", "NPY_1_7_API_VERSION")],
        extra_compile_args=["-O3"],
    )
]

setup(
    ext_modules=cythonize(extensions, language_level=3),
    include_dirs=[numpy.get_include()],
    cmdclass={"build_ext": build_ext},
)

Example test code:



import module
import numpy as np
import pandas as pd
import time
N = 1000000*100
df = pd.DataFrame({
    "Z": (np.random.rand(N) * 10).astype('float64'),
    "T": (np.random.rand(N) * 10).astype('float64'),
    "M": (np.random.rand(N) * 10).astype('float64'),
    "I": (np.random.rand(N) * 10).astype('int32'),
})
t0 = time.time()
df_filt = module.filter_df_cy_no_pd(df, 1, 9, 1, 9, 1, 9)
t = time.time()
print(f"Duration: {t - t0:.3f}")
print(df_filt)

Comments

Thank you for your help. I am able to compile it. However, there were 2 warnings: ignoring unknown compiler argument -O3; warning C4005: '__pyx_nonatomic_int_type': macro redefinition. I think the first is because my build environment is Windows, so the MSVC compiler lacks -O3 as an argument; I think the equivalent is -O2 from the documentation. Also, after editing, the compilation fails if the module is currently loaded. Is there a way around it other than restarting the IPython kernel/VS Code?


I don't specifically need to have the pandas df as the arg; I could use df['Z'].to_numpy() for each column and supply them separately as args if it would be quicker. I also don't mind if the output is a numpy.ndarray if that makes it quicker. If I am to supply arrays instead of individual values to replace Z1, Z2, T1, T2, M1, M2, how would this modify the for loop?

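This isn't answered in the thread, but as a rough sketch, only the comparison would change: each scalar bound becomes a 1D memoryview indexed by i. The function name and the one-bound-per-row assumption below are hypothetical:

# Hypothetical row-wise variant of filter_df_cy_no_pd: each bound is a 1D
# array with one entry per input row, rather than a scalar.
import numpy as np
import pandas as pd
cimport numpy as cnp
cimport cython
cnp.import_array()

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef filter_df_cy_rowwise(
    df,
    cnp.float64_t[::1] Z1, cnp.float64_t[::1] Z2,
    cnp.float64_t[::1] T1, cnp.float64_t[::1] T2,
    cnp.float64_t[::1] M1, cnp.float64_t[::1] M2,
):
    cdef cnp.float64_t[::1] Z = df['Z'].values
    cdef cnp.float64_t[::1] T = df['T'].values
    cdef cnp.float64_t[::1] M = df['M'].values
    cdef cnp.int32_t[::1] I = df['I'].values

    # Same burn-slot layout as the scalar version above.
    cdef Py_ssize_t n = Z.shape[0]
    T_out_arr = np.empty(n + 1, dtype='float64')
    cdef cnp.float64_t[::1] T_out = T_out_arr
    I_out_arr = np.empty(n + 1, dtype='int32')
    cdef cnp.int32_t[::1] I_out = I_out_arr
    M_out_arr = np.empty(n + 1, dtype='float64')
    cdef cnp.float64_t[::1] M_out = M_out_arr

    cdef cython.bint in_range = 0
    cdef Py_ssize_t out_idx = 0
    cdef Py_ssize_t burn_idx = n
    for i in range(n):
        # Only the comparison changes: index the bounds arrays per row.
        in_range = (
            (Z1[i] < Z[i]) and (Z[i] < Z2[i]) and
            (T1[i] < T[i]) and (T[i] < T2[i]) and
            (M1[i] < M[i]) and (M[i] < M2[i])
        )
        write_idx = out_idx if in_range else burn_idx
        T_out[write_idx] = T[i]
        I_out[write_idx] = I[i]
        M_out[write_idx] = M[i]
        out_idx += 1 if in_range else 0

    return pd.DataFrame({
        'T': T_out_arr[:out_idx],
        'I': I_out_arr[:out_idx],
        'M': M_out_arr[:out_idx],
    }, copy=False)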

Sorry, the method I'm using ends up restarting Python every time I want to compile/load the module, so I've never run into that.


Yes, that seems correct. I think -O2 is the equivalent on Windows.

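One way to select the flag per platform in the setup.py above (a sketch; it assumes MSVC on Windows and GCC/Clang elsewhere):

import sys

# MSVC does not recognise -O3; /O2 (maximise speed) is its usual equivalent.
opt_flags = ["/O2"] if sys.platform == "win32" else ["-O3"]

extensions = [
    Extension(
        "module",
        ["module.pyx"],
        define_macros=[("NPY_NO_DEPRECATED_API", "NPY_1_7_API_VERSION")],
        extra_compile_args=opt_flags,
    )
]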

The issue is that you need to convert the DataFrame into a NumPy representation at some point. You can do it inside the function or outside it; neither is faster overall. Moving it out makes the function itself faster, but only because you're pushing some of the work onto the caller. I did it inside the function because it makes the API simpler.

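To illustrate, pushing the conversion to the caller would just relocate the same .to_numpy() calls; filter_arrays_cy below is a hypothetical array-taking variant, not a function defined in this thread:

# Caller-side conversion: the same work the function does internally via
# df['Z'].values, just moved out of the function.
Z = df['Z'].to_numpy()
T = df['T'].to_numpy()
M = df['M'].to_numpy()
I = df['I'].to_numpy()

# Hypothetical array-taking signature (the in-thread function takes df directly):
# df_filt = module.filter_arrays_cy(Z, T, M, I, Z1, Z2, T1, T2, M1, M2)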
