NumPy 用户的 Cython¶

注意

此页面使用两种不同的语法变体

Cython 特定的 cdef 语法，旨在使类型声明简洁明了，并易于从 C/C++ 的角度阅读。
纯 Python 语法，允许在纯 Python 代码中进行静态 Cython 类型声明，遵循 PEP-484 类型提示和 PEP 526 变量注释。

要在 Python 语法中使用 C 数据类型，您需要在要编译的 Python 模块中导入特殊的 cython 模块，例如
```
import cython
```
如果您使用纯 Python 语法，我们强烈建议您使用最新的 Cython 3 版本，因为与 0.29.x 版本相比，这里已经进行了重大改进。

本教程面向没有 Cython 经验的 NumPy 用户。如果您对 Cython 有所了解，您可能想跳过“高效索引”部分。

主要考虑的场景是 NumPy 的最终用户，而不是 NumPy/SciPy 的开发。原因是 Cython 还没有能够以高级方式支持针对高维数组的通用函数。对于 SciPy 开发来说，这种限制比更具体的“最终用户”函数要严重得多。有关此方面的更多信息，请参见最后一节。

本教程的风格并不适合所有人，因此您也可以考虑

Kurt Smith 的 2015 年 SciPy 大会上的 Cython 视频教程。该演讲的幻灯片和笔记本在 github 上。
Cython 基本文档（参见 Cython 首页）。

Cython 概述¶

Cython 是一种编译器，它将类似 Python 的代码文件编译成 C 代码。但是，‘’Cython 不是 Python 到 C 的转换器’'。也就是说，它不会获取您的整个程序并将其“转换为 C”——相反，结果充分利用了 Python 运行时环境。您可以这样理解，您的代码仍然是 Python，因为它在 Python 运行时环境中运行，但它不是编译成解释的 Python 字节码，而是编译成本地机器代码（但增加了额外的语法，以便轻松嵌入更快的 C 类代码）。

这有两个重要的后果

速度。尽管速度取决于所涉及的程序，但速度会有很大差异。典型的 Python 数值程序往往会获得很少的收益，因为大部分时间都花在了以高级方式使用底层 C 代码上。但是，当添加类型信息时，for 循环式程序可以获得数量级的速度提升（并且因此成为一种现实的替代方案）。
轻松调用 C 代码。Cython 的目的之一是允许轻松包装 C 库。在 Cython 中编写代码时，您可以像调用 Python 代码一样轻松地调用 C 代码。

很少有 Python 结构尚未得到支持，尽管将所有 Python 代码编译成 Cython 是一个既定的目标，您可以在限制中看到与 Python 的区别。

您的 Cython 环境¶

使用 Cython 包括以下步骤

编写一个 .pyx 源文件
运行 Cython 编译器生成 C 文件
运行 C 编译器生成编译后的库
运行 Python 解释器并要求它导入该模块

但是，有几种选项可以自动执行这些步骤

SAGE 数学软件系统为从交互式命令行或通过笔记本界面（如 Maple/Mathematica）使用 Cython 和 NumPy 提供了出色的支持。参见此文档。
Cython 可以用作 Jupyter 笔记本中的扩展，使编译和使用 Cython 代码变得容易，只需在单元格顶部添加 %%cython。有关更多信息，请参阅使用 Jupyter 笔记本。
Cython 附带 pyximport 的一个版本，因此您可以将 pyx 文件动态导入 Python 并自动编译它们（请参阅使用 pyximport 编译）。
Cython 支持 setuptools，因此您可以非常轻松地创建构建脚本来自动化此过程，这是 Cython 实现的库和包的首选方法。请参阅基本 setup.py。
手动编译（见下文）

注意

如果使用除 SAGE 之外的其他交互式命令行环境，例如 IPython 或 Python 本身，则在重新编译模块时重新启动该过程非常重要。再次发出“import”语句是不够的。

安装¶

如果您已经拥有 C 编译器，只需执行

pip install Cython

否则，请参阅安装页面。

截至撰写本文时，SAGE 附带的 Cython 版本比本教程所需的版本旧。因此，如果您使用 SAGE，您应该下载最新的 Cython，然后执行

$ cd path/to/cython-distro
$ path-to-sage/sage -python setup.py install

这将把最新的 Cython 安装到 SAGE 中。

编译¶

手动编译¶

由于了解正在发生的事情始终很重要，因此我将在下面介绍手动方法。首先运行 Cython

$ cython yourmod.pyx

这将创建 yourmod.c，它是 Python 扩展模块的 C 源代码。一个有用的附加开关是 -a，它将生成一个文档 yourmod.html)，显示 Cython 代码逐行转换为哪个 C 代码。

然后我们编译 C 文件。这可能因您的系统而异，但 C 文件的构建方式应与 Python 的构建方式相同。Python 编写扩展的文档应该包含一些详细信息。在 Linux 上，这通常意味着类似以下内容

$ gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I/usr/include/python2.7 -o yourmod.so yourmod.c

gcc 应该能够访问 NumPy C 头文件，因此如果它们没有安装在 /usr/include/numpy 或类似位置，您可能需要为它们传递另一个选项。您只需要在 Cython 代码中编写以下内容时才需要提供 NumPy 头文件

cimport numpy

。

这将在同一目录中创建 yourmod.so，它可以通过使用正常的 import yourmod 语句由 Python 导入。

使用 setuptools 编译¶

Setuptools 允许我们创建 setup.py 文件来自动编译 Cython 文件和生成的 C 文件。

from setuptools import Extension, setup
from Cython.Build import cythonize
import numpy

extensions = [
    Extension("*", ["*.pyx"],
        include_dirs=[numpy.get_include()]),
]
setup(
    name="My hello app",
    ext_modules=cythonize(extensions),
)

NumPy 头文件的路径通过 include_dirs=[numpy.get_include()] 参数传递给 C 编译器。

注意

使用内存视图或使用 import numpy 导入 NumPy 并不意味着您必须添加 NumPy 包含文件的路径。只有在使用 cimport numpy 时才需要添加此路径。

尽管如此，您仍然可能会从编译器收到以下警告，因为 Cython 不会禁用旧的已弃用 Numpy API 的使用

.../include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]

在 Cython 3.0 中，您可以通过在构建中将 C 宏 NPY_NO_DEPRECATED_API 定义为 NPY_1_7_API_VERSION 来消除此警告，例如

# distutils: define_macros=NPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION

或（见下文）

Extension(
    ...,
    define_macros=[("NPY_NO_DEPRECATED_API", "NPY_1_7_API_VERSION")],
)

在较旧的 Cython 版本中，设置此宏将导致 C 编译失败，因为 Cython 生成的代码使用此已弃用的 C-API。但是，即使在最近的 NumPy 版本中，警告也没有负面影响。您可以忽略它，直到您（或您的库的用户）切换到删除此长期弃用 API 的较新 NumPy 版本，在这种情况下，您还需要使用 Cython 3.0 或更高版本。因此，您越早切换到 Cython 3.0，对您的用户就越好。

第一个 Cython 程序¶

您可以通过下载 Jupyter 笔记本来轻松执行本教程的代码。

下面的代码执行了 numpy 中此函数的等效操作

def compute_np(array_1, array_2, a, b, c):
    return np.clip(array_1, 2, 10) * a + array_2 * b + c

我们假设 array_1 和 array_2 是整数类型的二维 NumPy 数组，而 a、b 和 c 是三个 Python 整数。

此函数使用 NumPy，并且已经非常快，因此再次使用 Cython 可能有点过分。这是为了演示目的。尽管如此，我们将展示在更详细的代价下，我们实现了比 NumPy 更好的速度和内存效率。

此代码计算函数，其中两个维度上的循环被展开。它既是有效的 Python 代码，也是有效的 Cython 代码。我将它同时称为 compute_py.py（Python 版本）和 compute_cy.pyx（Cython 版本）——Cython 使用 .pyx 作为其文件后缀（但它也可以编译 .py 文件）。

import numpy as np


def clip(a, min_value, max_value):
    return min(max(a, min_value), max_value)


def compute(array_1, array_2, a, b, c):
    """
    This function must implement the formula
    np.clip(array_1, 2, 10) * a + array_2 * b + c

    array_1 and array_2 are 2D.
    """
    x_max = array_1.shape[0]
    y_max = array_1.shape[1]

    assert array_1.shape == array_2.shape

    result = np.zeros((x_max, y_max), dtype=array_1.dtype)

    for x in range(x_max):
        for y in range(y_max):
            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result[x, y] = tmp + c

    return result

这应该编译为生成 compute_cy.so（用于 Linux 系统）（在 Windows 系统上，这将是一个 .pyd 文件）。我们运行一个 Python 会话来测试 Python 版本（从 .py 文件导入）和编译后的 Cython 模块。

In [1]: import numpy as np
In [2]: array_1 = np.random.uniform(0, 1000, size=(3000, 2000)).astype(np.intc)
In [3]: array_2 = np.random.uniform(0, 1000, size=(3000, 2000)).astype(np.intc)
In [4]: a = 4
In [5]: b = 3
In [6]: c = 9
In [7]: def compute_np(array_1, array_2, a, b, c):
   ...:     return np.clip(array_1, 2, 10) * a + array_2 * b + c
In [8]: %timeit compute_np(array_1, array_2, a, b, c)
103 ms ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: import compute_py
In [10]: compute_py.compute(array_1, array_2, a, b, c)
1min 10s ± 844 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: import compute_cy
In [12]: compute_cy.compute(array_1, array_2, a, b, c)
56.5 s ± 587 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

目前还没有那么大的区别；因为 C 代码仍然完全按照 Python 解释器执行（例如，这意味着为每个使用的数字分配一个新对象）。

您可以通过在从命令行调用 Cython 时使用 -a，在使用 Jupyter Notebook 时使用 %%cython -a，或者在使用 setup.py 时使用 cythonize('compute_cy.pyx', annotate=True) 来查看 Python 交互和生成的 C 代码。查看生成的 html 文件，了解即使是最简单的语句也需要什么。您很快就会明白。我们需要给 Cython 提供更多信息；我们需要添加类型。

添加类型¶

为了添加类型，我们使用自定义的 Cython 语法，因此我们现在正在破坏 Python 源代码兼容性。阅读注释！

compute_typed.py¶

import numpy as np
import cython
# We now need to fix a datatype for our arrays. I've used the variable
# DTYPE for this, which is assigned to the usual NumPy runtime
# type info object.
DTYPE = np.intc

# @cython.cfunc means here that this function is a plain C function (so faster).
# To get all the benefits, we type the arguments and the return value.
@cython.exceptval(check=False)
@cython.cfunc
def clip(a: cython.int, min_value: cython.int, max_value: cython.int) -> cython.int:
    return min(max(a, min_value), max_value)


def compute(array_1, array_2, a: cython.int, b: cython.int, c: cython.int):

    # Annotation is also used within functions to type variables. It
    # can only be used at the top indentation level (there are non-trivial
    # problems with allowing them in other places, though we'd love to see
    # good and thought out proposals for it).
    x_max: cython.Py_ssize_t  = array_1.shape[0]
    y_max: cython.Py_ssize_t  = array_1.shape[1]

    assert array_1.shape == array_2.shape
    assert array_1.dtype == DTYPE
    assert array_2.dtype == DTYPE

    result = np.zeros((x_max, y_max), dtype=DTYPE)

    # It is very important to type ALL your variables. You do not get any
    # warnings if not, only much slower code (they are implicitly typed as
    # Python objects).
    # For the "tmp" variable, we want to use the same data type as is
    # stored in the array, so we use int because it correspond to np.intc.
    # NB! An important side-effect of this is that if "tmp" overflows its
    # datatype size, it will simply wrap around like in C, rather than raise
    # an error like in Python.

    tmp: cython.int

    # cython.Py_ssize_t is the proper C type for Python array indices.
    x: cython.Py_ssize_t
    y:  cython.Py_ssize_t

    for x in range(x_max):
        for y in range(y_max):

            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result[x, y] = tmp + c

    return result

compute_typed.pyx¶

import numpy as np

# We now need to fix a datatype for our arrays. I've used the variable
# DTYPE for this, which is assigned to the usual NumPy runtime
# type info object.
DTYPE = np.intc

# cdef means here that this function is a plain C function (so faster).
# To get all the benefits, we type the arguments and the return value.
cdef int clip(int a, int min_value, int max_value):
    return min(max(a, min_value), max_value)




def compute(array_1, array_2, int a, int b, int c):

    # The "cdef" keyword is also used within functions to type variables. It
    # can only be used at the top indentation level (there are non-trivial
    # problems with allowing them in other places, though we'd love to see
    # good and thought out proposals for it).
    cdef Py_ssize_t x_max = array_1.shape[0]
    cdef Py_ssize_t y_max = array_1.shape[1]

    assert array_1.shape == array_2.shape
    assert array_1.dtype == DTYPE
    assert array_2.dtype == DTYPE

    result = np.zeros((x_max, y_max), dtype=DTYPE)

    # It is very important to type ALL your variables. You do not get any
    # warnings if not, only much slower code (they are implicitly typed as
    # Python objects).
    # For the "tmp" variable, we want to use the same data type as is
    # stored in the array, so we use int because it correspond to np.intc.
    # NB! An important side-effect of this is that if "tmp" overflows its
    # datatype size, it will simply wrap around like in C, rather than raise
    # an error like in Python.

    cdef int tmp

    # Py_ssize_t is the proper C type for Python array indices.
    cdef Py_ssize_t x, y


    for x in range(x_max):
        for y in range(y_max):

            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result[x, y] = tmp + c

    return result

此时，请查看 compute_cy.pyx 和 compute_typed.pyx 生成的 C 代码。单击行以展开它们并查看相应的 C 代码。

尤其要注意 for-loops：在 compute_cy.c 中，这些是大约 20 行用于设置的 C 代码，而在 compute_typed.c 中，使用的是普通的 C for 循环。

在构建完此代码并继续进行我的（非常非正式的）基准测试后，我得到

In [13]: %timeit compute_typed.compute(array_1, array_2, a, b, c)
26.5 s ± 422 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

因此，添加类型确实使代码更快，但远没有达到 NumPy 的速度？

发生的事情是，此代码中花费的大部分时间都花在了以下几行上，而这些行的执行速度比纯 Python 慢

tmp = clip(array_1[x, y], 2, 10)
tmp = tmp * a + array_2[x, y] * b
result[x, y] = tmp + c

那么，是什么让这些行比纯 Python 版本慢得多呢？

array_1 和 array_2 仍然是 NumPy 数组，因此是 Python 对象，并期望 Python 整数作为索引。在这里，我们传递 C int 值。因此，每次 Cython 遇到此行时，它都必须将所有 C 整数转换为 Python int 对象。由于此行被频繁调用，因此它超过了从 range() 创建的纯 C 循环的速度优势。

此外，tmp * a + array_2[x, y] * b 返回一个 Python 整数，而 tmp 是一个 C 整数，因此 Cython 必须再次进行类型转换。最终，这些类型转换累加起来，使我们的计算速度非常慢。但是，这个问题可以通过使用内存视图轻松解决。

使用内存视图进行高效索引¶

仍然存在两个瓶颈降低了性能，一个是数组查找和赋值，另一个是 C/Python 类型转换。 [] 运算符仍然使用完整的 Python 操作 - 我们想要做的是以 C 速度直接访问数据缓冲区。

我们需要做的就是对 ndarray 对象的内容进行类型化。我们使用内存视图来实现这一点。在 Cython 文档中有一个专门介绍它的页面。

简而言之，内存视图是 C 结构，可以保存指向 NumPy 数组数据的指针以及所有必要的缓冲区元数据，以提供高效且安全的访问：维度、步长、项目大小、项目类型信息等。它们还支持切片，因此即使 NumPy 数组在内存中不连续，它们也能正常工作。它们可以用 C 整数索引，从而允许快速访问 NumPy 数组数据。

以下是声明整数内存视图的方法

foo: cython.int [:]        # 1D memoryview
foo: cython.int [:, :]     # 2D memoryview
foo: cython.int [:, :, :]  # 3D memoryview
...                      # You get the idea.

cdef int [:] foo         # 1D memoryview
cdef int [:, :] foo      # 2D memoryview
cdef int [:, :, :] foo   # 3D memoryview
...                      # You get the idea.

在我们的示例中，没有从 NumPy 数组复制数据到内存视图。顾名思义，它只是一个内存的“视图”。因此，我们可以使用视图 result_view 进行高效索引，并在最后返回包含我们操作数据的实际 NumPy 数组 result。

以下是在代码中使用它们的方法

compute_memview.py¶

import numpy as np
import cython

DTYPE = np.intc

@cython.cfunc
def clip(a: cython.int, min_value: cython.int, max_value: cython.int) -> cython.int:
    return min(max(a, min_value), max_value)


def compute(array_1: cython.int[:, :], array_2: cython.int[:, :],
        a: cython.int, b: cython.int, c: cython.int):

    x_max: cython.Py_ssize_t = array_1.shape[0]
    y_max: cython.Py_ssize_t = array_1.shape[1]

    # array_1.shape is now a C array, no it's not possible
    # to compare it simply by using == without a for-loop.
    # To be able to compare it to array_2.shape easily,
    # we convert them both to Python tuples.
    assert tuple(array_1.shape) == tuple(array_2.shape)

    result = np.zeros((x_max, y_max), dtype=DTYPE)
    result_view: cython.int[:, :] = result

    tmp: cython.int
    x: cython.Py_ssize_t
    y: cython.Py_ssize_t

    for x in range(x_max):
        for y in range(y_max):

            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result_view[x, y] = tmp + c

    return result

compute_memview.pyx¶

import numpy as np


DTYPE = np.intc


cdef int clip(int a, int min_value, int max_value):
    return min(max(a, min_value), max_value)


def compute(int[:, :] array_1, int[:, :] array_2, int a, int b, int c):


    cdef Py_ssize_t x_max = array_1.shape[0]
    cdef Py_ssize_t y_max = array_1.shape[1]

    # array_1.shape is now a C array, no it's not possible
    # to compare it simply by using == without a for-loop.
    # To be able to compare it to array_2.shape easily,
    # we convert them both to Python tuples.
    assert tuple(array_1.shape) == tuple(array_2.shape)

    result = np.zeros((x_max, y_max), dtype=DTYPE)
    cdef int[:, :] result_view = result

    cdef int tmp
    cdef Py_ssize_t x, y


    for x in range(x_max):
        for y in range(y_max):

            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result_view[x, y] = tmp + c

    return result

让我们看看现在访问速度有多快。

In [22]: %timeit compute_memview.compute(array_1, array_2, a, b, c)
22.9 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

请注意此更改的重要性。我们现在比 Python 的解释版本快 3081 倍，比 NumPy 快 4.5 倍。

内存视图也可以与切片一起使用，甚至可以与 Python 数组一起使用。查看内存视图页面，了解它们可以为您做什么。

进一步调整索引¶

数组查找仍然受到两个因素的减缓

执行边界检查。
对负索引进行检查并正确处理。上面的代码是显式编写的，因此它不使用负索引，并且它（希望）始终在边界内访问。

使用装饰器，我们可以停用这些检查

...
import cython
@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
def compute(array_1: cython.int[:, :], array_2: cython.int[:, :],
            a: cython.int, b: cython.int, c: cython.int):
...

...
cimport cython
@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
def compute(int[:, :] array_1, int[:, :] array_2, int a, int b, int c):
...

现在不执行边界检查（并且，作为副作用，如果您“确实”碰巧越界访问，您将在最佳情况下使程序崩溃，在最坏情况下会损坏数据）。可以通过多种方式切换边界检查模式，有关更多信息，请参见编译器指令。

In [23]: %timeit compute_index.compute(array_1, array_2, a, b, c)
16.8 ms ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

我们比 NumPy 版本快（6.2 倍）。NumPy 编写得非常好，但不会延迟执行操作，导致内存中出现大量中间复制操作。我们的版本非常节省内存，并且对缓存友好，因为我们可以在对数据进行单次运行中执行操作。

警告

速度是有代价的。特别是将类型化对象（如示例代码中的 array_1、array_2 和 result_view）设置为 None 可能很危险。将此类对象设置为 None 是完全合法的，但您只能对它们进行的操作是检查它们是否为 None。所有其他使用（属性查找或索引）都可能导致段错误或损坏数据（而不是像在 Python 中那样引发异常）。

实际规则稍微复杂一些，但主要信息很清楚：不要在不知道它们没有设置为 None 的情况下使用类型化对象。

将 NumPy 数组声明为连续¶

为了获得额外的速度提升，如果您知道提供的 NumPy 数组在内存中是连续的，您可以将内存视图声明为连续的。

我们以一个具有 3 个维度的数组为例。如果您想告诉 Cython 数据是 C 连续的，您需要像这样声明内存视图

a: cython.int[:,:,::1]

cdef int [:,:,::1] a

如果您想告诉 Cython 数据是 Fortran 连续的，您需要像这样声明内存视图

a: cython.int[::1, :, :]

cdef int [::1, :, :] a

如果您对这一切都不理解，您可以跳过这一部分，将数组声明为连续的会限制您函数的使用，因为它会拒绝数组切片作为输入。如果您仍然想了解连续数组的含义，您可以查看 StackOverflow 上的这个答案。

为了说明数字，以下是将内存视图声明为连续的可以获得的速度提升

In [23]: %timeit compute_contiguous.compute(array_1, array_2, a, b, c)
11.1 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

我们现在比 NumPy 版本快了大约 9 倍，比纯 Python 版本快了 6300 倍！

使函数更简洁¶

声明类型可能会使您的代码变得冗长。如果您不介意 Cython 推断变量的 C 类型，您可以在文件顶部使用 infer_types=True 编译器指令。它将为您节省大量的打字工作。

请注意，由于类型声明必须发生在最顶层的缩进级别，Cython 不会推断在其他缩进级别中首次声明的变量的类型。这会改变我们代码的含义。这就是为什么我们仍然需要手动声明 tmp、x 和 y 变量的类型。

实际上，手动给出 tmp 变量的类型在使用融合类型时会很有用。

# cython: infer_types=True
import numpy as np
import cython

DTYPE = np.intc

@cython.cfunc
def clip(a: cython.int, min_value: cython.int, max_value: cython.int) -> cython.int:
    return min(max(a, min_value), max_value)


@cython.boundscheck(False)
@cython.wraparound(False)
def compute(array_1: cython.int[:, ::1], array_2: cython.int[:, ::1],
            a: cython.int, b: cython.int, c: cython.int):

    x_max = array_1.shape[0]
    y_max = array_1.shape[1]

    assert tuple(array_1.shape) == tuple(array_2.shape)

    result = np.zeros((x_max, y_max), dtype=DTYPE)
    result_view: cython.int[:, ::1] = result

    tmp: cython.int
    x: cython.Py_ssize_t
    y: cython.Py_ssize_t

    for x in range(x_max):
        for y in range(y_max):

            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result_view[x, y] = tmp + c

    return result

# cython: infer_types=True
import numpy as np
cimport cython

DTYPE = np.intc


cdef int clip(int a, int min_value, int max_value):
    return min(max(a, min_value), max_value)


@cython.boundscheck(False)
@cython.wraparound(False)
def compute(int[:, ::1] array_1, int[:, ::1] array_2, int a, int b, int c):


    x_max = array_1.shape[0]
    y_max = array_1.shape[1]

    assert tuple(array_1.shape) == tuple(array_2.shape)

    result = np.zeros((x_max, y_max), dtype=DTYPE)
    cdef int[:, ::1] result_view = result

    cdef int tmp
    cdef Py_ssize_t x, y


    for x in range(x_max):
        for y in range(y_max):

            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result_view[x, y] = tmp + c

    return result

我们现在进行速度测试

In [24]: %timeit compute_infer_types.compute(array_1, array_2, a, b, c)
11.5 ms ± 261 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

瞧，速度没有改变。

更通用的代码¶

所有这些速度提升都很不错，但是添加类型会限制我们的代码。目前，这意味着我们的函数只能使用具有 np.intc 类型的 NumPy 数组。是否可以使我们的代码适用于多种 NumPy 数据类型？

是的，借助名为融合类型的新功能。您可以在文档的这一部分中了解更多信息。它类似于 C++ 的模板。它在编译时生成多个函数声明，然后在运行时根据提供的参数类型选择正确的函数。通过在 if 条件中比较类型，还可以根据特定数据类型执行完全不同的代码路径。

在我们的示例中，由于我们不再访问输入数组的 NumPy 的 dtype，我们使用这些 if-else 语句来了解应该为输出数组使用哪种 NumPy 数据类型。

在这种情况下，我们的函数现在适用于整数、双精度浮点数和单精度浮点数。

# cython: infer_types=True
import numpy as np
import cython

my_type = cython.fused_type(cython.int, cython.double, cython.longlong)



@cython.exceptval(check=False)
@cython.cfunc
def clip(a: my_type, min_value: my_type, max_value: my_type) -> my_type:
    return min(max(a, min_value), max_value)


@cython.boundscheck(False)
@cython.wraparound(False)
def compute(array_1: my_type[:, ::1], array_2: my_type[:, ::1], a: my_type, b: my_type, c: my_type):

    x_max = array_1.shape[0]
    y_max = array_1.shape[1]

    assert tuple(array_1.shape) == tuple(array_2.shape)

    if my_type is cython.int:
        dtype = np.intc
    elif my_type is cython.double:
        dtype = np.double
    elif my_type is cython.longlong:
        dtype = np.longlong

    result = np.zeros((x_max, y_max), dtype=dtype)
    result_view: my_type[:, ::1] = result

    tmp: my_type
    x: cython.Py_ssize_t
    y: cython.Py_ssize_t

    for x in range(x_max):
        for y in range(y_max):

            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result_view[x, y] = tmp + c

    return result

# cython: infer_types=True
import numpy as np
cimport cython

ctypedef fused my_type:
    int
    double
    long long


cdef my_type clip(my_type a, my_type min_value, my_type max_value):
    return min(max(a, min_value), max_value)


@cython.boundscheck(False)
@cython.wraparound(False)
def compute(my_type[:, ::1] array_1, my_type[:, ::1] array_2, my_type a, my_type b, my_type c):

    x_max = array_1.shape[0]
    y_max = array_1.shape[1]

    assert tuple(array_1.shape) == tuple(array_2.shape)

    if my_type is int:
        dtype = np.intc
    elif my_type is double:
        dtype = np.double
    elif my_type is cython.longlong:
        dtype = np.longlong

    result = np.zeros((x_max, y_max), dtype=dtype)
    cdef my_type[:, ::1] result_view = result

    cdef my_type tmp
    cdef Py_ssize_t x, y


    for x in range(x_max):
        for y in range(y_max):

            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result_view[x, y] = tmp + c

    return result

我们可以检查输出类型是否正确

>>> compute(array_1, array_2, a, b, c).dtype
dtype('int32')
>>> compute(array_1.astype(np.double), array_2.astype(np.double), a, b, c).dtype
dtype('float64')

我们现在进行速度测试

In [25]: %timeit compute_fused_types.compute(array_1, array_2, a, b, c)
11.5 ms ± 258 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

在编译时创建了更多版本的函数。因此，使用整数执行此函数的速度没有改变是有道理的。

使用多线程¶

Cython 支持 OpenMP。它还有一些围绕它的不错的包装器，例如 prange() 函数。您可以在使用并行性中查看有关 Cython 和并行性的更多信息。由于我们执行逐元素操作，因此可以轻松地将工作分配到多个线程中。重要的是不要忘记将正确的参数传递给编译器以启用 OpenMP。在使用 Jupyter 笔记本时，您应该使用以下单元格魔法

%%cython --force
# distutils: extra_compile_args=-fopenmp
# distutils: extra_link_args=-fopenmp

对于 MSVC（在 Windows 上），您应该使用 /openmp 而不是 -fopenmp。必须释放 GIL（请参阅释放 GIL），这就是为什么我们将 clip() 函数声明为 nogil 的原因。

# distutils: extra_compile_args=-fopenmp
# distutils: extra_link_args=-fopenmp

import numpy as np
import cython
from cython.parallel import prange

my_type = cython.fused_type(cython.int, cython.double, cython.longlong)


# We declare our plain c function nogil
@cython.exceptval(check=False)
@cython.nogil
@cython.cfunc
def clip(a: my_type, min_value: my_type, max_value: my_type) -> my_type:
    return min(max(a, min_value), max_value)


@cython.boundscheck(False)
@cython.wraparound(False)
def compute(array_1: my_type[:, ::1], array_2: my_type[:, ::1], a: my_type, b: my_type, c: my_type):

    x_max: cython.Py_ssize_t = array_1.shape[0]
    y_max: cython.Py_ssize_t = array_1.shape[1]

    assert tuple(array_1.shape) == tuple(array_2.shape)

    if my_type is cython.int:
        dtype = np.intc
    elif my_type is cython.double:
        dtype = np.double
    elif my_type is cython.longlong:
        dtype = np.longlong

    result = np.zeros((x_max, y_max), dtype=dtype)
    result_view: my_type[:, ::1] = result

    tmp: my_type
    x: cython.Py_ssize_t
    y: cython.Py_ssize_t

    # We use prange here.
    for x in prange(x_max, nogil=True):
        for y in range(y_max):

            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result_view[x, y] = tmp + c

    return result

# distutils: extra_compile_args=-fopenmp
# distutils: extra_link_args=-fopenmp

import numpy as np
cimport cython
from cython.parallel import prange

ctypedef fused my_type:
    int
    double
    long long


# We declare our plain c function nogil
cdef my_type clip(my_type a, my_type min_value, my_type max_value) noexcept nogil:
    return min(max(a, min_value), max_value)


@cython.boundscheck(False)
@cython.wraparound(False)
def compute(my_type[:, ::1] array_1, my_type[:, ::1] array_2, my_type a, my_type b, my_type c):

    cdef Py_ssize_t x_max = array_1.shape[0]
    cdef Py_ssize_t y_max = array_1.shape[1]

    assert tuple(array_1.shape) == tuple(array_2.shape)

    if my_type is int:
        dtype = np.intc
    elif my_type is double:
        dtype = np.double
    elif my_type is cython.longlong:
        dtype = np.longlong

    result = np.zeros((x_max, y_max), dtype=dtype)
    cdef my_type[:, ::1] result_view = result

    cdef my_type tmp
    cdef Py_ssize_t x, y


    # We use prange here.
    for x in prange(x_max, nogil=True):
        for y in range(y_max):

            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result_view[x, y] = tmp + c

    return result

注意

目前，Cython 在每次调用 clip() 函数后都会检查是否发生了异常。检查异常需要持有 GIL，这会在 nogil 循环中造成开销。这里需要检查是由于函数返回融合类型导致的错误（有关详细信息，请参见 #5586）。为了避免获取 GIL，该函数被声明为 noexcept 或 @cython.exceptval(check=False)。有关更多详细信息，请参见错误返回值部分。

我们可以通过最小的努力获得显著的速度提升。

In [25]: %timeit compute_prange.compute(array_1, array_2, a, b, c)
9.33 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

我们现在比纯 Python 版本快 7558 倍，比 NumPy 快 11.1 倍！

下一步去哪里？¶

如果您想学习如何使用 Cython 利用 BLAS 或 LAPACK，您可以观看 Ian Henriksen 在 SciPy 2015 上的演讲。
如果您想学习如何在 Cython 中使用 Pythran 作为后端，您可以在 Pythran 作为 NumPy 后端中查看。请注意，使用 Pythran 仅适用于旧的缓冲区语法，目前尚不支持内存视图。