python - How do I pass data larger than the VRAM size to the GPU?

Reposted by: 行者123. Updated: 2023-12-02 19:59:32

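I am trying to pass more data to the GPU than it has VRAM, which results in the following error: CudaAPIError: Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

I wrote this code to reproduce the problem: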

from numba import cuda
import numpy as np


@cuda.jit()
def addingNumbers (big_array, big_array2, save_array):
    i = cuda.grid(1)
    if i < big_array.shape[0]:
        for j in range (big_array.shape[1]):
            save_array[i][j] = big_array[i][j] * big_array2[i][j]



big_array = np.random.random_sample((1000000, 500))
big_array2  = np.random.random_sample((1000000, 500))
save_array = np.zeros(shape=(1000000, 500))


arraysize = 1000000
threadsperblock = 64
blockspergrid = (arraysize + (threadsperblock - 1)) // threadsperblock  # ceiling division: one thread per row


d_big_array = cuda.to_device(big_array)
d_big_array2 = cuda.to_device(big_array2)
d_save_array = cuda.to_device(save_array)

addingNumbers[blockspergrid, threadsperblock](d_big_array, d_big_array2, d_save_array)

save_array = d_save_array.copy_to_host()

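Is there a way to pass data to the GPU dynamically, so that I can process more data than the VRAM can hold? If not, what is the recommended way to pass all of this data to the GPU manually? Is dask_cuda an option, or something of a similar nature?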
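For a sense of scale, here is a rough back-of-the-envelope calculation of how much device memory the reproduction asks for. It assumes the arrays keep NumPy's default float64 dtype, as in the code above. The helper max_chunk_rows and the 2 GB free-memory figure are hypothetical, introduced only for illustration; on a real device the free amount could be read with cuda.current_context().get_memory_info().

```python
import numpy as np

rows, cols = 1_000_000, 500
# np.random.random_sample and np.zeros both default to float64 (8 bytes)
itemsize = np.dtype(np.float64).itemsize

bytes_per_array = rows * cols * itemsize
total_bytes = 3 * bytes_per_array  # big_array, big_array2, save_array

print(f"per array: {bytes_per_array / 1e9:.1f} GB")
print(f"all three: {total_bytes / 1e9:.1f} GB")


def max_chunk_rows(free_bytes, cols, itemsize, n_arrays=3):
    """Hypothetical helper: how many rows of all n_arrays fit in free_bytes."""
    return free_bytes // (n_arrays * cols * itemsize)


# Assumed figure for illustration only: 2 GB of free device memory
print(max_chunk_rows(2_000_000_000, cols, itemsize))
```

Three arrays of about 4 GB each have to be resident at once for the kernel as written, so any GPU with less than roughly 12 GB free will hit the allocation error.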

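Best answer

A well-written example of how to take a larger problem (i.e. dataset), break it into pieces, and process it piecewise in numba CUDA is here. In particular, the variant of interest is pricer_cuda_overlap.py. Unfortunately, that example uses random number generation functionality from Accelerate.cuda.rand that I believe is deprecated, so it cannot be run directly in today's numba (I think).

However, for the purposes of the question here, the random number generation is irrelevant, so we can simply remove it without affecting the important observations. What follows is a single file assembled from pieces of the various files in that example: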

$ cat t45.py
#! /usr/bin/env python
"""
This version demonstrates copy-compute overlapping through multiple streams.
"""
from __future__ import print_function

import math
import sys

import numpy as np

from numba import cuda, jit

from math import sqrt, exp
from timeit import default_timer as timer
from collections import deque

StockPrice = 20.83
StrikePrice = 21.50
Volatility = 0.021  #  per year
InterestRate = 0.20

Maturity = 5. / 12.

NumPath = 500000
NumStep = 200

def driver(pricer, pinned=False):
    paths = np.zeros((NumPath, NumStep + 1), order='F')
    paths[:, 0] = StockPrice
    DT = Maturity / NumStep

    if pinned:
        from numba import cuda
        with cuda.pinned(paths):
            ts = timer()
            pricer(paths, DT, InterestRate, Volatility)
            te = timer()
    else:
        ts = timer()
        pricer(paths, DT, InterestRate, Volatility)
        te = timer()

    ST = paths[:, -1]
    PaidOff = np.maximum(paths[:, -1] - StrikePrice, 0)
    print('Result')
    fmt = '%20s: %s'
    print(fmt % ('stock price', np.mean(ST)))
    print(fmt % ('standard error', np.std(ST) / sqrt(NumPath)))
    print(fmt % ('paid off', np.mean(PaidOff)))
    optionprice = np.mean(PaidOff) * exp(-InterestRate * Maturity)
    print(fmt % ('option price', optionprice))

    print('Performance')
    NumCompute = NumPath * NumStep
    print(fmt % ('Mstep/second', '%.2f' % (NumCompute / (te - ts) / 1e6)))
    print(fmt % ('time elapsed', '%.3fs' % (te - ts)))

class MM(object):
    """Memory Manager

    Maintain a freelist of device memory for reuse.
    """
    def __init__(self, shape, dtype, prealloc):
        self.device = cuda.get_current_device()
        self.freelist = deque()
        self.events = {}
        for i in range(prealloc):
            gpumem = cuda.device_array(shape=shape, dtype=dtype)
            self.freelist.append(gpumem)
            self.events[gpumem] = cuda.event(timing=False)

    def get(self, stream=0):
        assert self.freelist
        gpumem = self.freelist.popleft()
        evnt = self.events[gpumem]
        if not evnt.query(): # not ready?
            # querying is faster than waiting
            evnt.wait(stream=stream) # future works must wait
        return gpumem

    def free(self, gpumem, stream=0):
        evnt = self.events[gpumem]
        evnt.record(stream=stream)
        self.freelist.append(gpumem)


if sys.version_info[0] == 2:
    range = xrange

# explicit signature: three 1-D float64 arrays and three float64 scalars
@cuda.jit('void(double[:], double[:], double, double, double, double[:])')
def cu_step(last, paths, dt, c0, c1, normdist):
    i = cuda.grid(1)
    if i >= paths.shape[0]:
        return
    noise = normdist[i]
    paths[i] = last[i] * math.exp(c0 * dt + c1 * noise)

def monte_carlo_pricer(paths, dt, interest, volatility):
    n = paths.shape[0]
    num_streams = 2

    part_width = int(math.ceil(float(n) / num_streams))
    partitions = [(0, part_width)]
    for i in range(1, num_streams):
        begin, end = partitions[i - 1]
        begin, end = end, min(end + (end - begin), n)
        partitions.append((begin, end))
    partlens = [end - begin for begin, end in partitions]

    mm = MM(shape=part_width, dtype=np.double, prealloc=10 * num_streams)

    device = cuda.get_current_device()
    blksz = device.MAX_THREADS_PER_BLOCK
    gridszlist = [int(math.ceil(float(partlen) / blksz))
                  for partlen in partlens]

    strmlist = [cuda.stream() for _ in range(num_streams)]

    # Allocate device side array - in original example this would be initialized with random numbers
    d_normlist = [cuda.device_array(partlen, dtype=np.double, stream=strm)
                  for partlen, strm in zip(partlens, strmlist)]

    c0 = interest - 0.5 * volatility ** 2
    c1 = volatility * math.sqrt(dt)

    # Configure the kernel
    # Similar to CUDA-C: cu_monte_carlo_pricer<<<gridsz, blksz, 0, stream>>>
    steplist = [cu_step[gridsz, blksz, strm]
               for gridsz, strm in zip(gridszlist, strmlist)]

    d_lastlist = [cuda.to_device(paths[s:e, 0], to=mm.get(stream=strm))
                  for (s, e), strm in zip(partitions, strmlist)]

    for j in range(1, paths.shape[1]):

        d_pathslist = [cuda.to_device(paths[s:e, j], stream=strm,
                                      to=mm.get(stream=strm))
                       for (s, e), strm in zip(partitions, strmlist)]

        for step, args in zip(steplist, zip(d_lastlist, d_pathslist, d_normlist)):
            d_last, d_paths, d_norm = args
            step(d_last, d_paths, dt, c0, c1, d_norm)

        for d_paths, strm, (s, e) in zip(d_pathslist, strmlist, partitions):
            d_paths.copy_to_host(paths[s:e, j], stream=strm)
            mm.free(d_paths, stream=strm)
        d_lastlist = d_pathslist

    for strm in strmlist:
        strm.synchronize()

if __name__ == '__main__':
    driver(monte_carlo_pricer, pinned=True)
$ python t45.py
Result
         stock price: 22.6720614385
      standard error: 0.0
            paid off: 1.17206143849
        option price: 1.07834858009
Performance
        Mstep/second: 336.40
        time elapsed: 0.297s
$

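There is a lot going on in this example, and the general topic of how to write pipelined/overlapped code in CUDA would be a complete answer in itself, so I will only cover the highlights. The general topic is covered well in this blog post, although that post has CUDA C++ in view rather than numba CUDA (python). However, there is a 1:1 correspondence between most items of interest in numba CUDA and their equivalents in CUDA C++. I will therefore assume that you already understand basic concepts such as CUDA streams and how they are used to arrange asynchronous, concurrent activity.

So what is this example doing? I will focus mostly on the CUDA aspects.

To allow copy and compute operations to overlap, the input data (paths) is converted to CUDA pinned memory on the host.
To process the work in chunks, a memory manager (MM) is defined, which allows chunk-sized device memory allocations to be reused as processing proceeds.
Python lists are created to represent the order in which chunks are processed. One list defines the beginning and end of each chunk or partition. One list defines the sequence of CUDA streams to use. One list holds the data array partitions that the CUDA kernel will use.
Then, with these lists in hand, the work is issued "depth-first". For each stream, the data (chunk) needed by that stream is transferred to the device (queued for transfer), the kernel that processes that data is launched (queued), and the transfer that sends that chunk's results back to host memory is queued. This process is repeated in the for j loop in monte_carlo_pricer for the number of steps (paths.shape[1]).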
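Applied to the elementwise multiply from the question, the chunking idea described above can be sketched as follows. This is a minimal, synchronous sketch and not the overlapped pipeline of the answer: it uses no streams, no pinned memory, and no device buffer reuse. The names make_partitions, multiply_rows, and multiply_chunked are hypothetical, and the NumPy fallback is an assumption added only so the sketch runs on a machine without a CUDA device.

```python
import numpy as np

try:
    from numba import cuda
    HAVE_GPU = cuda.is_available()
except Exception:  # numba not installed, or no usable CUDA driver
    cuda = None
    HAVE_GPU = False


def make_partitions(n, chunk_rows):
    """Return (begin, end) row ranges that cover range(n) in steps of chunk_rows."""
    return [(b, min(b + chunk_rows, n)) for b in range(0, n, chunk_rows)]


if HAVE_GPU:
    @cuda.jit
    def multiply_rows(a, b, out):
        # one thread per row, as in the kernel from the question
        i = cuda.grid(1)
        if i < a.shape[0]:
            for j in range(a.shape[1]):
                out[i, j] = a[i, j] * b[i, j]


def multiply_chunked(a, b, chunk_rows, threads_per_block=64):
    """Elementwise a * b, with only chunk_rows rows on the device at a time."""
    out = np.empty_like(a)
    for begin, end in make_partitions(a.shape[0], chunk_rows):
        if HAVE_GPU:
            # only this chunk of each array is resident on the device
            d_a = cuda.to_device(a[begin:end])
            d_b = cuda.to_device(b[begin:end])
            d_out = cuda.device_array_like(d_a)
            blocks = (end - begin + threads_per_block - 1) // threads_per_block
            multiply_rows[blocks, threads_per_block](d_a, d_b, d_out)
            d_out.copy_to_host(out[begin:end])
        else:
            # CPU fallback (assumption for illustration, not part of the answer)
            np.multiply(a[begin:end], b[begin:end], out=out[begin:end])
    return out


a = np.random.random_sample((1000, 50))
b = np.random.random_sample((1000, 50))
result = multiply_chunked(a, b, chunk_rows=256)
print(np.allclose(result, a * b))
```

Because each chunk is copied, processed, and copied back before the next one starts, this version bounds device memory use but leaves the GPU idle during transfers, which is the gap that the streams, pinned memory, and memory manager in the answer are meant to close.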

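When I run the above code with a profiler, we can see a timeline like the following:

[profiler timeline image]

In this particular case I am running it on a Quadro K2000, which is an old, small GPU with only one copy engine. Therefore, in the profile we see at most 1 copy operation overlapped with CUDA kernel activity, and no copy operations overlapped with other copy operations. However, if I ran this on a device with 2 copy engines, I would expect a tighter/denser timeline to be possible, with 2 copy operations and a compute operation overlapping at the same time, for maximum throughput. To achieve this, the number of streams in use (num_streams) would also have to be increased to at least 3.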
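The stream pattern described above, with num_streams raised to 3, might look roughly like the sketch below on a simple 1-D problem. It is an illustration under stated assumptions rather than the answer's code: the kernel scale and the function scale_streamed are hypothetical names, one input and one output device buffer are preallocated per stream and reused (a simpler stand-in for the MM class), and the NumPy fallback exists only so the sketch runs without a CUDA device. Whether the transfers actually overlap depends on the GPU's copy engines, so no timing claims are made here.

```python
import numpy as np

try:
    from numba import cuda
    HAVE_GPU = cuda.is_available()
except Exception:  # numba not installed, or no usable CUDA driver
    cuda = None
    HAVE_GPU = False

if HAVE_GPU:
    @cuda.jit
    def scale(x, factor, out):
        i = cuda.grid(1)
        if i < x.shape[0]:
            out[i] = x[i] * factor


def scale_streamed(x, factor, chunk, num_streams=3, threads=128):
    """Compute x * factor chunk by chunk, cycling chunks over num_streams streams."""
    out = np.empty_like(x)
    if not HAVE_GPU:
        # CPU fallback (assumption for illustration only)
        np.multiply(x, factor, out=out)
        return out

    streams = [cuda.stream() for _ in range(num_streams)]
    # one reusable input and output buffer per stream bounds device memory use
    d_in = [cuda.device_array(chunk, dtype=x.dtype) for _ in streams]
    d_res = [cuda.device_array(chunk, dtype=x.dtype) for _ in streams]

    # pinned host memory is needed for copies to be asynchronous
    with cuda.pinned(x, out):
        for k, begin in enumerate(range(0, x.shape[0], chunk)):
            end = min(begin + chunk, x.shape[0])
            m = end - begin
            s = k % num_streams
            strm = streams[s]
            # copy in, compute, copy out: all queued on the same stream, so
            # reuse of this stream's buffers is ordered behind earlier work
            d_in[s][:m].copy_to_device(x[begin:end], stream=strm)
            blocks = (m + threads - 1) // threads
            scale[blocks, threads, strm](d_in[s][:m], factor, d_res[s][:m])
            d_res[s][:m].copy_to_host(out[begin:end], stream=strm)
        for strm in streams:
            strm.synchronize()
    return out


x = np.arange(10.0)
print(np.allclose(scale_streamed(x, 2.0, chunk=4), x * 2.0))
```

Work queued on one stream runs in order, while work on different streams is free to overlap, which is why each chunk's copy-in, kernel, and copy-out all go to the same stream and successive chunks go to different ones.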

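The code here is not guaranteed to be free of defects. It is provided for demonstration purposes. Use it at your own risk.

Regarding python - How do I pass data larger than the VRAM size to the GPU?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56176077/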
