
c++ - CUDA: Working with Arrays of Different Sizes


In this example, I'm trying to create a 10x8 array from the values in a 10x9 array. It looks like I'm accessing memory incorrectly, but I'm not sure where my mistake is.

The code in C++ looks something like this:

for (int h = 0; h < height; h++){
    for (int i = 0; i < (width-2); i++)
        dd[h*(width-2)+i] = hi[h*(width-1)+i] + hi[h*(width-1)+i+1];
}

This is what I tried in CUDA:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdint.h>

#include <cstdlib>   // for system("pause")
#include <iostream>

#define TILE_WIDTH 4

using namespace std;

__global__ void cudaOffsetArray(int height, int width, float *HI, float *DD){

    int x = blockIdx.x * blockDim.x + threadIdx.x; // Col // width
    int y = blockIdx.y * blockDim.y + threadIdx.y; // Row // height
    int grid_width = gridDim.x * blockDim.x;
    //int index = y * grid_width + x;

    if ((x < (width - 2)) && (y < (height)))
        DD[y * (grid_width - 2) + x] = (HI[y * (grid_width - 1) + x] + HI[y * (grid_width - 1) + x + 1]);
}

int main(){

    int height = 10;
    int width = 10;

    float *HI = new float [height * (width - 1)];
    for (int i = 0; i < height; i++){
        for (int j = 0; j < (width - 1); j++)
            HI[i * (width - 1) + j] = 1;
    }

    float *gpu_HI;
    float *gpu_DD;
    cudaMalloc((void **)&gpu_HI, (height * (width - 1) * sizeof(float)));
    cudaMalloc((void **)&gpu_DD, (height * (width - 2) * sizeof(float)));

    cudaMemcpy(gpu_HI, HI, (height * (width - 1) * sizeof(float)), cudaMemcpyHostToDevice);

    dim3 dimGrid((width * 2 - 1) / TILE_WIDTH + 1, (height - 1) / TILE_WIDTH + 1, 1);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);

    cudaOffsetArray<<<dimGrid,dimBlock>>>(height, width, gpu_HI, gpu_DD);

    float *result = new float[height * (width - 2)];
    cudaMemcpy(result, gpu_DD, (height * (width - 2) * sizeof(float)), cudaMemcpyDeviceToHost);

    for (int i = 0; i < height; i++){
        for (int j = 0; j < (width - 2); j++)
            cout << result[i * (width - 2) + j] << " ";
        cout << endl;
    }

    cudaFree(gpu_HI);
    cudaFree(gpu_DD);
    delete[] result;
    delete[] HI;

    system("pause");
}

I have also tried this in the global function:

if ((x < (width - 2)) && (y < (height)))
    DD[y * (grid_width - 2) + (blockIdx.x - 2) * blockDim.x + threadIdx.x] =
        (HI[y * (grid_width - 1) + (blockIdx.x - 1) * blockDim.x + threadIdx.x] +
         HI[y * (grid_width - 1) + (blockIdx.x - 1) * blockDim.x + threadIdx.x + 1]);

Best Answer

To "fix" your code, change each use of grid_width to width in this line of your kernel:

    DD[y * (grid_width - 2) + x] = (HI[y * (grid_width - 1) + x] + HI[y * (grid_width - 1) + x + 1]);

Like this:

    DD[y * (width - 2) + x] = (HI[y * (width - 1) + x] + HI[y * (width - 1) + x + 1]);

Explanation:

Your grid_width:

dim3 dimGrid((width * 2 - 1) / TILE_WIDTH + 1, (height - 1)/TILE_WIDTH + 1, 1);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);

does not actually correspond to any of your array sizes (10x10, 10x9, or 10x8). I'm not sure why you are launching 2*width threads in the x dimension, but it means your thread array is considerably larger than your data arrays.
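
To make the mismatch concrete, here is a small host-side sketch using the values from the question (width = 10, TILE_WIDTH = 4); the variable names here are only for illustration:

#include <cstdio>

int main(){
    const int width = 10;
    const int TILE_WIDTH = 4;
    // x-dimension of the launch configuration from the question
    int blocks_x   = (width * 2 - 1) / TILE_WIDTH + 1;   // 19/4 + 1 = 5 blocks
    int grid_width = blocks_x * TILE_WIDTH;              // 5 * 4 = 20 threads across
    printf("grid_width   = %d\n", grid_width);           // 20
    printf("HI row width = %d, DD row width = %d\n",
           width - 1, width - 2);                        // 9, 8
    // Row strides of (grid_width - 1) = 19 and (grid_width - 2) = 18 step far
    // past the 9- and 8-element rows that were actually allocated.
    return 0;
}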

So when you use grid_width in your kernel:

    DD[y * (grid_width - 2) + x] = (HI[y * (grid_width - 1) + x] + HI[y * (grid_width - 1) + x + 1]);

the indexing will be a problem. If you instead change each instance of grid_width above to width (which corresponds to the actual width of the data arrays), I think you'll get better indexing. Launching "extra threads" is normally not a problem, because your kernel has a thread-check line:

if ((x < (width - 2)) && (y < (height)))

But when you launch the extra threads, it makes your grid larger, so you cannot use the grid dimensions to index correctly into your data arrays.
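
For reference, a minimal sketch of the kernel with the suggested fix applied (each grid_width in the indexing replaced by width; everything else from the question is left unchanged):

__global__ void cudaOffsetArray(int height, int width, float *HI, float *DD){
    int x = blockIdx.x * blockDim.x + threadIdx.x; // column
    int y = blockIdx.y * blockDim.y + threadIdx.y; // row

    // Bounds-check against the data sizes, not the (larger) thread grid.
    if ((x < (width - 2)) && (y < height))
        // HI rows hold (width - 1) elements and DD rows hold (width - 2),
        // so the row strides come from width, not from the grid dimensions.
        DD[y * (width - 2) + x] = HI[y * (width - 1) + x] + HI[y * (width - 1) + x + 1];
}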

Regarding c++ - CUDA: Working with Arrays of Different Sizes, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25246288/
