arrays - Sending a 2D array to a CUDA kernel

Reposted. Author: 行者123. Updated: 2023-12-04 15:25:20

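I'm having some trouble understanding how to send a 2D array to CUDA. I have a program that parses a large file with 30 data points on each line. I read about 10 lines at a time and build a matrix with one row per line and one column per item (so in my example of 10 lines with 30 data points, it is int list[10][30];). My goal is to send this array to my kernel and have each block process one row. (I have this working perfectly in plain C, but CUDA has been more challenging.)

Here is what I have so far, without any luck (note: sizeOfBuckets = rows, and sizeOfBucketsHoldings = items per row... I know I should win an award for the odd variable names):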

    int list[sizeOfBuckets][sizeOfBucketsHoldings]; //this is created at the start of the file and I can confirm it's filled with the correct data
    #define sizeOfBuckets 10 //size of buckets before sending to process list
    #define sizeOfBucketsHoldings 30
    //Cuda part
    //define device variables
    int *dev_current_list[sizeOfBuckets][sizeOfBucketsHoldings];
    //time to malloc the 2D array on device
    size_t pitch;
    cudaMallocPitch((int**)&dev_current_list, (size_t *)&pitch, sizeOfBucketsHoldings * sizeof(int), sizeOfBuckets);

    //copy data from host to device
    cudaMemcpy2D( dev_current_list, pitch, list, sizeOfBuckets * sizeof(int), sizeOfBuckets * sizeof(int), sizeOfBucketsHoldings * sizeof(int),cudaMemcpyHostToDevice );

    process_list<<<count,1>>> (sizeOfBuckets, sizeOfBucketsHoldings, dev_current_list, pitch);
    //free memory of device
    cudaFree( dev_current_list );


    __global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, int pitch) {
        int tid = blockIdx.x;
        for (int r = 0; r < sizeOfBuckets; ++r) {
            int* row = (int*)((char*)current_list + r * pitch);
            for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
                int element = row[c];
            }
        }
    }

The error I get is:
main.cu(266): error: argument of type "int *(*)[30]" is incompatible with parameter of type "int *"
1 error detected in the compilation of "/tmp/tmpxft_00003f32_00000000-4_main.cpp1.ii".

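Line 266 is the kernel call process_list<<<count,1>>> (count, countListItem, dev_current_list, pitch); I think the problem is that I'm declaring the array parameter in my function as int *, but how else can I declare it? In my pure C code I use int current_list[num_of_rows][num_items_in_row], which works, but I can't get the same result in CUDA.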

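My end goal is simple: I just want each block to process one row (sizeOfBuckets) and loop over all the items in that row (sizeOfBucketsHoldings). I originally did a plain cudaMalloc and cudaMemcpy, but that didn't work, so I looked around and found cudaMallocPitch and cudaMemcpy2D (neither of which is in my CUDA by Example book). I have been trying to study examples, but they all seem to give me the same error (I'm currently reading the CUDA C Programming Guide and found this idea on page 22, but still no luck). Any ideas? Or suggestions on where to look?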

Edit:
To test this, I just want to add up the values in each row (I copied the logic from the array addition example in CUDA by Example).
My kernel:
    __global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, size_t pitch, int *total) {
        //TODO: we need to flip the list as well
        int tid = blockIdx.x;
        for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
            total[tid] = total + current_list[tid][c];
        }
    }

This is how I declare the total array in my main file:
    int *dev_total;
    cudaMalloc( (void**)&dev_total, sizeOfBuckets * sizeof(int) );

Best answer

There are a few mistakes in your code.

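  • When you copy the host array to the device, you should pass a one-dimensional host pointer. See the cudaMemcpy2D function signature.
  • You don't need to declare a static 2D array for the device memory. That declaration creates a static array in host memory, which you then try to re-create as a device array. Keep in mind that the device buffer must also be a one-dimensional pointer. See the cudaMallocPitch function signature.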

  • This example should help you with the memory allocation:
    #include <cstdio>
    #include <cuda_runtime.h>

    // One block per row: block `tid` sums the elements of row `tid`.
    __global__ void process_list(int sizeOfBucketsHoldings, int* total, int* current_list, size_t pitch)
    {
        int tid = blockIdx.x;
        total[tid] = 0;
        for (int c = 0; c < sizeOfBucketsHoldings; ++c)
        {
            // pitch is in bytes: offset the row through a char*, then index ints within the row
            total[tid] += *((int*)((char*)current_list + tid * pitch) + c);
        }
    }

    int main()
    {
        size_t sizeOfBuckets = 10;         // rows
        size_t sizeOfBucketsHoldings = 30; // items per row

        size_t width = sizeOfBucketsHoldings * sizeof(int); // row width, needs to be in bytes
        size_t height = sizeOfBuckets;                      // number of rows

        int* list = new int[sizeOfBuckets * sizeOfBucketsHoldings]; // one-dimensional host array
        for (size_t i = 0; i < sizeOfBuckets; i++)
            for (size_t j = 0; j < sizeOfBucketsHoldings; j++)
                list[i * sizeOfBucketsHoldings + j] = (int)i;

        size_t pitch_h = sizeOfBucketsHoldings * sizeof(int); // host pitch, always in bytes

        int* dev_current_list;
        size_t pitch_d; // device pitch in bytes, chosen by cudaMallocPitch
        cudaMallocPitch((void**)&dev_current_list, &pitch_d, width, height);

        int* test;
        cudaMalloc((void**)&test, sizeOfBuckets * sizeof(int));
        int* h_test = new int[sizeOfBuckets];

        cudaMemcpy2D(dev_current_list, pitch_d, list, pitch_h, width, height, cudaMemcpyHostToDevice);

        process_list<<<(unsigned int)sizeOfBuckets, 1>>>((int)sizeOfBucketsHoldings, test, dev_current_list, pitch_d);
        cudaDeviceSynchronize();

        cudaMemcpy(h_test, test, sizeOfBuckets * sizeof(int), cudaMemcpyDeviceToHost);

        for (size_t i = 0; i < sizeOfBuckets; i++)
            printf("%zu %d\n", i, h_test[i]);

        // release device and host memory
        cudaFree(dev_current_list);
        cudaFree(test);
        delete[] list;
        delete[] h_test;
        return 0;
    }

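    To access the 2D array in the kernel, use the pattern base_addr + y * pitch_d + x, where the y * pitch_d offset is applied to a byte pointer and x indexes elements within the row.
    Warning: the pitch is always in bytes. Cast the pointer to a byte pointer (char*) before adding y * pitch_d, then cast back to int* before indexing with x.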

    Regarding "arrays - Sending a 2D array to a CUDA kernel", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/11149793/
