clSetKernelArg changes arg_value from 16 to 140733193388048?

Reposted by: 太空宇宙. Last updated: 2023-11-04 04:54:00

I'm digging into OpenCL by implementing a matrix dot product. I'm having trouble getting my kernel to return the same values as my host implementation.

I wrote a wrapper function that allocates device memory, sets the kernel arguments, runs the kernel, and returns the result to the host.

/* This function runs the matrix dot product on whatever OpenCL device
 * you specify
 */
cl_int OpenCL_MatrixMul(cl_device_id * device, cl_context * context, 
    cl_command_queue * commandQueue, cl_kernel * matrixMulKernel, float * A_h, 
    float * B_h, float * C_h, const cl_uint HeightA, const cl_uint WidthB, 
    const cl_uint WidthAHeightB)
{
    printf("Inside matrix mul, WidthA: %zu, WidthB: %zu, WidthAHeightB: %zu\n", 
        HeightA, WidthB, WidthAHeightB);

    //this error variable will record any errors found and will be returned 
    //by this function
    cl_int error = CL_SUCCESS;
    cl_int clEnqueueReadBuffer_error;

    //declare a place for the memory on the device, A is the A matrix, 
    //B is the B matrix, C is the C result matrix
    cl_mem A_d, B_d, C_d;               
    //this is a temporary value for holding the maximum work group size
    size_t maximum_local_ws;

    //variable for holding the number of work items per group
    size_t local_ws[2]; 
    //variable for holding the number of work items              
    size_t global_ws[2];            

    //calculate work group and local size
    //get the maximum work group size for the kernel, i.e. set local_ws
    clGetKernelWorkGroupInfo((* matrixMulKernel), (* device), 
        CL_KERNEL_WORK_GROUP_SIZE, sizeof(maximum_local_ws), 
        &maximum_local_ws, NULL);

    //find the largest power of 2, i, such that i * i <= maximum_local_ws 
    //and i <= maxBlockSize (16)
    for(size_t i = 1; (i * i) <= maximum_local_ws && i <= maxBlockSize; i *= 2)
    {
        local_ws[0] = i;
        local_ws[1] = i;
    }
    //calculate global work size
    global_ws[0] = WidthB;  
    global_ws[1] = HeightA;

    printf("Work group size calculated.\n");

    //Allocate global memory on the device
    //put A on the device
    A_d = clCreateBuffer ((* context), CL_MEM_COPY_HOST_PTR, 
        (WidthAHeightB * HeightA * sizeof(float)), A_h, &error);    
    //put B on the device   
    B_d = clCreateBuffer ((* context), CL_MEM_COPY_HOST_PTR, 
        (WidthB * WidthAHeightB * sizeof(float)), B_h, &error);
    //create a space for C on the device        
    C_d = clCreateBuffer ((* context), CL_MEM_READ_WRITE, 
        (HeightA * WidthB * sizeof(float)), NULL, &error);              

    printf("Global memory allocated.\n");

    if(error == CL_SUCCESS)
    {
        //set the parameters of the kernel
        //Put in A
        error  = clSetKernelArg((* matrixMulKernel), 0, sizeof(cl_mem), &A_d);
        //Put in B                                                  
        error |= clSetKernelArg((* matrixMulKernel), 1, sizeof(cl_mem), &B_d);
        //Put in C                                  
        error |= clSetKernelArg((* matrixMulKernel), 2, sizeof(cl_mem), &C_d);                          
        //Put in HeightA
        error |= clSetKernelArg((* matrixMulKernel), 3, sizeof(cl_uint), 
            &HeightA);                              
        //Put in WidthB
        error |= clSetKernelArg((* matrixMulKernel), 4, sizeof(cl_uint), 
            &WidthB);                               
        //Put in WidthAHeightB
        error |= clSetKernelArg((* matrixMulKernel), 5, sizeof(cl_uint),
            &WidthAHeightB);                        

        printf("Parameters added to the kernel.\n");

        if(error == CL_SUCCESS)
        {
            //execute the kernel
            printf("Running Kernel, Local work size: %zu x %zu global worksize: "
                "%zu x %zu, HeightA: %zu, WidthB: %zu, WidthAHeightB: %zu\n", 
                local_ws[0], local_ws[1], global_ws[0], global_ws[1], 
                HeightA, WidthB, WidthAHeightB);
            error = clEnqueueNDRangeKernel((* commandQueue),   
                (* matrixMulKernel), 1, NULL, global_ws, local_ws, 0, NULL, 
                NULL);

            printf("Kernel Ran.\n");

            if(error == CL_SUCCESS)
            {
                 printf("Kernel Launched Successfully\n");
            }
            else
            {
                printf("Kernel Not Launched\n");
            }
        }
    }
    else 
    {
        printf("Parameters not added to the kernel.\n");
    }
    printf("Reading results back from device\n");

    //read the result back to the host system, (copy C_d to C_h)
    clEnqueueReadBuffer_error = clEnqueueReadBuffer((* commandQueue), C_d,  
        CL_TRUE, 0, HeightA * WidthB * sizeof(float), C_h, 0, NULL, NULL);

    //make sure we don't write over previous errors, if 
    //clEnqueueReadBuffer_error has an error
    if(error == CL_SUCCESS)
    {
        error = clEnqueueReadBuffer_error;
    }

    printf("Freeing device memory\n");

    //Free global memory on the device
    clReleaseMemObject(A_d);
    clReleaseMemObject(B_d);
    clReleaseMemObject(C_d);

    return error;
}

When run, this function prints something odd:

Inside matrix mul, WidthA: 16, WidthB: 16, WidthAHeightB: 16
Work group size calculated.
Global memory allocated.
Parameters added to the kernel.
Running Kernel, Local work size: 1 x 1 global worksize: 16 x 16, HeightA: 16, WidthB: 140733193388048, WidthAHeightB: 16
Kernel Ran.
Kernel Launched Successfully
Reading results back from device
Freeing device memory

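For some reason, the value of WidthB changes from 16 to 140733193388048. Oddly, only WidthB differs; HeightA and WidthAHeightB keep their values even though they are handled the same way. The value 140733193388048 is also strangely deterministic: it is the same on every call I make.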

As a result, the first row of the matrix returned by my device matches the host, but the subsequent values differ.
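
One plausible reading of this symptom: the call to clEnqueueNDRangeKernel passes a work dimension of 1, so only global_ws[0] work items are launched and get_global_id(1) is 0 for every one of them. The kernel source is not shown, so the sketch below assumes a hypothetical kernel that uses get_global_id(0) as the column and get_global_id(1) as the row. It is plain C that imitates the index space on the host (no OpenCL required) and counts how many rows of the result get written; rows_written and MAX_ROWS are names made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

#define MAX_ROWS 64  /* illustrative upper bound, not from the original code */

/* Imitates an NDRange launch on the host and returns how many distinct
 * rows of the result matrix are touched.
 * Assumption (hypothetical kernel): col = get_global_id(0),
 * row = get_global_id(1). Dimensions beyond work_dim are not launched,
 * so their global id is always 0. */
size_t rows_written(unsigned work_dim, const size_t global_ws[2])
{
    unsigned char touched[MAX_ROWS];
    memset(touched, 0, sizeof touched);

    size_t n0 = global_ws[0];
    size_t n1 = (work_dim >= 2) ? global_ws[1] : 1;

    for (size_t gid1 = 0; gid1 < n1 && gid1 < MAX_ROWS; ++gid1) {
        for (size_t gid0 = 0; gid0 < n0; ++gid0) {
            touched[gid1] = 1;  /* this work item writes C[gid1][gid0] */
        }
    }

    size_t count = 0;
    for (size_t r = 0; r < MAX_ROWS; ++r) {
        count += touched[r];
    }
    return count;
}
```

With the 16 x 16 global size from the output above, rows_written(1, global_ws) gives 1 and rows_written(2, global_ws) gives 16, which is consistent with only the first row matching the host.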

I'm programming on Mac OS X, using Apple's OpenCL implementation in Snow Leopard.

What is going on here, and how can I prevent it?

Best answer

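One reason my kernel was not returning the correct answer is that I did not give clEnqueueNDRangeKernel the correct number of work dimensions. I still get the strange output for WidthB, and it bothers me that, if I want to debug my program, I know my printed output will be inaccurate.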
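
The odd WidthB value can most likely be traced to the printf calls rather than to clSetKernelArg. HeightA, WidthB and WidthAHeightB are cl_uint (32-bit), but they are printed with %zu, which expects a size_t (64-bit on this platform). That mismatch is undefined behavior, so printf may read the 32 bits of the real value together with 32 bits of whatever sits next to it. The sketch below splits the printed number into its two halves and formats a 32-bit value with a matching specifier; the helper names are made up for illustration, and uint32_t stands in for cl_uint.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Upper 32 bits of a 64-bit value. */
uint32_t high_half(uint64_t v) { return (uint32_t)(v >> 32); }

/* Lower 32 bits of a 64-bit value. */
uint32_t low_half(uint64_t v) { return (uint32_t)(v & 0xFFFFFFFFu); }

/* Prints both halves so stray upper bits are easy to spot. */
void print_halves(uint64_t v)
{
    printf("high: 0x%" PRIX32 ", low: %" PRIu32 "\n",
           high_half(v), low_half(v));
}

/* Formats a 32-bit unsigned value with a matching specifier (%u),
 * instead of %zu. Returns the number of characters written. */
int format_width(char *buf, size_t len, uint32_t width_b)
{
    return snprintf(buf, len, "WidthB: %u", (unsigned)width_b);
}
```

For 140733193388048 the upper half is 0x7FFF and the lower half is 16, so the real WidthB is intact and the rest is stray data. 0x7FFF is typical of the top bits of a stack address in a 64-bit process, which would also explain why the value repeats from run to run. Printing the cl_uint values with %u, or casting them to size_t before using %zu, should make the debug output trustworthy; the value handed to clSetKernelArg is not changed by the printf call.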

Regarding "clSetKernelArg changes arg_value from 16 to 140733193388048?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/11574583/
