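I am digging into OpenCL by implementing a matrix dot product. I am having trouble getting my kernel to return the same values as my host.
I wrote a wrapper function that allocates device memory, sets the kernel arguments, runs the kernel, and returns the result to the host.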
/* This function runs the matrix dot product on whatever OpenCL device
 * you specify
 */
cl_int OpenCL_MatrixMul(cl_device_id * device, cl_context * context,
    cl_command_queue * commandQueue, cl_kernel * matrixMulKernel, float * A_h,
    float * B_h, float * C_h, const cl_uint HeightA, const cl_uint WidthB,
    const cl_uint WidthAHeightB)
{
    printf("Inside matrix mul, WidthA: %zu, WidthB: %zu, WidthAHeightB: %zu\n",
        HeightA, WidthB, WidthAHeightB);
    //this error variable will record any errors found and will be returned
    //by this function
    cl_int error = CL_SUCCESS;
    cl_int clEnqueueReadBuffer_error;
    //declare a place for the memory on the device, A is the A matrix,
    //B is the B matrix, C is the C result matrix
    cl_mem A_d, B_d, C_d;
    //this is a temporary value for holding the maximum work group size
    size_t maximum_local_ws;
    //variable for holding the number of work items per group
    size_t local_ws[2];
    //variable for holding the number of work items
    size_t global_ws[2];
    //calculate work group and local size
    //get the maximum work group size for the kernel, i.e. set local_ws
    clGetKernelWorkGroupInfo((* matrixMulKernel), (* device),
        CL_KERNEL_WORK_GROUP_SIZE, sizeof(maximum_local_ws),
        &maximum_local_ws, NULL);
    //find the largest power-of-2 side length i such that i * i fits in
    //maximum_local_ws and i is less than or equal to maxBlockSize (16)
    for(size_t i = 1; (i * i) <= maximum_local_ws && i <= maxBlockSize; i *= 2)
    {
        local_ws[0] = i;
        local_ws[1] = i;
    }
    //calculate global work size
    global_ws[0] = WidthB;
    global_ws[1] = HeightA;
    printf("Work group size calculated.\n");
    //Allocate global memory on the device
    //put A on the device
    A_d = clCreateBuffer ((* context), CL_MEM_COPY_HOST_PTR,
        (WidthAHeightB * HeightA * sizeof(float)), A_h, &error);
    //put B on the device
    B_d = clCreateBuffer ((* context), CL_MEM_COPY_HOST_PTR,
        (WidthB * WidthAHeightB * sizeof(float)), B_h, &error);
    //create a space for C on the device
    C_d = clCreateBuffer ((* context), CL_MEM_READ_WRITE,
        (HeightA * WidthB * sizeof(float)), NULL, &error);
    printf("Global memory allocated.\n");
    if(error == CL_SUCCESS)
    {
        //set the parameters of the kernel
        //Put in A
        error = clSetKernelArg((* matrixMulKernel), 0, sizeof(cl_mem), &A_d);
        //Put in B
        error |= clSetKernelArg((* matrixMulKernel), 1, sizeof(cl_mem), &B_d);
        //Put in C
        error |= clSetKernelArg((* matrixMulKernel), 2, sizeof(cl_mem), &C_d);
        //Put in HeightA
        error |= clSetKernelArg((* matrixMulKernel), 3, sizeof(cl_uint),
            &HeightA);
        //Put in WidthB
        error |= clSetKernelArg((* matrixMulKernel), 4, sizeof(cl_uint),
            &WidthB);
        //Put in WidthAHeightB
        error |= clSetKernelArg((* matrixMulKernel), 5, sizeof(cl_uint),
            &WidthAHeightB);
        printf("Parameters added to the kernel.\n");
        if(error == CL_SUCCESS)
        {
            //execute the kernel
            printf("Running Kernel, Local work size: %zu x %zu global worksize: "
                "%zu x %zu, HeightA: %zu, WidthB: %zu, WidthAHeightB: %zu\n",
                local_ws[0], local_ws[1], global_ws[0], global_ws[1],
                HeightA, WidthB, WidthAHeightB);
            error = clEnqueueNDRangeKernel((* commandQueue),
                (* matrixMulKernel), 1, NULL, global_ws, local_ws, 0, NULL,
                NULL);
            printf("Kernel Ran.\n");
            if(error == CL_SUCCESS)
            {
                printf("Kernel Launched Successfully\n");
            }
            else
            {
                printf("Kernel Not Launched\n");
            }
        }
    }
    else
    {
        printf("Parameters not added to the kernel.\n");
    }
    printf("Reading results back from device\n");
    //read the result back to the host system, (copy C_d to C_h)
    clEnqueueReadBuffer_error = clEnqueueReadBuffer((* commandQueue), C_d,
        CL_TRUE, 0, HeightA * WidthB * sizeof(float), C_h, 0, NULL, NULL);
    //make sure we don't write over previous errors, if
    //clEnqueueReadBuffer_error has an error
    if(error == CL_SUCCESS)
    {
        error = clEnqueueReadBuffer_error;
    }
    printf("Freeing device memory\n");
    //Free global memory on the device
    clReleaseMemObject(A_d);
    clReleaseMemObject(B_d);
    clReleaseMemObject(C_d);
    return error;
}
When run, this code prints something odd:
Inside matrix mul, WidthA: 16, WidthB: 16, WidthAHeightB: 16
Work group size calculated.
Global memory allocated.
Parameters added to the kernel.
Running Kernel, Local work size: 1 x 1 global worksize: 16 x 16, HeightA: 16, WidthB: 140733193388048, WidthAHeightB: 16
Kernel Ran.
Kernel Launched Successfully
Reading results back from device
Freeing device memory
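For some reason, the value of WidthB changes from 16 to 140733193388048. Strangely, only WidthB differs; WidthA and WidthAHeightB stay the same even though they are passed in the same way. The value 140733193388048 is also oddly deterministic: it is the same in every call I make.
As a result, the first row of the matrix returned by my device matches the host's, but the subsequent values differ.
I am programming on Mac OS X, using Apple's OpenCL implementation in Snow Leopard.
What is going on here, and how do you prevent this kind of thing?
Best answer
One reason my kernel was not returning the correct answer is that I was not giving clEnqueueNDRangeKernel the correct number of work dimensions (the call above passes 1 for work_dim, although global_ws and local_ws describe two dimensions). I still get the odd output for WidthB, and it unsettles me to know that my printouts will be inaccurate if I try to debug my program.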
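The remaining oddity in the printout is most likely a format-specifier mismatch rather than clSetKernelArg changing anything. HeightA, WidthB, and WidthAHeightB are cl_uint (a 32-bit unsigned integer), but the printf calls format them with %zu, which expects a size_t (64 bits on a 64-bit build). That is undefined behavior in C, and in practice printf can pick up 32 bits of leftover data next to the real value. The printed number fits this reading: 140733193388048 is 0x7FFF00000010, whose low 32 bits are 0x10 = 16. Below is a minimal sketch of matching specifiers; `demo_uint` and `format_dims` are hypothetical names, and `uint32_t` stands in for `cl_uint` so the snippet compiles without the OpenCL headers.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: stand-in for cl_uint, which OpenCL defines as an unsigned
 * 32-bit integer. Used so this sketch builds without the OpenCL headers. */
typedef uint32_t demo_uint;

/* Hypothetical helper (not part of the original program): formats the three
 * matrix dimensions using a conversion specifier that matches their 32-bit
 * type. Returns the number of characters snprintf would have written. */
int format_dims(char *buf, size_t len, demo_uint height_a, demo_uint width_b,
                demo_uint width_a_height_b)
{
    assert(buf != NULL);
    return snprintf(buf, len,
                    "HeightA: %" PRIu32 ", WidthB: %" PRIu32
                    ", WidthAHeightB: %" PRIu32,
                    height_a, width_b, width_a_height_b);
}
```

In the wrapper itself, the equivalent change would be to print the cl_uint arguments with %u (or cast them to size_t) while keeping %zu for local_ws and global_ws, which really are size_t.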
Source: the Stack Overflow question "clSetKernelArg changes arg_value from 16 to 140733193388048?": https://stackoverflow.com/questions/11574583/