
c++ - Cannot figure out the speed of clEnqueueNDRangeKernel in OpenCL

Reposted. Author: 行者123. Updated: 2023-11-28 05:49:34

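I am using the clEnqueueNDRangeKernel function, and it is much faster than clEnqueueTask. Still, I can't make it faster than 16 ms, and increasing global_item_size doesn't help. It simply stops getting faster at global_item_size = 3, although I think it should run faster with a larger global size. Am I wrong? How can I fix it?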

__kernel void red_to_green(__global unsigned char *pDataIn, __global unsigned char *pDataOut, unsigned int InSize, unsigned int OutSize)
{
    unsigned int gid = get_global_id(0);
    unsigned int gsize = get_global_size(0);
    unsigned int lid = get_local_id(0);
    unsigned int lsize = get_local_size(0);

    // each work-item processes one contiguous chunk of the input buffer
    unsigned int vstart = ((InSize/gsize) * gid);
    unsigned int vstop = (vstart + (InSize/gsize));

    // average bytes i, i+1 and i+2 of every 4-byte group into one output byte
    for (unsigned int i = vstart; i < vstop; i+=4)
    {
        pDataOut[i/4] = (pDataIn[i] + pDataIn[i + 1] + pDataIn[i + 2]) / 3;
    }
}

vector<unsigned char> pDataIn;
vector<unsigned char> pDataOut;
SizeIn = pDataIn.size();
SizeOut = pDataOut.size();
const size_t cycles_max = 100;
clock_t t4 = clock();
for (size_t i = 0; i < cycles_max; i++) {

    double start_time = clock();
    double search_time = 0;

    // execute the OpenCL kernel
    // ret = clEnqueueTask(command_queue, kernel, 0, NULL, NULL);

    size_t global_item_size = 3;
    size_t local_item_size = 1;

    ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);

    // copy the result back from the device buffer (blocking read)
    ret = clEnqueueReadBuffer(command_queue, memobj1, CL_TRUE, 0, pDataOut.size(), pDataOut.data(), 0, NULL, NULL);

    ret = clFinish(command_queue);

    double end_time = clock(); // end time
    // clock() returns ticks; convert to milliseconds before printing
    search_time = (end_time - start_time) * 1000.0 / CLOCKS_PER_SEC;
    cout << search_time << " ms" << endl;

}
clock_t t5 = clock();
// average per iteration, converted from clock ticks to milliseconds
double avg_ms = (t5 - t4) * 1000.0 / CLOCKS_PER_SEC / cycles_max;
cout << "Average time: " << avg_ms << " ms" << endl;
WriteBmpFile(L"3840x2160_ndrange.bmp", iWidth, iHeight, 8, pDataOut.size(), pDataOut.data(), false);
system("PAUSE");

[Image: output time]

Best answer

Still, I can't make it faster than 16 ms, even adding more global_item_size doesn't help. It just stops working faster on global_item_size = 3 and that's all, still I think that it should work faster with more global_size. Am I wrong? And how can I fix it?

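Increasing only the global size will not help, because you set the local size to 1. That means your work-group size is 1, which is very inefficient. The GPU, an Nvidia GT 740M, has 2 compute units, which means it can generally run 2 work-groups at the same time, so you see no improvement once the global size reaches 3.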
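To make the cost of such a small launch concrete, the index arithmetic from the kernel in the question can be reproduced on the CPU. This is only a minimal sketch: `Chunk`, `chunk_for`, and `pixels_per_item` are hypothetical names introduced here, and a 3840x2160 input with 4 bytes per pixel (33177600 bytes) is an assumption based on the output file name, not something the question states.

```cpp
#include <cstddef>

// Byte range [start, stop) handled by one work-item under the question's
// partitioning: vstart = (InSize / gsize) * gid, vstop = vstart + InSize / gsize.
struct Chunk {
    std::size_t start;
    std::size_t stop;
};

Chunk chunk_for(std::size_t in_size, std::size_t gsize, std::size_t gid) {
    const std::size_t per_item = in_size / gsize;  // integer division, as in the kernel
    return Chunk{per_item * gid, per_item * gid + per_item};
}

// Number of loop iterations one work-item runs serially, given the kernel's
// stride of 4 bytes per iteration.
std::size_t pixels_per_item(std::size_t in_size, std::size_t gsize) {
    const std::size_t per_item = in_size / gsize;
    return (per_item + 3) / 4;
}
```

Under these assumptions, `pixels_per_item(33177600, 3)` is 2764800, so with a global size of 3 each work-item walks through millions of pixels one after another. With one work-item per pixel (a global size of 8294400) the same function returns 1.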

Try increasing the local size to at least 128 (or 512 or 1024) to make full use of the GPU. The CUDA Occupancy Calculator can help determine the optimal settings.
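One way this suggestion might look in practice is sketched here, assuming one work-item per output pixel and 4 bytes per input pixel. `round_up_global_size`, `kernel_source`, and the kernel name `rgb_to_gray` are hypothetical names introduced for illustration and do not come from the question. The local size of 128 is the value suggested in the answer, not a measured optimum.

```cpp
#include <cstddef>

// Round the number of work-items up to a multiple of the work-group size.
// Before OpenCL 2.0, the global work size had to be evenly divisible by the
// local work size, so a few padding work-items may be needed.
std::size_t round_up_global_size(std::size_t work_items, std::size_t local_size) {
    return ((work_items + local_size - 1) / local_size) * local_size;
}

// Hypothetical OpenCL C kernel in which each work-item converts one pixel.
// The bounds check makes the padding work-items do nothing.
const char* kernel_source() {
    return R"CLC(
__kernel void rgb_to_gray(__global const unsigned char *pDataIn,
                          __global unsigned char *pDataOut,
                          unsigned int pixelCount)
{
    unsigned int gid = get_global_id(0);
    if (gid < pixelCount) {
        unsigned int i = gid * 4;  /* assumes 4 bytes per input pixel */
        pDataOut[gid] = (pDataIn[i] + pDataIn[i + 1] + pDataIn[i + 2]) / 3;
    }
}
)CLC";
}

// Host side, reusing the call from the question with the new sizes
// (pixel_count is an assumed variable holding width * height):
//   size_t local_item_size  = 128;
//   size_t global_item_size = round_up_global_size(pixel_count, local_item_size);
//   ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
//                                &global_item_size, &local_item_size, 0, NULL, NULL);
```

For a 3840x2160 image the pixel count is 8294400, which is already a multiple of 128, so no padding is added. For other sizes the helper rounds up, for example from 1000 to 1024. The best local size depends on the device, so it is worth timing a few candidates.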

Regarding "c++ - Cannot figure out the speed of clEnqueueNDRangeKernel in OpenCL", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35539954/
