
java - Speeding up intensity sum calculation with JOCL/OpenCL


Hello, I am new to JOCL (OpenCL). I wrote this code to get the sum of intensities for each image. The kernel takes a single 1D array containing the pixels of all the images, laid out one after another. One image is 300x300, so each image has 90000 pixels. At the moment this is slower than when I do it sequentially.

My code:

package PAR;

/*
* JOCL - Java bindings for OpenCL
*
* Copyright 2009 Marco Hutter - http://www.jocl.org/
*/
import IMAGE_IO.ImageReader;
import IMAGE_IO.Input_Folder;
import static org.jocl.CL.*;

import org.jocl.*;

/**
* A small JOCL sample.
*/
public class IPPARA {

/**
* The source code of the OpenCL program to execute
*/
private static String programSource =
        "__kernel void "
      + "sampleKernel(__global uint *a,"
      + "             __global uint *c)"
      + "{"
      + "    __private uint intensity_core = 0;"
      + "    uint i = get_global_id(0);"
      + "    for (uint j = i * 90000; j < (i + 1) * 90000; j++) {"
      + "        intensity_core += a[j];"
      + "    }"
      + "    c[i] = intensity_core;"
      + "}";

/**
* The entry point of this sample
*
* @param args Not used
*/
public static void main(String args[]) {
    long numBytes[] = new long[1];

    ImageReader imagereader = new ImageReader();
    int srcArrayA[] = imagereader.readImages();

    int size[] = new int[1];
    size[0] = srcArrayA.length;
    long before = System.nanoTime();
    int dstArray[] = new int[size[0] / 90000];

    Pointer srcA = Pointer.to(srcArrayA);
    Pointer dst = Pointer.to(dstArray);

    // Obtain the platform IDs and initialize the context properties
    System.out.println("Obtaining platform...");
    cl_platform_id platforms[] = new cl_platform_id[1];
    clGetPlatformIDs(platforms.length, platforms, null);
    cl_context_properties contextProperties = new cl_context_properties();
    contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);

    // Create an OpenCL context on a CPU device
    cl_context context = clCreateContextFromType(
            contextProperties, CL_DEVICE_TYPE_CPU, null, null, null);
    if (context == null) {
        System.out.println("Unable to create a context");
        return;
    }

    // Enable exceptions and subsequently omit error checks in this sample
    CL.setExceptionsEnabled(true);

    // Get the list of devices associated with the context
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, null, numBytes);

    // Obtain the cl_device_id for the first device
    int numDevices = (int) numBytes[0] / Sizeof.cl_device_id;
    cl_device_id devices[] = new cl_device_id[numDevices];
    clGetContextInfo(context, CL_CONTEXT_DEVICES, numBytes[0],
            Pointer.to(devices), null);

    // Create a command-queue
    cl_command_queue commandQueue =
            clCreateCommandQueue(context, devices[0], 0, null);

    // Allocate the memory objects for the input and output data
    cl_mem memObjects[] = new cl_mem[2];
    memObjects[0] = clCreateBuffer(context,
            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
            Sizeof.cl_uint * srcArrayA.length, srcA, null);
    memObjects[1] = clCreateBuffer(context,
            CL_MEM_READ_WRITE,
            Sizeof.cl_uint * (srcArrayA.length / 90000), null, null);

    // Create the program from the source code
    cl_program program = clCreateProgramWithSource(context,
            1, new String[]{programSource}, null, null);

    // Build the program
    clBuildProgram(program, 0, null, null, null, null);

    // Create the kernel
    cl_kernel kernel = clCreateKernel(program, "sampleKernel", null);

    // Set the arguments for the kernel
    clSetKernelArg(kernel, 0,
            Sizeof.cl_mem, Pointer.to(memObjects[0]));
    clSetKernelArg(kernel, 1,
            Sizeof.cl_mem, Pointer.to(memObjects[1]));

    // Set the work-item dimensions: one work-item per image
    long local_work_size[] = new long[]{1};
    long global_work_size[] = new long[]{(srcArrayA.length / 90000) * local_work_size[0]};

    // Execute the kernel
    clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
            global_work_size, local_work_size, 0, null, null);

    // Read the output data (the buffer holds cl_uint values)
    clEnqueueReadBuffer(commandQueue, memObjects[1], CL_TRUE, 0,
            (srcArrayA.length / 90000) * Sizeof.cl_uint, dst, 0, null, null);

    // Release kernel, program, and memory objects
    clReleaseMemObject(memObjects[0]);
    clReleaseMemObject(memObjects[1]);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(commandQueue);
    clReleaseContext(context);

    long after = System.nanoTime();

    System.out.println("Time: " + (after - before) / 1e9);
}
}

Following the suggestions in the answer, the parallel code on the CPU is now almost as fast as the sequential code. Is there anything more that can be improved?

Best Answer

 for(uint j=i*90000; j < (i+1)*90000; j++){ "
+ " c[i] += a[j];"

1) You are doing the summation in global memory (c[]), which is slow. Use a private variable to make it faster, like this:

          "__kernel void "
+ "sampleKernel(__global uint *a,"
+ " __global uint *c)"
+ "{"
+ "__private uint intensity_core=0;" <---this is a private variable of each core
+ " uint i = get_global_id(0);"
+ " for(uint j=i*90000; j < (i+1)*90000; j++){ "
+ " intensity_core += a[j];" <---register is at least 100x faster than global memory
//but we cannot get rid of a[] so the calculation time cannot be less than %50
+ " }"
+ "c[i]=intensity_core;"
+ "}"; //expecting %100 speedup

Now you have c[number of images], an array of per-image intensity sums.

Your local work size is 1, so as long as you have at least 160 images (the number of cores on your GPU), the computation will use all of the cores.
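If you do not want to hard-code that 160, the device can be asked directly. A minimal JOCL sketch, assuming devices[0] is the device the question's code runs on:

// Query the number of compute units on the chosen device.
int computeUnits[] = new int[1];
clGetDeviceInfo(devices[0], CL_DEVICE_MAX_COMPUTE_UNITS,
        Sizeof.cl_uint, Pointer.to(computeUnits), null);
System.out.println("Compute units: " + computeUnits[0]);
// With local_work_size = 1, at least this many images (work-items)
// are needed to keep every compute unit busy.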

You need 90000*num_images reads and num_images writes to global memory, plus 90000*num_images register reads/writes. Using registers should cut your kernel time roughly in half.
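Going further than the register fix, the serial per-image loop can itself be parallelized: give each image a whole work-group and let its work-items cooperate on the sum. This is a hedged sketch, not part of the original answer; the kernel name imageSum, the GROUP_SIZE of 256, and the launch configuration are all illustrative:

// Hypothetical alternative kernel: one work-group per image, tree
// reduction in __local memory. GROUP_SIZE must be a power of two and
// supported by the device. Launch with:
//   local_work_size  = {GROUP_SIZE}
//   global_work_size = {num_images * GROUP_SIZE}
private static String reductionSource =
  "#define GROUP_SIZE 256\n"
+ "__kernel void imageSum(__global const uint *a, __global uint *c)\n"
+ "{\n"
+ "    __local uint partial[GROUP_SIZE];\n"
+ "    uint img = get_group_id(0);\n"
+ "    uint lid = get_local_id(0);\n"
+ "    uint sum = 0;\n"
+ "    // consecutive work-items read consecutive pixels (coalesced)\n"
+ "    for (uint j = lid; j < 90000; j += GROUP_SIZE)\n"
+ "        sum += a[img * 90000 + j];\n"
+ "    partial[lid] = sum;\n"
+ "    barrier(CLK_LOCAL_MEM_FENCE);\n"
+ "    // halve the active work-items each step until one sum remains\n"
+ "    for (uint s = GROUP_SIZE / 2; s > 0; s >>= 1) {\n"
+ "        if (lid < s) partial[lid] += partial[lid + s];\n"
+ "        barrier(CLK_LOCAL_MEM_FENCE);\n"
+ "    }\n"
+ "    if (lid == 0) c[img] = partial[0];\n"
+ "}\n";

The strided loop also changes the access pattern: neighboring work-items touch neighboring pixels, which the memory controller can coalesce into wide transactions, whereas the one-work-item-per-image kernel makes each core walk 90000 elements alone.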

2) You are doing only 1 math operation per 2 memory accesses. You need at least 10 math operations per memory access to use even a fraction of your GPU's peak GFLOPS (the 6490M peaks at 250 GFLOPS).

Your i7 CPU can easily reach 100 GFLOPS, but your memory will be the bottleneck. It gets even worse when you send the whole data set over PCI Express. (The HD Graphics 3000 is rated at 125 GFLOPS.)
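One way to attack the copy cost on a CPU device is to let the runtime read the pixel data in place rather than duplicating it. A minimal sketch, assuming the driver treats CL_MEM_USE_HOST_PTR as zero-copy (CPU implementations usually do; a direct buffer is used because a plain Java array may be moved by the GC):

// Hypothetical zero-copy variant of the input buffer
// (needs java.nio.ByteBuffer and java.nio.ByteOrder).
ByteBuffer hostPixels = ByteBuffer
        .allocateDirect(srcArrayA.length * Sizeof.cl_uint)
        .order(ByteOrder.nativeOrder());
hostPixels.asIntBuffer().put(srcArrayA);
memObjects[0] = clCreateBuffer(context,
        CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
        Sizeof.cl_uint * srcArrayA.length,
        Pointer.to(hostPixels), null);

Whether this wins anything is driver-dependent, so it is worth benchmarking against the CL_MEM_COPY_HOST_PTR version.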

// Obtain a device ID
cl_device_id devices[] = new cl_device_id[numDevices];
clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
cl_device_id device = devices[deviceIndex];
// One of the devices[] elements should be your HD3000.
// Example: devices[0] -> GPU, devices[1] -> CPU, devices[2] -> HD3000

In your program:

 // Obtain the cl_device_id for the first device
int numDevices = (int) numBytes[0] / Sizeof.cl_device_id;
cl_device_id devices[] = new cl_device_id[numDevices];
clGetContextInfo(context, CL_CONTEXT_DEVICES, numBytes[0],
Pointer.to(devices), null);

The first device is probably the GPU.
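To verify which index the HD3000 (or any other device) actually gets on your machine, you can print every device's name. A small sketch using standard JOCL calls (the name strings depend on your driver):

// Enumerate all platforms and print each device's name,
// so you can see which index belongs to the HD3000.
int numPlatforms[] = new int[1];
clGetPlatformIDs(0, null, numPlatforms);
cl_platform_id allPlatforms[] = new cl_platform_id[numPlatforms[0]];
clGetPlatformIDs(allPlatforms.length, allPlatforms, null);
for (cl_platform_id platform : allPlatforms) {
    int numDevs[] = new int[1];
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, null, numDevs);
    cl_device_id devs[] = new cl_device_id[numDevs[0]];
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, devs.length, devs, null);
    for (cl_device_id dev : devs) {
        long nameSize[] = new long[1];
        clGetDeviceInfo(dev, CL_DEVICE_NAME, 0, null, nameSize);
        byte buffer[] = new byte[(int) nameSize[0]];
        clGetDeviceInfo(dev, CL_DEVICE_NAME, buffer.length,
                Pointer.to(buffer), null);
        // drop the trailing \0 terminator from the returned C string
        System.out.println(new String(buffer, 0, buffer.length - 1));
    }
}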

Regarding java - speeding up intensity sum calculation with JOCL/OpenCL, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/13543248/
