python - PyOpenCL benchmark question

Reposted by: 行者123. Updated: 2023-11-28 17:52:23

I made some modifications to the standard code in https://github.com/inducer/pyopencl/blob/master/examples/benchmark-all.py, replacing the hard-coded number with a variable, zz:

import pyopencl as cl
import numpy
import numpy.linalg as la
import datetime
from time import time
zz=100
a = numpy.random.rand(zz).astype(numpy.float32)
b = numpy.random.rand(zz).astype(numpy.float32)
c_result = numpy.empty_like(a)

# Speed in normal CPU usage
time1 = time()
for i in range(zz):
        for j in range(zz):
                c_result[i] = a[i] + b[i]
                c_result[i] = c_result[i] * (a[i] + b[i])
                c_result[i] = c_result[i] * (a[i] / 2)
time2 = time()
print("Execution time of test without OpenCL: ", time2 - time1, "s")


for platform in cl.get_platforms():
    for device in platform.get_devices():
        print("===============================================================")
        print("Platform name:", platform.name)
        print("Platform profile:", platform.profile)
        print("Platform vendor:", platform.vendor)
        print("Platform version:", platform.version)
        print("---------------------------------------------------------------")
        print("Device name:", device.name)
        print("Device type:", cl.device_type.to_string(device.type))
        print("Device memory: ", device.global_mem_size//1024//1024, 'MB')
        print("Device max clock speed:", device.max_clock_frequency, 'MHz')
        print("Device compute units:", device.max_compute_units)

        # Simple speed test
        ctx = cl.Context([device])
        queue = cl.CommandQueue(ctx, 
                properties=cl.command_queue_properties.PROFILING_ENABLE)

        mf = cl.mem_flags
        a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
        b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
        dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, b.nbytes)

        prg = cl.Program(ctx, """
            __kernel void sum(__global const float *a,
            __global const float *b, __global float *c)
            {
                        int loop;
                        int gid = get_global_id(0);
                        for(loop=0; loop<%s;loop++)
                        {
                                c[gid] = a[gid] + b[gid];
                                c[gid] = c[gid] * (a[gid] + b[gid]);
                                c[gid] = c[gid] * (a[gid] / 2);
                        }
                }
                """ % (zz)).build()

        exec_evt = prg.sum(queue, a.shape, None, a_buf, b_buf, dest_buf)
        exec_evt.wait()
        elapsed = 1e-9*(exec_evt.profile.end - exec_evt.profile.start)

        print("Execution time of test: %g s" % elapsed)

        c = numpy.empty_like(a)
        # enqueue_read_buffer is gone from current PyOpenCL; enqueue_copy(queue, dest, src) replaces it
        cl.enqueue_copy(queue, c, dest_buf).wait()
        error = 0
        for i in range(zz):
                if c[i] != c_result[i]:
                        error = 1
        if error:
                print("Results don't match!!")
        else:
                print("Results OK")

With zz=100 I get:

('Execution time of test without OpenCL: ', 0.10500001907348633, 's')
===============================================================
('Platform name:', 'AMD Accelerated Parallel Processing')
('Platform profile:', 'FULL_PROFILE')
('Platform vendor:', 'Advanced Micro Devices, Inc.')
('Platform version:', 'OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213)')
---------------------------------------------------------------
('Device name:', 'Cypress\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
('Device type:', 'GPU')
('Device memory: ', 800, 'MB')
('Device max clock speed:', 850, 'MHz')
('Device compute units:', 20)
Execution time of test: 0.00168922 s
Results OK
===============================================================
('Platform name:', 'AMD Accelerated Parallel Processing')
('Platform profile:', 'FULL_PROFILE')
('Platform vendor:', 'Advanced Micro Devices, Inc.')
('Platform version:', 'OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213)')
---------------------------------------------------------------
('Device name:', 'Intel(R) Core(TM) i5 CPU         750  @ 2.67GHz\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
('Device type:', 'CPU')
('Device memory: ', 8183L, 'MB')
('Device max clock speed:', 3000, 'MHz')
('Device compute units:', 4)
Execution time of test: 4.369e-05 s
Results OK

So we have three timings:

normal  ('Execution time of test without OpenCL: ', 0.10500001907348633, 's')
pyopencl radeon 5870: Execution time of test: 0.00168922 s
pyopencl i5 CPU 750: Execution time of test: 4.369e-05 s

First batch of questions: what is "pyopencl i5 CPU 750"? Why is it 250 times faster than "normal" ("Execution time of test without OpenCL")? Why is it 38 times faster than "pyopencl radeon 5870"?

With zz=1000 we have:

normal  ('Execution time of test without OpenCL: ', 9.05299997329712, 's')
pyopencl radeon 5870:Execution time of test: 0.0104431 s
pyopencl i5 CPU 750: Execution time of test: 0.00238112 s

i5*5=radeon5870

i5*3800=normal

With zz=10000:

normal: takes too long, so that part of the code was commented out
radeon 5870: Execution time of test: 0.085571 s
i5: Execution time of test: 0.261854 s

Here we finally see the video card win.

It is still interesting to compare how the timings scale from one stage to the next. normal_stage1*90=normal_stage2 normal_stage2*~95=normal_stage3 (estimated from experience)

i5_stage1*52=i5_stage2 i5_stage2*109=i5_stage3

radeon5870_stage1*6=radeon_stage2 radeon_stage2*8=radeon_stage3

Can someone explain why the growth with OpenCL is not linear?

Best answer

Well, the growth is unlikely to be linear, because the algorithm's complexity is O(zz^2).
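To make the quadratic growth concrete, here is a small sketch in plain Python that only counts loop iterations. It assumes the structure of the benchmark in the question (zz work-items or outer-loop steps, each running an inner loop of zz iterations). The helper name `total_iterations` is made up for this illustration.

```python
def total_iterations(zz):
    """Inner-loop iterations in the benchmark: zz items, each looping zz times."""
    return zz * zz

sizes = [100, 1000, 10000]
work = [total_iterations(zz) for zz in sizes]

# Ratio of work between consecutive problem sizes
growth = [work[i + 1] // work[i] for i in range(len(work) - 1)]

for zz, w in zip(sizes, work):
    print("zz=%d -> %d iterations" % (zz, w))
print("growth between consecutive sizes:", growth)  # [100, 100]
```

Each tenfold increase in zz multiplies the work by 100, which is close to the factors of about 90 and 95 reported for the pure-Python loop in the question.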

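To draw any conclusion about "linearity" you need more than 3 data points (and error bars are very helpful in this kind of analysis), because 100 threads are nowhere near enough to use a GPU's full computational power. As your experiment shows, the GPU only starts to beat the CPU at 10k threads or more, which is quite a normal situation.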
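One way to collect more points with error bars is sketched below. It is an illustration under assumptions, not the original poster's code: it times a pure-Python copy of the benchmark loop, and `cpu_benchmark`, `measure`, the problem sizes and the repeat count are all chosen here for the example. The same `measure` helper could wrap an OpenCL launch instead.

```python
import statistics
import time

def cpu_benchmark(zz):
    """Pure-Python version of the benchmark loop from the question."""
    a = [0.5] * zz
    b = [0.25] * zz
    c = [0.0] * zz
    for i in range(zz):
        for _ in range(zz):
            c[i] = a[i] + b[i]
            c[i] = c[i] * (a[i] + b[i])
            c[i] = c[i] * (a[i] / 2)
    return c

def measure(func, zz, repeats=5):
    """Return (mean, sample standard deviation) of wall-clock time over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(zz)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

results = {}
for zz in (50, 100, 200, 400):
    results[zz] = measure(cpu_benchmark, zz)
    print("zz=%4d  mean=%.6f s  stdev=%.6f s" % ((zz,) + results[zz]))
```

Plotting the means against zz on a log-log scale, with the standard deviations as error bars, makes it much easier to judge whether the growth is linear, quadratic, or dominated by fixed overhead at small sizes.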

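A 250x speedup on the CPU alone is not impossible either. Python is an interpreted language, so it is not very fast by itself, and OpenCL makes heavy use of the CPU's SIMD instructions, which gives quite a good speedup even compared with C+OpenMP.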

This post on the PyOpenCL benchmark question is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/7376616/
