parallel-processing - Why does the same OpenCL code produce different output on an Intel Xeon CPU and an NVIDIA GTX 1080 Ti GPU?

Reposted by: 行者123 · Updated: 2023-12-04 08:14:36

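I am trying to parallelize a Monte Carlo simulation with OpenCL, using MWC64X as the uniform random number generator. The code runs well on several Intel CPUs: the output of the parallel computation is very close to that of the sequential computation.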

Using OpenCL device: Intel(R) Xeon(R) CPU E5-2630L v3 @ 1.80GHz
Literal influence running time: 0.029048 seconds        r1 seqInfl= 0.4771
Literal influence running time: 0.029762 seconds        r2 seqInfl= 0.4771
Literal influence running time: 0.029742 seconds        r3 seqInfl= 0.4771
Literal influence running time: 0.02971 seconds         ra seqInfl= 0.4771
Literal influence running time: 0.029225 seconds        trust1-57 seqInfl= 0.6001
Literal influence running time: 0.04992 seconds         trust110-1 seqInfl= 0
Literal influence running time: 0.034636 seconds        trust4-57 seqInfl= 0
Literal influence running time: 0.049079 seconds        trust57-110 seqInfl= 0
Literal influence running time: 0.024442 seconds        trust57-4 seqInfl= 0.8026
Literal influence running time: 0.04946 seconds         trust33-1 seqInfl= 0
Literal influence running time: 0.049071 seconds        trust57-33 seqInfl= 0
Literal influence running time: 0.053117 seconds        trust4-1 seqInfl= 0.1208
Literal influence running time: 0.051642 seconds        trust57-1 seqInfl= 0
Literal influence running time: 0.052052 seconds        trust57-64 seqInfl= 0
Literal influence running time: 0.052118 seconds        trust64-1 seqInfl= 0
Literal influence running time: 0.051998 seconds        trust57-7 seqInfl= 0
Literal influence running time: 0.052069 seconds        trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.71728 seconds
Sequential maxInfluence Literal: trust57-4 0.8026

index1= 17 size= 51 dim1_size= 6
sum0:4781   influence0:0.478100 sum2:4781   influence2:0.478100 sum6:0  influence6:0.000000 sum10:0 sum12:0 influence12:0.000000    sum7:0  influence7:0.000000 influence10:0.000000    sum4:5962   influence4:0.596200 sum8:7971   influence8:0.797100 sum1:4781   influence1:0.478100 sum3:4781   influence3:0.478100 sum13:0 influence13:0.000000    sum11:1261  influence11:0.126100    sum9:0  influence9:0.000000 sum14:0 influence14:0.000000    sum5:0  influence5:0.000000 sum15:0 influence15:0.000000    sum16:0 influence16:0.000000    
Parallel influence running time: 0.054391 seconds
Parallel maxInfluence Literal: trust57-4 Infl=0.7971

However, when I run the code on a GeForce GTX 1080 Ti with NVIDIA-SMI 430.40, CUDA 10.1, and OpenCL 1.2 CUDA installed, the output is as follows:

Using OpenCL device: GeForce GTX 1080 Ti
Influence:
Literal influence running time: 0.011119 seconds        r1 seqInfl= 0.4771
Literal influence running time: 0.011238 seconds        r2 seqInfl= 0.4771
Literal influence running time: 0.011408 seconds        r3 seqInfl= 0.4771
Literal influence running time: 0.01109 seconds         ra seqInfl= 0.4771
Literal influence running time: 0.011132 seconds        trust1-57 seqInfl= 0.6001
Literal influence running time: 0.018978 seconds        trust110-1 seqInfl= 0
Literal influence running time: 0.013093 seconds        trust4-57 seqInfl= 0
Literal influence running time: 0.018968 seconds        trust57-110 seqInfl= 0
Literal influence running time: 0.009105 seconds        trust57-4 seqInfl= 0.8026
Literal influence running time: 0.018753 seconds        trust33-1 seqInfl= 0
Literal influence running time: 0.018583 seconds        trust57-33 seqInfl= 0
Literal influence running time: 0.02005 seconds         trust4-1 seqInfl= 0.1208
Literal influence running time: 0.01957 seconds         trust57-1 seqInfl= 0
Literal influence running time: 0.019686 seconds        trust57-64 seqInfl= 0
Literal influence running time: 0.019632 seconds        trust64-1 seqInfl= 0
Literal influence running time: 0.019687 seconds        trust57-7 seqInfl= 0
Literal influence running time: 0.019859 seconds        trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.272032 seconds
Sequential maxInfluence Literal: trust57-4 0.8026

index1= 17 size= 51 dim1_size= 6
sum0:10000  sum1:10000  sum2:10000  sum3:10000  sum4:10000  sum5:0  sum6:0  sum7:0  sum8:10000  sum9:0  sum10:0 sum11:0 sum12:0 sum13:0 sum14:0 sum15:0 sum16:0 
Parallel influence running time: 0.193581 seconds

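The "influence" value equals sum*1.0/10000, so in the GPU run the parallel influence values consist only of 1s and 0s. That is incorrect, and it does not happen when the code is parallelized on the Intel CPU.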

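When I check the output of the random number generator with if(flag==0) printf("randint=%u",randint);, the output on the GPU appears to be all zeros. Below are the clinfo output and the .cl code: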

 Device Name                                     GeForce GTX 1080 Ti
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 1.2 CUDA
  Driver Version                                  430.40
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Topology (NV)                            PCI-E, 68:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               28
  Max clock frequency                             1721MHz
  Compute Capability (NV)                         6.1
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x64
  Max work group size                             1024
  Preferred work group size multiple              32
  Warp size (NV)                                  32
  Preferred / native vector sizes                 
    char                                                 1 / 1       
    short                                                1 / 1       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 0 / 0        (n/a)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              11720130560 (10.92GiB)
  Error Correction support                        No
  Max memory allocation                           2930032640 (2.729GiB)
  Unified memory for Host and Device              No
  Integrated memory (NV)                          No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       4096 bits (512 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        458752 (448KiB)
  Global Memory cache line size                   128 bytes
  Image support                                   Yes
    Max number of samplers per kernel             32
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             16384x32768 pixels
    Max 3D image size                             16384x16384x16384 pixels
    Max number of read image args                 256
    Max number of write image args                16
  Local memory type                               Local
  Local memory size                               49152 (48KiB)
  Registers per block (NV)                        65536
  Max number of constant args                     9
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     4352 (4.25KiB)
  Queue properties                                
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Kernel execution timeout (NV)                 Yes
  Concurrent copy and kernel execution (NV)       Yes
    Number of async copy engines                  2
  printf() buffer size                            1048576 (1024KiB)

#define N 70 // N > index, which is the total number of literals
#define BASE 4294967296UL

//! Represents the state of a particular generator
typedef struct{ uint x; uint c; } mwc64x_state_t;
enum{ MWC64X_A = 4294883355U };
enum{ MWC64X_M = 18446383549859758079UL };

void MWC64X_Step(mwc64x_state_t *s)
{
    uint X=s->x, C=s->c;

    uint Xn=MWC64X_A*X+C;
    uint carry=(uint)(Xn<C);                // The (Xn<C) will be zero or one for scalar
    uint Cn=mad_hi(MWC64X_A,X,carry);  

    s->x=Xn;
    s->c=Cn;
}

//! Return a 32-bit integer in the range [0..2^32)
uint MWC64X_NextUint(mwc64x_state_t *s)
{
    uint res=s->x ^ s->c;
    MWC64X_Step(s);
    return res;
}


__kernel void setInfluence(const int literals, const int size, const int dim1_size, __global int* lambdas, __global float* lambdap, __global int* dim2_size, __global float* influence){   
    int flag=get_global_id(0);
    int sum=0;
    int count=10000;
    int assignment[N];
    //or try to get newlambda like original version does
    if(flag < literals){
        mwc64x_state_t rng;
        for(int i=0; i<count; i++){
            for(int j=0; j<size; j++){
                uint randint=MWC64X_NextUint(&rng);
                float rand=randint*1.0/BASE;
                //if(flag==0)
                //  printf("randint=%u",randint);
                if(lambdap[j]<rand)
                    assignment[lambdas[j]]=0;
                else
                    assignment[lambdas[j]]=1;               
            }
            //the true case
            assignment[flag]=1;
            int valuet=0;
            int index=0;
            for(int m=0; m<dim1_size; m++){
                int valueMono=1;
                for(int n=0; n<dim2_size[m]; n++){
                    if(assignment[lambdas[index+n]]==0){
                        valueMono=0;
                        index+=dim2_size[m];
                        break;
                    }
                }
                if(valueMono==1){
                    valuet=1;
                    break;
                }
            }        
            //the false case
            assignment[flag]=0;
            int valuef=0;
            index=0;
            for(int m=0; m<dim1_size; m++){
                int valueMono=1;
                for(int n=0; n<dim2_size[m]; n++){
                    if(assignment[lambdas[index+n]]==0){
                        valueMono=0;
                        index+=dim2_size[m];
                        break;
                    }
                }
                if(valueMono==1){
                    valuef=1;
                    break;
                }
            }
            sum += valuet-valuef;            
        }
        influence[flag] = 1.0*sum/count;
        printf("sum%d:%d\t", flag, sum);
    }
}

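What could go wrong when running the code on the GPU? Is it MWC64X? According to its author, it should perform well on NVIDIA GPUs. If it is the cause, how can I fix it? If not, what else could the problem be?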

Best answer

(This started as a comment, but it turned out to be the root of the problem, so I turned it into an answer.)

You never initialize your mwc64x_state_t rng; variable before reading from it, so any result is undefined:

    mwc64x_state_t rng;
    for(int i=0; i<count; i++){
        for(int j=0; j<size; j++){
            uint randint=MWC64X_NextUint(&rng);

MWC64X_NextUint() reads from the rng state immediately, before updating it:

uint MWC64X_NextUint(mwc64x_state_t *s)
{
    uint res=s->x ^ s->c;
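
To make the consequence concrete, here is a minimal host-side sketch in plain C, not the asker's kernel. It ports the generator from the question, emulating OpenCL's mad_hi with a 64-bit multiply. The helper count_nonzero_outputs is a hypothetical name introduced only for this illustration. The sketch assumes the uninitialized private memory on the GPU happened to contain zeros, which is consistent with the all-zero randint values reported in the question but cannot be confirmed from the source.

```c
#include <stdint.h>

/* Host-side C port of the generator from the question (illustration only). */
typedef struct { uint32_t x; uint32_t c; } mwc64x_state_t;

#define MWC64X_A 4294883355U

/* Emulates OpenCL's mad_hi(a, b, c): high 32 bits of a*b, plus c. */
static uint32_t mad_hi_u32(uint32_t a, uint32_t b, uint32_t c)
{
    return (uint32_t)(((uint64_t)a * b) >> 32) + c;
}

static void MWC64X_Step(mwc64x_state_t *s)
{
    uint32_t X = s->x, C = s->c;
    uint32_t Xn = MWC64X_A * X + C;          /* low 32 bits of A*X + C */
    uint32_t carry = (uint32_t)(Xn < C);     /* 1 if the addition wrapped */
    uint32_t Cn = mad_hi_u32(MWC64X_A, X, carry);
    s->x = Xn;
    s->c = Cn;
}

/* Returns the current output, then advances the state. */
static uint32_t MWC64X_NextUint(mwc64x_state_t *s)
{
    uint32_t res = s->x ^ s->c;
    MWC64X_Step(s);
    return res;
}

/* Hypothetical helper: counts how many of the first n draws are nonzero.
 * The state is passed by value, so the caller's copy is not advanced. */
static int count_nonzero_outputs(mwc64x_state_t rng, int n)
{
    int nonzero = 0;
    for (int i = 0; i < n; i++) {
        if (MWC64X_NextUint(&rng) != 0)
            nonzero++;
    }
    return nonzero;
}
```

With the state {0, 0}, the step computes Xn = 0 and Cn = 0, so the generator never leaves that state and count_nonzero_outputs returns 0 for any n. In the kernel this would make rand equal 0.0 on every draw, so lambdap[j]<rand is false for any non-negative lambdap[j] and every assignment becomes 1, which would explain sums of exactly 0 or 10000.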

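Note that you will probably want a differently seeded RNG for each work item; otherwise you will get nasty correlation artifacts in your results.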
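
One way to give each work item its own starting state is sketched below in plain C. This is an assumption-laden illustration, not the answer author's method. The names mix64 and seed_for_work_item and the base_seed parameter are hypothetical. The mixing function is the SplitMix64 finalizer, swapped in here as a generic hash. Hashing the work-item index spreads the starting states apart, but it does not guarantee that the resulting streams never overlap. If the MWC64X distribution you use provides its own stream-seeding routine, that is likely the better choice.

```c
#include <stdint.h>

/* Same state layout as in the question. */
typedef struct { uint32_t x; uint32_t c; } mwc64x_state_t;

#define MWC64X_A 4294883355U

/* SplitMix64 finalizer, used here only as a general-purpose 64-bit hash. */
static uint64_t mix64(uint64_t z)
{
    z += 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Hypothetical helper: derives an initial state from a run-wide seed and
 * the work-item index (get_global_id(0) in the kernel). */
static mwc64x_state_t seed_for_work_item(uint64_t base_seed, uint32_t work_item_id)
{
    uint64_t h = mix64(base_seed + work_item_id);
    mwc64x_state_t s;
    s.x = (uint32_t)h;
    /* Keep the carry in [1, A-2]. This avoids the stuck state x=0, c=0 and
     * the other degenerate state x=0xFFFFFFFF, c=A-1. */
    s.c = 1u + ((uint32_t)(h >> 32) % (MWC64X_A - 2u));
    return s;
}
```

In the kernel from the question, the equivalent call would replace the bare declaration mwc64x_state_t rng; before the loops, with the base seed passed in as an extra kernel argument and uint/ulong used in place of uint32_t/uint64_t.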

This article on "parallel-processing - Why does the same OpenCL code produce different output on an Intel Xeon CPU and an NVIDIA GTX 1080 Ti GPU?" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/57665817/
