
CUDA structure alignment is slowing down my code (compilable example)


I have a simulation that calculates the 3D vectors of charged particles moving in an electric and magnetic field. I tried to speed it up in CUDA using the __align__ specifier, thinking that the limiting factor might be the global memory reads and writes, but using __align__ ended up slowing things down (probably because it increases the total memory requirement). I also tried using float3 and float4, but their performance was similar.

I have created a simplified version of this code and pasted it below to show my problem. The code below should be compilable, and by changing the CASE define on line 4 to 0, 1, or 2, the different options described above can be tried. Two functions, ParticleMoverCPU and ParticleMoverGPU, are defined to compare CPU vs. GPU performance.

  1. Is there a reason why my attempt at memory coalescing is slowing my code down rather than speeding it up?
  2. Is there anything else obvious that I'm not doing which would get me a better speedup than ~60x for "embarrassingly parallel" code like this?

Thanks!

CPU - Intel Xeon E5620 @2.40GHz

GPU - NVIDIA Tesla C2070

// CASE 0: Regular struct with 3 floats
// CASE 1: Aligned struct using __align__(16) with 3 floats
// CASE 2: float3
#define CASE 0 // define to either 0, 1 or 2 as described above

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <Windows.h>

#include <stdio.h>
#include <math.h>
#include <time.h>
#include <malloc.h>
#include <sys/stat.h>

#define CEX 10 // x-value of electric field (dimensionless and arbitrary)
#define CEY 0.1 // y-value of electric field (dimensionless and arbitrary)
#define CEZ 0.1 // z-value of electric field (dimensionless and arbitrary)
#define CBX 0.1 // x-value of magnetic field (dimensionless and arbitrary)
#define CBY 0.1 // y-value of magnetic field (dimensionless and arbitrary)
#define CBZ 10 // z-value of magnetic field (dimensionless and arbitrary)

#define FACTOR 15 // I played around with these numbers until I got the best speedup
#define THREADS 256 // I played around with these numbers until I got the best speedup

typedef struct{
    float x;
    float y;
    float z;
} VecCPU; //Struct for vectors for CPU calculation

// Fastest method seems to be a regular unaligned struct with 3 floats
#if CASE==0
typedef struct {
    float x;
    float y;
    float z;
} VecGPU;
#endif

#if CASE==1
// This method seems to be less fast. It is an attempt to align for memory coalescence
typedef struct __align__(16){
    float x;
    float y;
    float z;
} VecGPU;
#endif

// Using float3 seems to be about the same as defining our own vector3 structure
#if CASE==2
typedef float3 VecGPU;
#endif

VecCPU *pos_c, *vel_c; // global position and velocity vectors for CPU calculation
__constant__ VecGPU *pos_d, *vel_d; // pointers in constant memory which we will point to data in global memory

void ParticleMoverCPU(int np, int ts, float dt){

    int n = 0;
    while (n < np){

        VecCPU vminus, tvec, vprime, vplus;
        float tvec_fact;
        int it = 0;
        while (it < ts){
            // ----- Update velocities by the Boris method ------ //
            vminus.x = vel_c[n].x + CEX*0.5*dt;
            vminus.y = vel_c[n].y + CEY*0.5*dt;
            vminus.z = vel_c[n].z + CEZ*0.5*dt;
            tvec.x = CBX*0.5*dt;
            tvec.y = CBY*0.5*dt;
            tvec.z = CBZ*0.5*dt;
            tvec_fact = 2 / (1 + tvec.x*tvec.x + tvec.y*tvec.y + tvec.z*tvec.z);
            vprime.x = vminus.x + vminus.y*tvec.z - vminus.z*tvec.y;
            vprime.y = vminus.y + vminus.z*tvec.x - vminus.x*tvec.z;
            vprime.z = vminus.z + vminus.x*tvec.y - vminus.y*tvec.x;
            vplus.x = vminus.x + (vprime.y*tvec.z - vprime.z*tvec.y)*tvec_fact;
            vplus.y = vminus.y + (vprime.z*tvec.x - vprime.x*tvec.z)*tvec_fact;
            vplus.z = vminus.z + (vprime.x*tvec.y - vprime.y*tvec.x)*tvec_fact;
            vel_c[n].x = vplus.x + CEX*0.5*dt;
            vel_c[n].y = vplus.y + CEY*0.5*dt;
            vel_c[n].z = vplus.z + CEZ*0.5*dt;

            // ------ Update Particle positions -------------- //
            pos_c[n].x += vel_c[n].x*dt;
            pos_c[n].y += vel_c[n].y*dt;
            pos_c[n].z += vel_c[n].z*dt;
            it++;
        }
        n++;
    }
}

__global__ void ParticleMoverGPU(register int np, register int ts, register float dt){

    register int n = threadIdx.x + blockDim.x * blockIdx.x;
    while (n < np){

        register VecGPU vminus, tvec, vprime, vplus;// , vtemp;
        register float tvec_fact;
        register int it = 0;
        while (it < ts){
            // ----- Update velocities by the Boris method ------ //
            vminus.x = vel_d[n].x + CEX*0.5*dt;
            vminus.y = vel_d[n].y + CEY*0.5*dt;
            vminus.z = vel_d[n].z + CEZ*0.5*dt;
            tvec.x = CBX*0.5*dt;
            tvec.y = CBY*0.5*dt;
            tvec.z = CBZ*0.5*dt;
            tvec_fact = 2 / (1 + tvec.x*tvec.x + tvec.y*tvec.y + tvec.z*tvec.z);
            vprime.x = vminus.x + vminus.y*tvec.z - vminus.z*tvec.y;
            vprime.y = vminus.y + vminus.z*tvec.x - vminus.x*tvec.z;
            vprime.z = vminus.z + vminus.x*tvec.y - vminus.y*tvec.x;
            vplus.x = vminus.x + (vprime.y*tvec.z - vprime.z*tvec.y)*tvec_fact;
            vplus.y = vminus.y + (vprime.z*tvec.x - vprime.x*tvec.z)*tvec_fact;
            vplus.z = vminus.z + (vprime.x*tvec.y - vprime.y*tvec.x)*tvec_fact;
            vel_d[n].x = vplus.x + CEX*0.5*dt;
            vel_d[n].y = vplus.y + CEY*0.5*dt;
            vel_d[n].z = vplus.z + CEZ*0.5*dt;
            // ------ Update Particle positions -------------- //
            pos_d[n].x += vel_d[n].x*dt;
            pos_d[n].y += vel_d[n].y*dt;
            pos_d[n].z += vel_d[n].z*dt;
            it++;
        }
        n += blockDim.x*gridDim.x;
    }
}

int main(void){

    int np = 50000; // Number of Particles
    const int ts = 1000; // Number of Time-steps
    const float dt = 1E-3; // Time-step value

    // ----------- CPU ----------- //

    pos_c = (VecCPU*)malloc(sizeof(VecCPU)*np); // allocate memory for position
    vel_c = (VecCPU*)malloc(sizeof(VecCPU)*np); // allocate memory for velocity

    for (int n = 0; n < np; n++){
        pos_c[n].x = 0; pos_c[n].y = 0; pos_c[n].z = 0; // zero out position for CPU variables
        vel_c[n].x = 0; vel_c[n].y = 0; vel_c[n].z = 0; // zero out velocity for CPU variables
    }

    printf("Starting CPU kernel\n");
    clock_t startCPU;
    float CPUtime;
    startCPU = clock();
    ParticleMoverCPU(np, ts, dt); // Launch CPU kernel
    CPUtime = ((float)(clock() - startCPU)) / CLOCKS_PER_SEC;
    printf("CPU kernel finished\n");
    // Output final CPU computation time
    printf("CPUtime = %6.1f ms\n", ((float)CPUtime)*1E3);

    // ------------ GPU ----------- //

    cudaFuncSetCacheConfig(ParticleMoverGPU, cudaFuncCachePreferL1); //Set memory preference to L1 (doesn't have much effect)
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, 0);
    int blocks = deviceProp.multiProcessorCount;

    VecGPU *pos_g, *vel_g, *pos_l, *vel_l;

    pos_g = (VecGPU*)malloc(sizeof(VecGPU)*np); // allocate memory for positions on the CPU
    vel_g = (VecGPU*)malloc(sizeof(VecGPU)*np); // allocate memory for velocities on the CPU

    cudaMalloc((void**)&pos_l, sizeof(VecGPU)*np); // allocate memory for positions on the GPU
    cudaMalloc((void**)&vel_l, sizeof(VecGPU)*np); // allocate memory for velocities on the GPU

    cudaMemcpyToSymbol(pos_d, &pos_l, sizeof(void*)); // copy memory address of position to the constant memory pointer pos_d
    cudaMemcpyToSymbol(vel_d, &vel_l, sizeof(void*)); // copy memory address of velocity to the constant memory pointer vel_d

    for (int n = 0; n < np; n++){
        pos_g[n].x = 0; pos_g[n].y = 0; pos_g[n].z = 0; // zero out position for GPU variables (before copying to GPU)
        vel_g[n].x = 0; vel_g[n].y = 0; vel_g[n].z = 0; // zero out velocity for GPU variables (before copying to GPU)
    }

    cudaMemcpy(pos_l, pos_g, sizeof(VecGPU)*np, cudaMemcpyHostToDevice); // Copy positions to GPU global memory
    cudaMemcpy(vel_l, vel_g, sizeof(VecGPU)*np, cudaMemcpyHostToDevice); // Copy velocities to GPU global memory

    printf("Starting GPU kernel\n");
    // start cuda timer
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    ParticleMoverGPU <<<blocks*FACTOR, THREADS >>>(np, ts, dt); // Launch GPU kernel

    //stop cuda timer
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    printf("GPU kernel finished\n");

    cudaMemcpy(pos_g, pos_l, sizeof(VecGPU)*np, cudaMemcpyDeviceToHost); // Copy positions from GPU memory back to CPU
    cudaMemcpy(vel_g, vel_l, sizeof(VecGPU)*np, cudaMemcpyDeviceToHost); // Copy velocities from GPU memory back to CPU

    // Output GPU computation time
    printf("GPUtime = %6.1f ms\n", elapsedTime);

    // Output speedup factor
    printf("CASE=%i, Speedup = %4.2f\n", CASE, CPUtime*1E3 / elapsedTime);

    // free allocated memory
    cudaFree(pos_l);
    cudaFree(vel_l);
    free(pos_g);
    free(vel_g);
    free(pos_c);
    free(vel_c);
}

For CASE 0 (regular vector struct) I get:

CPUtime = 1302.0 ms
GPUtime = 21.8 ms
Speedup = 59.79

For CASE 1 (the __align__(16) vector struct) I get:

CPUtime = 1298.0 ms
GPUtime = 24.5 ms
Speedup = 53.08

For CASE 2 (using float3) I get:

CPUtime = 1305.0 ms
GPUtime = 21.8 ms
Speedup = 59.80

If I use float4 instead of float3, I get results similar to the __align__(16) method.

Thanks!!

Best Answer

  1. The pointers in __constant__ memory are a waste of your time. I'm not sure why you're jumping through all those hoops.
  2. Sprinkling register everywhere is a waste of your time. You are not smarter than the compiler about when to use registers.
  3. As a matter of course, you should be using proper cuda error checking. It's just a boilerplate statement I make; I don't think there were any API-level errors in this code. (A minimal sketch of such a macro appears after this list.)
  4. It's not clear that you know what "coalescing" means. Data alignment only tangentially affects the ability of memory transactions to coalesce. What matters more is the actual addresses generated by adjacent threads in a warp for a given memory transaction: do they refer to adjacent memory locations? If so, things are probably coalescing nicely. If not, probably not. So you have a data structure that "naturally" occupies 12 bytes, and in one case (the slower one) you are telling it to occupy 16 bytes. What does that actually do? To answer that, we have to look at a given transaction:

        vminus.x = vel_d[n].x + CEX*0.5*dt;

    The above transaction requests the x-component of the vel_d vector. In the "unaligned" case, that data is stored as follows, and the above transaction will "ask for" the starred quantities (32 per warp):

    mem idx: 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 ...
    vel_d:   x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3 x4 y4 z4 x5 y5 z5 ...
             *        *        *        *        *        *        ...

    In the "aligned" case, the pattern above becomes:

    mem idx: 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 ...
    vel_d:   x0 y0 z0 ?? x1 y1 z1 ?? x2 y2 z2 ?? x3 y3 z3 ?? x4 y4 ...
             *           *           *           *           *     ...

    So we see that when you specify the align directive, the packing density is lower, and a given 128-byte cache line delivers fewer of the items needed by a given transaction. More cache lines must therefore be fetched from global memory to satisfy this one read request in the aligned case. That is probably the cause of the ~10-20% difference you're seeing.

  5. But we can do better than the above. You have a classic AoS (Array of Structures) data storage scheme, which is canonically bad for GPU programming. A standard performance enhancement is to convert from AoS to SoA (Structure of Arrays) storage. This means breaking the x, y, and z components of your pos and vel vectors out into separate arrays, one per component, and then accessing those. (Alternatively, since you process all components of a vector in a single thread, you could attempt to perform vector loads; that is a separate discussion, but see the sketch after the timing results below.) The desired storage and load pattern then becomes:

    mem idx:  0  1  2  3  4  5  6  7  8  9 ...
    vel_d_x: x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 ...
              *  *  *  *  *  *  *  *  *  * ...

    The code might then look like this:

        vminus.x = vel_d_x[n] + CEX*0.5*dt;
        vminus.y = vel_d_y[n] + CEY*0.5*dt;
        vminus.z = vel_d_z[n] + CEZ*0.5*dt;
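
As an aside on item 3 above: a minimal sketch of the kind of boilerplate error-checking macro referred to there (a common CUDA convention, not code taken from the question) might look like this:

    // requires <stdio.h> and <stdlib.h>
    #define cudaCheckErrors(msg) \
        do { \
            cudaError_t __err = cudaGetLastError(); \
            if (__err != cudaSuccess) { \
                fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                    msg, cudaGetErrorString(__err), __FILE__, __LINE__); \
                exit(1); \
            } \
        } while (0)

    // usage, e.g. immediately after the kernel launch in main():
    ParticleMoverGPU<<<blocks*FACTOR, THREADS>>>(np, ts, dt);
    cudaCheckErrors("kernel launch failed");

Note that catching asynchronous kernel execution errors this way also requires a subsequent synchronizing call before the check (a cudaMemcpy back to the host, or cudaDeviceSynchronize, serves that purpose).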

The following code implements some of the above, including the AoS -> SoA conversion on the GPU side, and should be faster than any of your cases.

$ cat t895.cu
// CASE 0: Regular struct with 3 floats
// CASE 1: Aligned struct using __align__(16) with 3 floats
// CASE 2: float3
#define CASE 0 // define to either 0, 1 or 2 as described above

#include <stdio.h>
#include <math.h>
#include <time.h>
#include <malloc.h>
#include <sys/stat.h>

#define CEX 10 // x-value of electric field (dimensionless and arbitrary)
#define CEY 0.1 // y-value of electric field (dimensionless and arbitrary)
#define CEZ 0.1 // z-value of electric field (dimensionless and arbitrary)
#define CBX 0.1 // x-value of magnetic field (dimensionless and arbitrary)
#define CBY 0.1 // y-value of magnetic field (dimensionless and arbitrary)
#define CBZ 10 // z-value of magnetic field (dimensionless and arbitrary)

#define FACTOR 15 // I played around with these numbers until I got the best speedup
#define THREADS 256 // I played around with these numbers until I got the best speedup

typedef struct{
    float x;
    float y;
    float z;
} VecCPU; //Struct for vectors for CPU calculation

// Fastest method seems to be a regular unaligned struct with 3 floats
#if CASE==0
typedef struct {
    float x;
    float y;
    float z;
} VecGPU;
#endif

#if CASE==1
// This method seems to be less fast. It is an attempt to align for memory coalescence
typedef struct __align__(16){
    float x;
    float y;
    float z;
} VecGPU;
#endif

// Using float3 seems to be about the same as defining our own vector3 structure
#if CASE==2
typedef float3 VecGPU;
#endif

VecCPU *pos_c, *vel_c; // global position and velocity vectors for CPU calculation

void ParticleMoverCPU(int np, int ts, float dt){

    int n = 0;
    while (n < np){

        VecCPU vminus, tvec, vprime, vplus;
        float tvec_fact;
        int it = 0;
        while (it < ts){
            // ----- Update velocities by the Boris method ------ //
            vminus.x = vel_c[n].x + CEX*0.5*dt;
            vminus.y = vel_c[n].y + CEY*0.5*dt;
            vminus.z = vel_c[n].z + CEZ*0.5*dt;
            tvec.x = CBX*0.5*dt;
            tvec.y = CBY*0.5*dt;
            tvec.z = CBZ*0.5*dt;
            tvec_fact = 2 / (1 + tvec.x*tvec.x + tvec.y*tvec.y + tvec.z*tvec.z);
            vprime.x = vminus.x + vminus.y*tvec.z - vminus.z*tvec.y;
            vprime.y = vminus.y + vminus.z*tvec.x - vminus.x*tvec.z;
            vprime.z = vminus.z + vminus.x*tvec.y - vminus.y*tvec.x;
            vplus.x = vminus.x + (vprime.y*tvec.z - vprime.z*tvec.y)*tvec_fact;
            vplus.y = vminus.y + (vprime.z*tvec.x - vprime.x*tvec.z)*tvec_fact;
            vplus.z = vminus.z + (vprime.x*tvec.y - vprime.y*tvec.x)*tvec_fact;
            vel_c[n].x = vplus.x + CEX*0.5*dt;
            vel_c[n].y = vplus.y + CEY*0.5*dt;
            vel_c[n].z = vplus.z + CEZ*0.5*dt;

            // ------ Update Particle positions -------------- //
            pos_c[n].x += vel_c[n].x*dt;
            pos_c[n].y += vel_c[n].y*dt;
            pos_c[n].z += vel_c[n].z*dt;
            it++;
        }
        n++;
    }
}

__global__ void ParticleMoverGPU(float *vel_d_x, float *vel_d_y, float *vel_d_z, float *pos_d_x, float *pos_d_y, float *pos_d_z, int np, int ts, float dt){

    int n = threadIdx.x + blockDim.x * blockIdx.x;
    while (n < np){

        VecGPU vminus, tvec, vprime, vplus;// , vtemp;
        register float tvec_fact;
        register int it = 0;
        while (it < ts){
            // ----- Update velocities by the Boris method ------ //
            vminus.x = vel_d_x[n] + CEX*0.5*dt;
            vminus.y = vel_d_y[n] + CEY*0.5*dt;
            vminus.z = vel_d_z[n] + CEZ*0.5*dt;
            tvec.x = CBX*0.5*dt;
            tvec.y = CBY*0.5*dt;
            tvec.z = CBZ*0.5*dt;
            tvec_fact = 2 / (1 + tvec.x*tvec.x + tvec.y*tvec.y + tvec.z*tvec.z);
            vprime.x = vminus.x + vminus.y*tvec.z - vminus.z*tvec.y;
            vprime.y = vminus.y + vminus.z*tvec.x - vminus.x*tvec.z;
            vprime.z = vminus.z + vminus.x*tvec.y - vminus.y*tvec.x;
            vplus.x = vminus.x + (vprime.y*tvec.z - vprime.z*tvec.y)*tvec_fact;
            vplus.y = vminus.y + (vprime.z*tvec.x - vprime.x*tvec.z)*tvec_fact;
            vplus.z = vminus.z + (vprime.x*tvec.y - vprime.y*tvec.x)*tvec_fact;
            vel_d_x[n] = vplus.x + CEX*0.5*dt;
            vel_d_y[n] = vplus.y + CEY*0.5*dt;
            vel_d_z[n] = vplus.z + CEZ*0.5*dt;
            // ------ Update Particle positions -------------- //
            pos_d_x[n] += vel_d_x[n]*dt;
            pos_d_y[n] += vel_d_y[n]*dt;
            pos_d_z[n] += vel_d_z[n]*dt;
            it++;
        }
        n += blockDim.x*gridDim.x;
    }
}

int main(void){

    int np = 50000; // Number of Particles
    const int ts = 1000; // Number of Time-steps
    const float dt = 1E-3; // Time-step value

    // ----------- CPU ----------- //

    pos_c = (VecCPU*)malloc(sizeof(VecCPU)*np); // allocate memory for position
    vel_c = (VecCPU*)malloc(sizeof(VecCPU)*np); // allocate memory for velocity

    for (int n = 0; n < np; n++){
        pos_c[n].x = 0; pos_c[n].y = 0; pos_c[n].z = 0; // zero out position for CPU variables
        vel_c[n].x = 0; vel_c[n].y = 0; vel_c[n].z = 0; // zero out velocity for CPU variables
    }

    printf("Starting CPU kernel\n");
    clock_t startCPU;
    float CPUtime;
    startCPU = clock();
    ParticleMoverCPU(np, ts, dt); // Launch CPU kernel
    CPUtime = ((float)(clock() - startCPU)) / CLOCKS_PER_SEC;
    printf("CPU kernel finished\n");
    // Output final CPU computation time
    printf("CPUtime = %6.1f ms\n", ((float)CPUtime)*1E3);

    // ------------ GPU ----------- //

    cudaFuncSetCacheConfig(ParticleMoverGPU, cudaFuncCachePreferL1); //Set memory preference to L1 (doesn't have much effect)
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, 0);
    int blocks = deviceProp.multiProcessorCount;

    float *pos_g_x, *pos_g_y, *pos_g_z, *vel_g_x, *vel_g_y, *vel_g_z, *pos_l_x, *pos_l_y, *pos_l_z, *vel_l_x, *vel_l_y, *vel_l_z;

    pos_g_x = (float*)malloc(sizeof(float)*np); // allocate memory for positions on the CPU
    vel_g_x = (float*)malloc(sizeof(float)*np); // allocate memory for velocities on the CPU
    pos_g_y = (float*)malloc(sizeof(float)*np); // allocate memory for positions on the CPU
    vel_g_y = (float*)malloc(sizeof(float)*np); // allocate memory for velocities on the CPU
    pos_g_z = (float*)malloc(sizeof(float)*np); // allocate memory for positions on the CPU
    vel_g_z = (float*)malloc(sizeof(float)*np); // allocate memory for velocities on the CPU

    cudaMalloc((void**)&pos_l_x, sizeof(float)*np); // allocate memory for positions on the GPU
    cudaMalloc((void**)&vel_l_x, sizeof(float)*np); // allocate memory for velocities on the GPU
    cudaMalloc((void**)&pos_l_y, sizeof(float)*np); // allocate memory for positions on the GPU
    cudaMalloc((void**)&vel_l_y, sizeof(float)*np); // allocate memory for velocities on the GPU
    cudaMalloc((void**)&pos_l_z, sizeof(float)*np); // allocate memory for positions on the GPU
    cudaMalloc((void**)&vel_l_z, sizeof(float)*np); // allocate memory for velocities on the GPU

    for (int n = 0; n < np; n++){
        pos_g_x[n] = 0; pos_g_y[n] = 0; pos_g_z[n] = 0; // zero out position for GPU variables (before copying to GPU)
        vel_g_x[n] = 0; vel_g_y[n] = 0; vel_g_z[n] = 0; // zero out velocity for GPU variables (before copying to GPU)
    }

    cudaMemcpy(pos_l_x, pos_g_x, sizeof(float)*np, cudaMemcpyHostToDevice); // Copy positions to GPU global memory
    cudaMemcpy(vel_l_x, vel_g_x, sizeof(float)*np, cudaMemcpyHostToDevice); // Copy velocities to GPU global memory
    cudaMemcpy(pos_l_y, pos_g_y, sizeof(float)*np, cudaMemcpyHostToDevice); // Copy positions to GPU global memory
    cudaMemcpy(vel_l_y, vel_g_y, sizeof(float)*np, cudaMemcpyHostToDevice); // Copy velocities to GPU global memory
    cudaMemcpy(pos_l_z, pos_g_z, sizeof(float)*np, cudaMemcpyHostToDevice); // Copy positions to GPU global memory
    cudaMemcpy(vel_l_z, vel_g_z, sizeof(float)*np, cudaMemcpyHostToDevice); // Copy velocities to GPU global memory

    printf("Starting GPU kernel\n");
    // start cuda timer
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    ParticleMoverGPU <<<blocks*FACTOR, THREADS >>>(vel_l_x, vel_l_y, vel_l_z, pos_l_x, pos_l_y, pos_l_z, np, ts, dt); // Launch GPU kernel

    //stop cuda timer
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    printf("GPU kernel finished\n");

    // Output GPU computation time
    printf("GPUtime = %6.1f ms\n", elapsedTime);

    // Output speedup factor
    printf("CASE=%i, Speedup = %4.2f\n", CASE, CPUtime*1E3 / elapsedTime);

}

$ nvcc -O3 -o t895 t895.cu
$ ./t895
Starting CPU kernel
CPU kernel finished
CPUtime = 923.6 ms
Starting GPU kernel
GPU kernel finished
GPUtime = 12.3 ms
CASE=0, Speedup = 74.95
$
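
On the vector-load alternative mentioned in item 5: a minimal sketch (my illustration under stated assumptions, not part of the answer's tested code) would pad each vector out to 16 bytes and move it with single 128-bit transactions via float4, reusing the field #defines from the listing above. The hypothetical kernel below also keeps each particle's state in registers across the inner loop, which is itself a further optimization over re-reading global memory every time-step:

    // Assumed layout: vel_d and pos_d are float4* with the .w component as unused padding.
    __global__ void ParticleMoverGPU_vec(float4 *vel_d, float4 *pos_d, int np, int ts, float dt){
        int n = threadIdx.x + blockDim.x * blockIdx.x;
        while (n < np){
            float4 v = vel_d[n]; // one 128-bit load replaces three 32-bit loads
            float4 p = pos_d[n];
            int it = 0;
            while (it < ts){
                // ----- Boris velocity update, same math as before ----- //
                float3 vminus = make_float3(v.x + CEX*0.5f*dt, v.y + CEY*0.5f*dt, v.z + CEZ*0.5f*dt);
                float3 tvec = make_float3(CBX*0.5f*dt, CBY*0.5f*dt, CBZ*0.5f*dt);
                float tvec_fact = 2 / (1 + tvec.x*tvec.x + tvec.y*tvec.y + tvec.z*tvec.z);
                float3 vprime = make_float3(vminus.x + vminus.y*tvec.z - vminus.z*tvec.y,
                                            vminus.y + vminus.z*tvec.x - vminus.x*tvec.z,
                                            vminus.z + vminus.x*tvec.y - vminus.y*tvec.x);
                v.x = vminus.x + (vprime.y*tvec.z - vprime.z*tvec.y)*tvec_fact + CEX*0.5f*dt;
                v.y = vminus.y + (vprime.z*tvec.x - vprime.x*tvec.z)*tvec_fact + CEY*0.5f*dt;
                v.z = vminus.z + (vprime.x*tvec.y - vprime.y*tvec.x)*tvec_fact + CEZ*0.5f*dt;
                // ----- Position update ----- //
                p.x += v.x*dt; p.y += v.y*dt; p.z += v.z*dt;
                it++;
            }
            vel_d[n] = v; // one 128-bit store on the way out
            pos_d[n] = p;
            n += blockDim.x*gridDim.x;
        }
    }

Whether this beats the SoA version would have to be measured: it moves 16 bytes per vector where only 12 are useful, in exchange for wider transactions and one load/store pair per vector.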

Regarding "CUDA structure alignment is slowing down my code (compilable example)", the original question and answer can be found on Stack Overflow: https://stackoverflow.com/questions/32233518/
