
c++ - Why doesn't CUDA result in a speedup of the C++ code?


I am using VS2019 and have an NVIDIA GeForce GPU. I tried the code from this link: https://towardsdatascience.com/writing-lightning-fast-code-with-cuda-c18677dcdd5f

The author of that article claims that using CUDA speeds things up. For me, however, the serial version takes about 7 ms while the CUDA version takes about 28 ms. Why is CUDA slower for this code? The code I used is below:

#include <iostream>
#include <chrono>
#include <cmath>
#include <cuda_runtime.h>

__global__
void add(int n, float* x, float* y)
{
    // Grid-stride loop: each thread handles elements index, index+stride, ...
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

void addSerial(int n, float* x, float* y)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

int main()
{
    // Serial baseline
    int NSerial = 1 << 20;
    float* xSerial = new float[NSerial];
    float* ySerial = new float[NSerial];
    for (int i = 0; i < NSerial; i++) {
        xSerial[i] = 1.0f;
        ySerial[i] = 2.0f;
    }
    auto t1Serial = std::chrono::high_resolution_clock::now();
    addSerial(NSerial, xSerial, ySerial);
    auto t2Serial = std::chrono::high_resolution_clock::now();
    auto durationSerial = std::chrono::duration_cast<std::chrono::milliseconds>(t2Serial - t1Serial).count();
    float maxErrorSerial = 0.0f;
    for (int i = 0; i < NSerial; i++)
        maxErrorSerial = fmax(maxErrorSerial, fabs(ySerial[i] - 3.0f));
    std::cout << "Max error Serial: " << maxErrorSerial << std::endl;
    std::cout << "durationSerial: " << durationSerial << std::endl;
    delete[] xSerial;
    delete[] ySerial;

    // CUDA version
    int N = 1 << 20;

    float* x, * y;
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    int device = -1;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(x, N * sizeof(float), device, NULL);
    cudaMemPrefetchAsync(y, N * sizeof(float), device, NULL);

    int blockSize = 1024;
    int numBlocks = (N + blockSize - 1) / blockSize;
    auto t1 = std::chrono::high_resolution_clock::now();
    add<<<numBlocks, blockSize>>>(N, x, y);

    cudaDeviceSynchronize();
    auto t2 = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();

    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));
    std::cout << "Max error: " << maxError << std::endl;
    std::cout << "duration CUDA: " << duration;

    cudaFree(x);
    cudaFree(y);

    return 0;
}

Best Answer

A few observations here:

  • The first call to a CUDA kernel can incur a lot of one-time latency related to setup on the GPU, so the usual approach is to include a "warm-up" call before the timed run (see the sketch after this list).
  • The kernel in your question is a "resident" (grid-stride) design, so it performs best when you launch only as many blocks as are needed to fully occupy the GPU; there is an API to query that number.
  • Perform the timing in microseconds, not milliseconds.
  • Build the code in Release mode.
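For illustration, a minimal warm-up sketch (my addition, not part of the original answer, reusing N, x, y, numBlocks and blockSize from the question's code) launches the kernel once and synchronizes before the timed launch:

    // Warm-up launch: absorb one-time CUDA context/module setup cost
    // outside the timed region. Note that each extra launch adds x to y
    // again, so any verification value must be adjusted accordingly.
    add<<<numBlocks, blockSize>>>(N, x, y);
    cudaDeviceSynchronize();

    // Only now start the timed measurement.
    auto t1 = std::chrono::high_resolution_clock::now();
    add<<<numBlocks, blockSize>>>(N, x, y);
    cudaDeviceSynchronize();
    auto t2 = std::chrono::high_resolution_clock::now();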

Doing all of this with your CUDA code, I get the following:
    int N = 1 << 20;
    int device = -1;
    cudaGetDevice(&device);

    float* x, * y;
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    cudaMemPrefetchAsync(x, N * sizeof(float), device, NULL);
    cudaMemPrefetchAsync(y, N * sizeof(float), device, NULL);

    // Ask the runtime for the block size (and minimum grid size) that
    // maximizes occupancy for this kernel, instead of hard-coding 1024.
    int blockSize, numBlocks;
    cudaOccupancyMaxPotentialBlockSize(&numBlocks, &blockSize, add);

    for (int rep = 0; rep < 10; rep++) {
        auto t1 = std::chrono::high_resolution_clock::now();
        add<<<numBlocks, blockSize>>>(N, x, y);
        cudaDeviceSynchronize();
        auto t2 = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
        std::cout << rep << " duration CUDA: " << duration << std::endl;
    }

    // After 10 repetitions y[i] = 2 + 10 * 1 = 12, hence the 12.0f reference value.
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 12.0f));
    std::cout << "Max error: " << maxError << std::endl;

    cudaFree(x);
    cudaFree(y);

And build and run it:
    $ nvcc -arch=sm_52 -std=c++11 -o not_so_fast not_so_fast.cu 
    $ ./not_so_fast
    Max error Serial: 0
    durationSerial: 2762
    0 duration CUDA: 1074
    1 duration CUDA: 150
    2 duration CUDA: 151
    3 duration CUDA: 158
    4 duration CUDA: 152
    5 duration CUDA: 152
    6 duration CUDA: 147
    7 duration CUDA: 124
    8 duration CUDA: 112
    9 duration CUDA: 113
    Max error: 0

On my system, the first GPU run is close to three times faster than the serial loop, and the second and subsequent runs are almost 10 times faster again. Your results can (and probably will) vary.
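As an aside (my addition, not part of the original answer), the kernel time can also be measured with CUDA events rather than std::chrono, which brackets just the GPU work on the stream; a minimal sketch, again reusing the variable names from the code above:

    // Sketch: timing the same launch with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    add<<<numBlocks, blockSize>>>(N, x, y);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);               // block until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds (float)
    std::cout << "kernel time: " << ms * 1000.0f << " us" << std::endl;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);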

Regarding "c++ - Why doesn't CUDA result in a speedup of the C++ code?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60085669/
