
c - Measuring the total time spent in kernels when using streams

Reposted. Author: 行者123. Updated: 2023-11-30 17:38:44

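I am profiling the total time spent in a kernel that is launched many times, and I would like to know whether this code gives me the total time spent in the streamed kernels, or whether the returned time needs to be multiplied by the number of launches.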

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

for(x=0; x<SIZE; x+=N*2){

    gpuErrchk(cudaMemcpyAsync(data_d0, data_h+x, N*sizeof(char), cudaMemcpyHostToDevice, stream0));
    gpuErrchk(cudaMemcpyAsync(data_d1, data_h+x+N, N*sizeof(char), cudaMemcpyHostToDevice, stream1));

    gpuErrchk(cudaMemcpyAsync(array_d0, array_h, wrap->size*sizeof(node_r), cudaMemcpyHostToDevice, stream0));
    gpuErrchk(cudaMemcpyAsync(array_d1, array_h, wrap->size*sizeof(node_r), cudaMemcpyHostToDevice, stream1));

    cudaEventRecord(start, 0);
    GPU<<<N/512,512,0,stream0>>>(array_d0, data_d0, out_d0);
    GPU<<<N/512,512,0,stream1>>>(array_d1, data_d1, out_d1);
    cudaEventRecord(stop, 0);

    gpuErrchk(cudaMemcpyAsync(out_h+x, out_d0, N*sizeof(int), cudaMemcpyDeviceToHost, stream0));
    gpuErrchk(cudaMemcpyAsync(out_h+x+N, out_d1, N*sizeof(int), cudaMemcpyDeviceToHost, stream1));

}

float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
printf("Time %f ms\n", elapsedTime);

Best answer

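No. It will not capture the total kernel execution time across all passes of the loop; because the events are re-recorded on every pass, the result covers only the last pass.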

From the documentation:

If cudaEventRecord() has previously been called on event, then this call will overwrite any existing state in event. Any subsequent calls which examine the status of event will only examine the completion of this most recent call to cudaEventRecord().

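If you believe the execution time of each pass through the loop is roughly the same, you can simply multiply the result by the number of passes.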
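The multiplication suggested above can be sketched as follows. This is only an illustration, assuming the loop shape from the question (`for (x = 0; x < SIZE; x += N*2)`) and N greater than zero. The helper names `passCount` and `estimateTotalMs` are hypothetical and do not appear in the original code.

```cuda
#include <cstddef>

// Number of iterations executed by: for (x = 0; x < size; x += 2 * n)
// Assumes n > 0. Rounds up, because a final partial chunk still runs one pass.
size_t passCount(size_t size, size_t n) {
    return (size + 2 * n - 1) / (2 * n);
}

// Rough estimate of the total kernel time, assuming every pass takes about
// as long as the one pass that the start/stop events actually measured.
float estimateTotalMs(float onePassMs, size_t size, size_t n) {
    return onePassMs * static_cast<float>(passCount(size, n));
}
```

For example, with SIZE = 4096 and N = 512 the loop makes 4 passes, so a measured 1.5 ms would be scaled to an estimate of 6 ms. The estimate is only as good as the assumption that the passes take similar time.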

Note that you should issue a cudaEventSynchronize() call on the stop event before calling cudaEventElapsedTime().
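If an estimate is not good enough, one possible alternative is to give every pass its own event pair, synchronize on the last stop event, and sum the per-pass times. The sketch below is not the asker's code: `scaleKernel` is a hypothetical stand-in for the `GPU` kernel, and `CHECK` is a hypothetical stand-in for `gpuErrchk`. It also assumes the default (legacy) stream behavior and streams created with cudaStreamCreate(), so that events recorded in stream 0 are ordered against work in both streams.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical error-checking macro, standing in for the question's gpuErrchk.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",                \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Hypothetical stand-in for the question's GPU kernel: doubles each element.
__global__ void scaleKernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Runs `passes` rounds of two kernel launches (one per stream) and returns
// the sum of the per-pass elapsed times in milliseconds.
float timeAllPasses(int *d0, int *d1, int n, int passes,
                    cudaStream_t s0, cudaStream_t s1) {
    assert(n > 0);
    if (passes <= 0) return 0.0f;

    // One start/stop pair per pass, so no recording overwrites another.
    std::vector<cudaEvent_t> start(passes), stop(passes);
    for (int p = 0; p < passes; ++p) {
        CHECK(cudaEventCreate(&start[p]));
        CHECK(cudaEventCreate(&stop[p]));
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int p = 0; p < passes; ++p) {
        CHECK(cudaEventRecord(start[p], 0));
        scaleKernel<<<blocks, threads, 0, s0>>>(d0, n);
        scaleKernel<<<blocks, threads, 0, s1>>>(d1, n);
        CHECK(cudaGetLastError());
        CHECK(cudaEventRecord(stop[p], 0));
    }

    // Events in one stream complete in order, so waiting for the last stop
    // event means every earlier event has completed as well.
    CHECK(cudaEventSynchronize(stop[passes - 1]));

    float totalMs = 0.0f;
    for (int p = 0; p < passes; ++p) {
        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, start[p], stop[p]));
        totalMs += ms;
        CHECK(cudaEventDestroy(start[p]));
        CHECK(cudaEventDestroy(stop[p]));
    }
    return totalMs;
}
```

A caller would allocate the two device buffers, create the two streams, and pass them to `timeAllPasses`. Recording in the legacy default stream, as the question does, adds implicit synchronization between the streams, so the figure is a bracketed time for both kernels rather than a per-kernel time.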

Regarding "c - Measuring the total time spent in kernels when using streams", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/22048981/
