
gpgpu - DirectX 11 compute shader device synchronization?

Reposted. Author: 行者123. Updated: 2023-12-01 10:26:43

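Background: benchmarking and comparing implementations across GPGPU platforms.

Problem: device synchronization when dispatching DirectX 11 compute shaders.

I am looking for an equivalent of clFinish(...) / cudaDeviceSynchronize() so that I can fairly compare the execution times of my algorithm.

The CUDA and OpenCL functions are much clearer about what blocks and what does not. DirectCompute, however, is more closely tied to the graphics pipeline (which I am learning but am still very unfamiliar with), so I find it hard to tell whether a Dispatch call is blocking, or whether earlier memory allocations and transfers have completed.

Code DX_1: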

// Setup
...
for (...) {
startTimer();
context->Dispatch(number_of_groups, 1, 1);
times[i] = stopTimer();
}
// Release
...

Code DX_2:

for (...) {
// Setup
...
startTimer();
context->Dispatch(number_of_groups, 1, 1);
times[i] = stopTimer();
// Release
...
}

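Results (average times for 2^2 to 2^11 elements):

 DX_1    DX_2    CUDA
  1.6   205.5    24.8
  1.8   133.4    24.8
 29.1   186.5    25.6
 18.6   175.0    25.6
 11.4   187.5    26.6
 85.2   127.7    26.3
166.4   151.1    28.1
 98.2   149.5    35.2
 26.8   203.5    31.6

Note: these timings were taken on a desktop GPU with a screen attached, so some erratic timings are to be expected. The timings should not include host-to-device buffer transfers.

Note 2: these are very short sequences (4 - 2048 elements). The interesting tests are for problem sizes of up to 2^26 elements.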
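
One possible reading of the very small DX_1 numbers is that Dispatch only records work for the GPU and returns before that work has run, so a host timer wrapped around the call measures submission rather than execution. The sketch below is a host-only toy model of that behaviour, not D3D11 code: ToyQueue, enqueue, finish and fake_kernel are hypothetical names invented for illustration, and the model assumes a single-threaded queue that runs nothing until it is explicitly drained.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy stand-in for a deferred command queue (hypothetical, not a D3D11 type).
class ToyQueue {
public:
    // Plays the role of Dispatch(): records the work and returns immediately.
    void enqueue(std::function<void()> work) {
        pending_.push_back(std::move(work));
    }

    // Plays the role of clFinish() / cudaDeviceSynchronize(): runs everything
    // recorded so far and returns only once all of it has completed.
    void finish() {
        for (auto& work : pending_) {
            work();
        }
        pending_.clear();
    }

    // Number of recorded commands that have not run yet.
    std::size_t pending() const { return pending_.size(); }

private:
    std::vector<std::function<void()>> pending_;
};

// Stand-in for a kernel: sums 0..n-1 into `out`.
inline void fake_kernel(int n, long long& out) {
    out = 0;
    for (int i = 0; i < n; ++i) {
        out += i;
    }
}
```

In this model, stopping a timer right after enqueue() would capture only the cost of recording the command; the result becomes visible only after finish(), which is where a fair measurement would have to end.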

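Best answer

My new solution is to avoid synchronizing with the device altogether. Instead, I looked into ways of retrieving timestamps; the results look good and I am fairly confident the comparison is fair. I compared my CUDA timings (event record vs. QPC): the difference is small and the overhead appears to be constant.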

CUDA Event   Host QPC
       4.6       30.0
       4.8       30.0
       5.0       31.0
       5.2       32.0
       5.6       34.0
       6.1       34.0
       6.9       31.0
       8.3       47.0
       9.2       34.0
      12.0       39.0
      16.7       46.0
      20.5       55.0
      32.1       69.0
      48.5      111.0
      86.0      134.0
     182.4      237.0
     419.0      473.0

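In case my question leads someone here who is looking for how to benchmark GPGPU code, I will leave some code that demonstrates my current benchmarking strategy.

Code example, CUDA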

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
float milliseconds = 0;
cudaEventRecord(start);
...
// Launch my algorithm
...
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&milliseconds, start, stop);

OpenCL

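// Note: the command queue must be created with CL_QUEUE_PROFILING_ENABLE,
// otherwise clGetEventProfilingInfo cannot return timestamps.
cl_event start_event, end_event;
cl_ulong start = 0, end = 0;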
// Enqueue a dummy kernel for the start event.
clEnqueueNDRangeKernel(..., &start_event);
...
// Launch my algorithm
...
// Enqueue a dummy kernel for the end event.
clEnqueueNDRangeKernel(..., &end_event);
clWaitForEvents(1, &end_event);
clGetEventProfilingInfo(start_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(end_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
// Profiling timestamps are in nanoseconds; convert to milliseconds.
timeInMS = (double)(end - start)*(double)(1e-06);

DirectCompute

Here I followed Adam Miles's suggestion and looked into that source. It looks like this:

ID3D11Device* device = nullptr;
...
// Setup
...
ID3D11QueryPtr disjoint_query;
ID3D11QueryPtr q_start;
ID3D11QueryPtr q_end;
...
if (disjoint_query == NULL)
{
D3D11_QUERY_DESC desc;
desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
desc.MiscFlags = 0;
device->CreateQuery(&desc, &disjoint_query);
desc.Query = D3D11_QUERY_TIMESTAMP;
device->CreateQuery(&desc, &q_start);
device->CreateQuery(&desc, &q_end);
}
context->Begin(disjoint_query);
context->End(q_start);
...
// Launch my algorithm
...
context->End(q_end);
context->End(disjoint_query);
UINT64 start, end;
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT q_freq;
// GetData returns S_FALSE while a result is still pending, so spin until each one is ready.
while (S_OK != context->GetData(q_start, &start, sizeof(UINT64), 0)){};
while (S_OK != context->GetData(q_end, &end, sizeof(UINT64), 0)){};
while (S_OK != context->GetData(disjoint_query, &q_freq, sizeof(D3D11_QUERY_DATA_TIMESTAMP_DISJOINT), 0)){};
// If q_freq.Disjoint is TRUE the timestamps are unreliable and the sample should be discarded.
timeInMS = (((double)(end - start)) / ((double)q_freq.Frequency)) * 1000.0;

C/C++/OpenMP

static LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds, Frequency;

static void __inline startTimer()
{
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
}

static double __inline stopTimer()
{
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMicroseconds.QuadPart *= 1000000;
ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
return (double)ElapsedMicroseconds.QuadPart;
}
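
QueryPerformanceCounter is specific to Windows. For readers on other platforms, a rough equivalent of the two helpers above can be sketched with std::chrono::steady_clock, assuming a C++11 compiler. HostTimer and stop_us are hypothetical names, and this is a host-side timer only: it says nothing about when queued GPU work actually finishes.

```cpp
#include <chrono>
#include <ratio>

// Minimal host-side stopwatch based on a monotonic clock.
class HostTimer {
public:
    // Records the starting point, like startTimer().
    void start() { begin_ = std::chrono::steady_clock::now(); }

    // Returns the microseconds elapsed since start(), like stopTimer().
    double stop_us() const {
        const auto elapsed = std::chrono::steady_clock::now() - begin_;
        return std::chrono::duration<double, std::micro>(elapsed).count();
    }

private:
    std::chrono::steady_clock::time_point begin_;
};
```

Unlike the integer arithmetic in stopTimer(), this version keeps fractional microseconds, which may matter for the very short runs in the tables above.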

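My code samples are taken out of context and I tried to clean them up a bit, so they may contain errors.

Regarding "gpgpu - DirectX 11 compute shader device synchronization?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/33277472/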
