gpt4 book ai didi

c++ - 为什么我的程序在我的 CPU 设备上运行速度明显快于我的 GPU 设备?

转载 作者:行者123 更新时间:2023-11-30 05:21:44 24 4
gpt4 key购买 nike

我是 OpenCL 的新手。我在大学里学了一点,我的意思是我的图形学教授只教了我们 GPGPU 和 OpenCL 一天(而类(class)的其余部分则专注于着色器和 OpenGL 等)。

我采用了一个示例程序并将其更改为适用于我希望它运行的计算。但是,我的程序在 CPU 上的运行速度明显快于我的 GPU,我正在尝试了解原因。

我的程序接受一个输入 float 组并有两个输出数组。在单线程情况下它有三个参数。输入数组的大小为:samplesPerTracetracesInsizeof(float),输出数组的大小为:samplesPerTracetracesOutsizeof(float)。

我的测试用例一直使用参数 25000 2500 250,因为这是我将要使用的数组大小的平均值(可能略高于平均值)。这些值是随机填充的。

这里是OpenCL在内核上构建和运行的源代码;

const char* M_AND_S_OPENCL_SOURCE_TEXT =
"__kernel void sumAllCL(__global const float prestackTraces[],\n"
" __global float stackTracesOut[],\n"
" __global float powerTracesOut[], const unsigned int nTracesOut, const unsigned int nTracesIn,\n"
" const unsigned int samplesPerTrace) {\n"
"\n"
" unsigned int k = get_global_id(0);\n" // Thread ID
"\n"
" unsigned int kTimesIn = k * nTracesIn;\n" // Store repeat ints
" unsigned int kTimesSamples = k * samplesPerTrace;\n"
"\n"
" for (int j = 0; j < ? ; j++) {\n" // ? position to be replaced (nTracesOut)"
"\n"
" int jTimesSamplesPT = j * samplesPerTrace;\n"
"\n"
" for (int i = 0; i < # ; i++) {\n" // # position to be replaced ()
"\n"
" int valueIndex = i + jTimesSamplesPT;\n"
" float value = prestackTraces[valueIndex];\n"
"\n"
" stackTracesOut[i + kTimesSamples] += value;\n"
" powerTracesOut[i + kTimesSamples] += (value * value);\n"
"\n"
" }\n"
" }\n"
"}\n";

注意 ?并且 # 在运行时被替换为固定数字,我这样做是因为我认为这将帮助编译器展开 rl

使用上述参数 (25000 2500 250 ~10 <1 or 2>),我的 CPU 大约需要 0.6 秒来完成程序,而我的 GPU 大约需要 40 秒才能完成。这是一个更大的区别。仅供引用,我一直在弄乱第 4 个参数,看看哪个值运行得更快,这就是 ~10 的意思。

我的显卡是 MSI Radeon R9 390X 8GB,名称为 Hawaii。当我让 OpenCL 打印出关于我的两个设备的信息时,这就是我得到的:

OpenCL Platform 0: AMD Accelerated Parallel Processing
----- OpenCL Device # 0: Hawaii-----
Gflops: 47.520000
Max Clock Frequency: 1080
Max Compute Units: 44
Max Work Group Size: 256
MEMORY...
Total Memory of Device: 8.000G (CL_DEVICE_GLOBAL_MEM_SIZE)
Local Memory of Device: 32.000K (CL_DEVICE_LOCAL_MEM_SIZE)
Max Memory Object Allocation: 3.999G (CL_DEVICE_MAX_MEM_ALLOC_SIZE)
Cache Size: 16.000K (CL_DEVICE_GLOBAL_MEM_CACHE_SIZE)
Cacheline Size: 64 bytes (CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE)
VERSIONS...
Device Vendor: Advanced Micro Devices, Inc.
Device Version: OpenCL 2.0 AMD-APP (2117.13)
Driver Version: 2117.13 (VM)
Device OpenCL Version: OpenCL C 2.0
----- OpenCL Device # 1: Intel(R) Core(TM) i7-6700K CPU ? 4.00GHz-----
Gflops: 32.064000
Max Clock Frequency: 4008
Max Compute Units: 8
Max Work Group Size: 1024
MEMORY...
Total Memory of Device: 15.967G (CL_DEVICE_GLOBAL_MEM_SIZE)
Local Memory of Device: 32.000K (CL_DEVICE_LOCAL_MEM_SIZE)
Max Memory Object Allocation: 3.1028G (CL_DEVICE_MAX_MEM_ALLOC_SIZE)
Cache Size: 32.000K (CL_DEVICE_GLOBAL_MEM_CACHE_SIZE)
Cacheline Size: 64 bytes (CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE)
VERSIONS...
Device Vendor: GenuineIntel
Device Version: OpenCL 1.2 AMD-APP (2117.13)
Driver Version: 2117.13 (sse2,avx)
Device OpenCL Version: OpenCL C 1.2

这是与 OpenCL 相关的代码。我会发布一个完整的最小可验证完整示例,但它超出了字符数限制。

/*
* Prints the given int (numToInsert) at location inside chars.
*/
void PrintIntInStr(char* chars, int location, int numToInsert) {

std::stringstream strs;
strs << numToInsert;
std::string temp_str = strs.str();
char const* numToChars = temp_str.c_str();

int numberLength = strlen(numToChars);

int w;
for (w = 0; w < numberLength; w++) {
chars[location + w] = numToChars[w];
}
}

/*
* Initialize fastest OpenCL device.
*/
int InitOpenCL(int verbose, cl_int deviceType) {

cl_uint Nplat;
cl_int err;
char name[1024];
int MaxGflops = -1;

cl_platform_id winnerPlatform = 0;

// Reset (TODO)
_deviceID = NULL;
_context = NULL;
_queue = NULL;

// Get platforms
cl_platform_id platforms[4];
if (clGetPlatformIDs(4, platforms, &Nplat)) Fatal("Cannot get number of OpenCL platforms\n");
else if (Nplat<1) Fatal("No OpenCL platforms found\n");

// Loop over platforms
for (unsigned int platform = 0; platform < Nplat; platform++) {

if (clGetPlatformInfo(platforms[platform], CL_PLATFORM_NAME, sizeof(name), name, NULL)) Fatal("Cannot get OpenCL platform name\n");
if (verbose) printf("OpenCL Platform %d: %s\n", platform, name);

// Get GPU device IDs
cl_uint Ndev;
cl_device_id id[4];
if (clGetDeviceIDs(platforms[platform], deviceType, 4, id, &Ndev))
Fatal("Cannot get number of OpenCL devices: %d\n", platform);
else if (Ndev < 1) Fatal("No OpenCL devices found.\n");

// Find the fastest device
for (unsigned int devId = 0; devId < Ndev; devId++) {

// Print informatio about the device
cl_uint compUnits, freq, cacheLineSize;
cl_ulong memSize, maxAlloc, localMemSize, globalCacheSize;
size_t maxWorkGrps;
char deviceVendor[50];
char deviceVersion[50];
char driverVersion[50];
char deviceOpenCLVersion[50];

// Computing Power...
if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(compUnits), &compUnits, NULL)) Fatal("Cannot get OpenCL device units\n");
if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq), &freq, NULL)) Fatal("Cannot get OpenCL device frequency\n");
if (clGetDeviceInfo(id[devId], CL_DEVICE_NAME, sizeof(name), name, NULL)) Fatal("Cannot get OpenCL device name\n");
if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(maxWorkGrps), &maxWorkGrps, NULL)) Fatal("Cannot get OpenCL max work group size\n");
// Memory...
if (clGetDeviceInfo(id[devId], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(memSize), &memSize, NULL)) Fatal("Cannot get OpenCL memory size.\n");
if (clGetDeviceInfo(id[devId], CL_DEVICE_LOCAL_MEM_SIZE, sizeof(localMemSize), &localMemSize, NULL)) localMemSize = 0;
if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(maxAlloc), &maxAlloc, NULL)) Fatal("Cannot get OpenCL memory size.\n");
if (clGetDeviceInfo(id[devId], CL_DEVICE_GLOBAL_MEM_CACHE_SIZE, sizeof(globalCacheSize), &globalCacheSize, NULL)) globalCacheSize = 0;
if (clGetDeviceInfo(id[devId], CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE, sizeof(cacheLineSize), &cacheLineSize, NULL)) cacheLineSize = 0;
// Versions...
clGetDeviceInfo(id[devId], CL_DEVICE_VENDOR, sizeof(deviceVendor), deviceVendor, NULL);
clGetDeviceInfo(id[devId], CL_DEVICE_VERSION, sizeof(deviceVersion), deviceVersion, NULL);
clGetDeviceInfo(id[devId], CL_DRIVER_VERSION, sizeof(driverVersion), driverVersion, NULL);
clGetDeviceInfo(id[devId], CL_DEVICE_OPENCL_C_VERSION, sizeof(deviceOpenCLVersion), deviceOpenCLVersion, NULL);

int Gflops = compUnits * freq;

if (verbose) printf(" ----- OpenCL Device # %d: %s-----\n"
"Gflops: %f\n"
"Max Clock Frequency: %d\n"
"Max Compute Units: %d\n"
"Max Work Group Size: %zu\n"
" MEMORY...\n"
"Total Memory of Device: %s (CL_DEVICE_GLOBAL_MEM_SIZE)\n"
"Local Memory of Device: %s (CL_DEVICE_LOCAL_MEM_SIZE)\n"
"Max Memory Object Allocation: %s (CL_DEVICE_MAX_MEM_ALLOC_SIZE)\n"
"Cache Size: %s (CL_DEVICE_GLOBAL_MEM_CACHE_SIZE)\n"
"Cacheline Size: %s (CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE)\n"
" VERSIONS...\n"
"Device Vendor: %s\n"
"Device Version: %s\n"
"Driver Version: %s\n"
"Device OpenCL Version: %s\n",
devId,
name,
(1e-3 * Gflops),
freq,
compUnits,
maxWorkGrps,
byteConverter((unsigned long)memSize),
byteConverter((unsigned long)localMemSize),
byteConverter((unsigned long)maxAlloc),
byteConverter((unsigned long)globalCacheSize),
byteConverter((unsigned long)cacheLineSize),
deviceVendor,
deviceVersion,
driverVersion,
deviceOpenCLVersion);

if(Gflops > MaxGflops)
{
_deviceID = id[devId];
MaxGflops = Gflops;

winnerPlatform = platforms[platform];
}
}
}

// Print fastest device info (TODO: don't get name twice)
if (clGetDeviceInfo(_deviceID, CL_DEVICE_NAME, sizeof(name), name, NULL)) Fatal("Cannot get OpenCL device name\n");
printf("\n Selected Fastest Open CL Device: %s (#%lu)\n", name, (unsigned long)_deviceID);

// Check thread count
size_t mwgs;
if (clGetDeviceInfo(_deviceID, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(mwgs), &mwgs, NULL))
Fatal("Cannot get OpenCL max work group size\n");

// Create OpenCL context for fastest device
cl_context_properties cps[3] =
{
CL_CONTEXT_PLATFORM,
(cl_context_properties)winnerPlatform,
(cl_context_properties)0
};
_context = clCreateContextFromType(cps, deviceType, NULL, NULL, &err);
if (!_context || err) Fatal("Cannot create OpenCL Context\n");

// Properties for create command queue; currently nothing
// cl_command_queue_properties *propers;
cl_command_queue_properties prop = 0;
//prop |= CL_QUEUE_PROFILING_ENABLE;
//prop |= CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE;
// propers = &prop;

_queue = clCreateCommandQueueWithProperties(_context, _deviceID, &prop, &err); // Create OpenCL command queue for fastest device
// _queue = clCreateCommandQueue(_context, _deviceID, &prop, &err);
if (!_queue || err) {
if (err == CL_INVALID_CONTEXT) Fatal("Cannot create OpenCL command cue: CL_INVALID_CONTEXT\n");
else if (err == CL_INVALID_DEVICE) Fatal("Cannot create OpenCL command cue: CL_INVALID_DEVICE\n");
else if (err == CL_INVALID_VALUE) Fatal("Cannot create OpenCL command cue: CL_INVALID_VALUE\n");
else if (err == CL_INVALID_QUEUE_PROPERTIES) Fatal("Cannot create OpenCL command cue: CL_INVALID_QUEUE_PROPERTIES\n");
else if (err == CL_OUT_OF_RESOURCES) Fatal("Cannot create OpenCL command cue: CL_OUT_OF_RESOURCES\n");
else if (err == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot create OpenCL command cue: CL_OUT_OF_HOST_MEMORY\n");
else if (!_queue) Fatal("Cannot create OpenCL command cue: !queue\n");
else Fatal("Cannot create OpenCL command cue: ?????\n");
}

if (_VERBOSE) printf("Init complete.\n");

return mwgs;
}

/*
* Modify the source text to fit this run.
*/
char* ModifySourceText(unsigned int nTracesIn, unsigned int samplesPerT) {

size_t sourceSize = strlen(M_AND_S_OPENCL_SOURCE_TEXT) + 1;
char* moveStackSourceCode = new char[sourceSize];
strncpy(moveStackSourceCode, M_AND_S_OPENCL_SOURCE_TEXT, sourceSize);
moveStackSourceCode[sourceSize] = '\0';

// Print out the locations of the characters where we should insert other text if asked to do so
if (_FIND_INSERT_LOCATIONS) {
size_t z;
for (z = 0; z < sourceSize; z++) {
if (moveStackSourceCode[z] == '@') {
printf("Found @ at position %zu\n", z);
break;
}
}
for (z = 0; z < sourceSize; z++) {
if (moveStackSourceCode[z] == '#') {
printf("Found # at position %zu\n", z);
break;
}
}
}

// Insert the digit that for loops go to inside of the source
PrintIntInStr(moveStackSourceCode, INSERT_LOCATION_1, nTracesIn);
PrintIntInStr(moveStackSourceCode, INSERT_LOCATION_2, samplesPerT);

// Print the modified source code if verbose
if (_FIND_INSERT_LOCATIONS) {
printf("\n GPU Source Code: \n");
printf("%s\n", moveStackSourceCode);
}

return moveStackSourceCode;
}

/*
* Wait for event and then release it.
*/
static void WaitForEventAndRelease(cl_event *event) {

printf("WaitForEventAndRelease()\n");

cl_int status = CL_SUCCESS;

status = clWaitForEvents(1, event);
if (status) Fatal("clWaitForEvents Failed with Error Code");

printf("About to release event...\n");

status = clReleaseEvent(*event);
if (status) Fatal("clReleaseEvent Failed with Error Code");
}


// Runs the program via open CL
static double RunOpenCL(float prestackTracesArray[], float stackTracesOut1DArray[], float powerTracesOut1DArray[],
unsigned int nTracesOut, unsigned int nTracesIn, unsigned int samplesPerT,
size_t inXsamples, size_t outXsamples,
unsigned int localThreadCount)
{

cl_int err;

// Get the source code
char* modifiedGpuSource = ModifySourceText(nTracesIn, samplesPerT);

// Allocate device memory
// CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR | CL_MEM_USE_PERSISTENT_MEM_AMD (?)
// Input...
cl_mem prestackTracesCL = clCreateBuffer(_context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
inXsamples * sizeof(cl_float), prestackTracesArray, &err);
if (err) FatalBufferCreation("Prestack traces", err);
// Output... TODO: How do we know that the output is zeroed out?
cl_mem stackTracesOutCL = clCreateBuffer(_context, CL_MEM_WRITE_ONLY,
outXsamples * sizeof(cl_float), NULL, &err);
if (err) FatalBufferCreation("Stack traces", err);
cl_mem powerTracesOutCL = clCreateBuffer(_context, CL_MEM_WRITE_ONLY,
outXsamples * sizeof(cl_float), NULL, &err);
if (err) FatalBufferCreation("Power traces", err);

// Compile the source code
char* gpuSourceText[1];
gpuSourceText[0] = modifiedGpuSource;
size_t sourceLength[1];
sourceLength[0] = strlen(modifiedGpuSource);
cl_program moveoutAndStackCLProgram = clCreateProgramWithSource(_context, 1, (const char**)gpuSourceText,
(const size_t*)sourceLength, &err);
if (err != CL_SUCCESS) {
if (err == CL_INVALID_CONTEXT) Fatal("Cannot create program: CL_INVALID_CONTEXT\n");
else if (err == CL_INVALID_VALUE) Fatal("Cannot create program: CL_INVALID_VALUE\n");
else if (err == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot create program: CL_OUT_OF_HOST_MEMORY\n");
else Fatal("Cannot create program_S %d\n", err);
}

// Build the program
cl_int buildCode = clBuildProgram(moveoutAndStackCLProgram, 0, NULL, NULL, NULL, NULL);
if (buildCode != CL_SUCCESS) {
// Attempt to get compile errors
char log[1048576];
if (clGetProgramBuildInfo(moveoutAndStackCLProgram, _deviceID, CL_PROGRAM_BUILD_LOG, sizeof(log), log, NULL)) {
log[0] = '\0'; // Failed to get the log file
}

if (buildCode == CL_INVALID_PROGRAM) Fatal("Cannot build program: CL_INVALID_PROGRAM\n%s", log);
else if (buildCode == CL_INVALID_VALUE) Fatal("Cannot build program: CL_INVALID_VALUE\n%s", log);
else if (buildCode == CL_INVALID_DEVICE) Fatal("Cannot build program: CL_INVALID_DEVICE\n%s", log);
else if (buildCode == CL_INVALID_BINARY) Fatal("Cannot build program: CL_INVALID_BINARY\n%s", log);
else if (buildCode == CL_INVALID_BUILD_OPTIONS) Fatal("Cannot build program: CL_INVALID_BUILD\n_OPTIONS\n%s", log);
else if (buildCode == CL_INVALID_OPERATION) Fatal("Cannot build program: CL_INVALID_OPERATION\n%s", log);
else if (buildCode == CL_COMPILER_NOT_AVAILABLE) Fatal("Cannot build program: CL_COMPILER_NOT_AVAILABLE\n%s", log);
else if (buildCode == CL_BUILD_PROGRAM_FAILURE) Fatal("Cannot build program: CL_BUILD_PROGRAM_FAILURE\n%s", log);
else if (buildCode == CL_INVALID_OPERATION) Fatal("Cannot build program: CL_INVALID_OPERATION\n%s", log);
else if (buildCode == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot build program: CL_OUT_OF_HOST_MEMORY\n%s", log);
else Fatal("Cannot build program: %d\n%s", buildCode, log);
}

// Compile the source code & build the kernel
cl_kernel kernel = clCreateKernel(moveoutAndStackCLProgram, "sumAllCL", &err);
if (err) {
if (err == CL_INVALID_PROGRAM) Fatal("Cannot create kernel: CL_INVALID_PROGRAM\n");
else if (err == CL_INVALID_PROGRAM_EXECUTABLE) Fatal("Cannot create kernel: CL_INVALID_PROGRAM_EXECUTABLE\n");
else if (err == CL_INVALID_KERNEL_NAME) Fatal("Cannot create kernel: CL_INVALID_KERNEL_NAME\n");
else if (err == CL_INVALID_KERNEL_DEFINITION) Fatal("Cannot create kernel: CL_INVALID_KERNEL_DEFINITION\n");
else if (err == CL_INVALID_VALUE) Fatal("Cannot create kernel: CL_INVALID_VALUE\n");
else if (err == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot create kernel: CL_OUT_OF_HOST_MEMOR\n");
else Fatal("Cannot create kernel: %d\n", err);
}

// Set program parameters
cl_int returnValArgSet;
returnValArgSet = clSetKernelArg(kernel, 0, sizeof(cl_mem), &prestackTracesCL);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("prestackTracesCL", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 1, sizeof(cl_mem), &stackTracesOutCL);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("stackTracesOutCL", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 2, sizeof(cl_mem), &powerTracesOutCL);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("powerTracesOutCL", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 3, sizeof(unsigned int), &nTracesOut);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("nTracesOut", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 4, sizeof(unsigned int), &nTracesIn);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("nTracesIn", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 5, sizeof(unsigned int), &samplesPerT);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("samplesPerT", returnValArgSet);

// TODO: verbose
printf("About to run Kernel...\n");

// Start timer TODO: move?
double runTime = GetTime();

// Run the kernel (& also set the number of threads)
cl_event runEvent;
size_t Global[1] = { nTracesOut };
size_t Local[1] = { localThreadCount };
if (localThreadCount > 0) err = clEnqueueNDRangeKernel(_queue, kernel, 1, NULL, Global, Local, 0, NULL, &runEvent);
else err = clEnqueueNDRangeKernel(_queue, kernel, 1, NULL, Global, NULL, 0, NULL, &runEvent);
if (err) {
if (err == CL_INVALID_PROGRAM_EXECUTABLE) {
Fatal("Cannot run Kernel: No successfully built program executable available.\n");
} else if (err == CL_INVALID_COMMAND_QUEUE) {
Fatal("Cannot run Kernel: Command_queue is not a valid command-queue.\n");
} else if (err == CL_INVALID_KERNEL) {
Fatal("Cannot run Kernel: Kernel is not a valid kernel object.\n");
} else if (err == CL_INVALID_CONTEXT) {
Fatal("Cannot run Kernel: Context associated with command_queue and kernel is not the same or if "
"the context associated with command_queue and events in event_wait_list are not the same.\n");
} else if (err == CL_INVALID_KERNEL_ARGS) {
Fatal("Cannot run Kernel: Kernel argument values have not been specified.\n");
} else if (err == CL_INVALID_WORK_DIMENSION) {
Fatal("Cannot run Kernel: work_dim is not a valid value (must be between 1 and 3).\n");
} else if (err == CL_INVALID_WORK_GROUP_SIZE) {
Fatal("Cannot run Kernel: local_work_size is specified and number of work-items specified by global_work_size "
"is not evenly divisable by size of work-group given by local_work_size or does not match the "
"work-group size specified for kernel using the __attribute__((reqd_work_group_size(X, Y, Z))) "
"qualifier in program source.\n");
} else if (err == CL_INVALID_WORK_ITEM_SIZE) {
Fatal("Cannot run Kernel: If the number of work-items specified in any of local_work_size[0], ... "
"local_work_size[work_dim - 1] is greater than the corresponding values specified "
"by CL_DEVICE_MAX_WORK_ITEM_SIZES[0], .... CL_DEVICE_MAX_WORK_ITEM_SIZES[work_dim - 1]. .\n");
} else if (err == CL_INVALID_GLOBAL_OFFSET) {
Fatal("Cannot run Kernel: Global_work_offset is not NULL.\n");
} else if (err == CL_OUT_OF_RESOURCES) {
Fatal("Cannot run Kernel: CL_OUT_OF_RESOURCES.\n");
} else if (err == CL_MEM_OBJECT_ALLOCATION_FAILURE) {
Fatal("Cannot run Kernel: Failure to allocate memory for data store associated with image or buffer "
"objects specified as arguments to kernel.\n");
} else if (err == CL_INVALID_EVENT_WAIT_LIST) {
Fatal("Cannot run Kernel: event_wait_list is NULL and num_events_in_wait_list > 0, or event_wait_list "
"is not NULL and num_events_in_wait_list is 0, or if event objects in event_wait_list "
"are not valid events..\n");
} else if (err == CL_OUT_OF_HOST_MEMORY) {
Fatal("Cannot run Kernel: Failure to allocate resources required by the OpenCL implementation on the host.\n");
} else {
Fatal("Cannot run Kernel: Unknown Error. (clEnqueueNDRangeKernel)");
}
}

// Flush the program & wait for the program to finish executing
if (clFlush(_queue)) printf("Flush Fail (Run)");
WaitForEventAndRelease(&runEvent);

// Copy the end result back to CPU memory side
if (clEnqueueReadBuffer(_queue, stackTracesOutCL, CL_TRUE, 0, outXsamples * sizeof(cl_float), stackTracesOut1DArray, 0, NULL, NULL))
Fatal("Cannot copy stackTracesOutCL from device to host\n");
if (clEnqueueReadBuffer(_queue, powerTracesOutCL, CL_TRUE, 0, outXsamples * sizeof(cl_float), powerTracesOut1DArray, 0, NULL, NULL))
Fatal("Cannot copy powerTracesOutCL from device to host\n");

// Release kernel and program
if (clReleaseKernel(kernel)) Fatal("Cannot release kernel\n");
if (clReleaseProgram(moveoutAndStackCLProgram)) Fatal("Cannot release program\n");

// Free device memory
clReleaseMemObject(prestackTracesCL);
clReleaseMemObject(stackTracesOutCL);
clReleaseMemObject(powerTracesOutCL);

// Release the context and queue
clReleaseCommandQueue(_queue);
clReleaseContext(_context);

// Return the time it took to run this program
return runTime;
}

double RunProg(unsigned int samplesPerTrace, unsigned int nTracesIn, unsigned int nTracesOut,
unsigned int localThreadCount, unsigned int deviceType) {

// Stores sizes of the various arrays
size_t tracesInxSample = nTracesIn * samplesPerTrace;
size_t tracesOutxSample = nTracesOut * samplesPerTrace;

// Allocate arrays
float* prestackTraces1D = (float*)malloc(tracesInxSample * sizeof(float));
float* stackTracesOut1Dgpu = (float*)calloc(tracesOutxSample, sizeof(float)); // output; zero-out
float* powerTracesOut1Dgpu = (float*)calloc(tracesOutxSample, sizeof(float)); // output; zero-out

// Count how much memory all of this is
if (_VERBOSE)
{
// Make sure it is consistent with above allocation
unsigned long allocatedMemory = 0;
allocatedMemory += tracesInxSample * sizeof(float);
allocatedMemory += tracesOutxSample * sizeof(float);
allocatedMemory += tracesOutxSample * sizeof(float);

printf("TOTAL MEMORY ALLOCATED: %s\n", byteConverter(allocatedMemory));
printf("Input Array Sizes: %s\n", byteConverter((unsigned int)(tracesInxSample * sizeof(float))));
printf("Output Array Sizes: %s\n", byteConverter((unsigned int)(tracesOutxSample * sizeof(float))));
}

// Fill in array with randoms
RandomFillArray(prestackTraces1D, (unsigned int)tracesInxSample);

// Init OpenCL using the desired device type
double preInitTime = GetTime();
int maxWorkGroupSize;
if (deviceType == 0) maxWorkGroupSize = InitOpenCL(_VERBOSE, CL_DEVICE_TYPE_ALL);
else if (deviceType == 1) maxWorkGroupSize = InitOpenCL(_VERBOSE, CL_DEVICE_TYPE_GPU);
else maxWorkGroupSize = InitOpenCL(_VERBOSE, CL_DEVICE_TYPE_CPU);
printf("Max work size for the device is: %d\n", maxWorkGroupSize);

// --- ACTUAL TEST ---
// Run OpenCL
double startTime = GetTime();
double runTime = RunOpenCL(prestackTraces1D, stackTracesOut1Dgpu, powerTracesOut1Dgpu, // arrays
nTracesOut, nTracesIn, samplesPerTrace, // ints
tracesInxSample, tracesOutxSample,
localThreadCount); // samples

// Display run time
double endTime = GetTime();
printf("Elapsed Time: %fsecs\n", (endTime - runTime));
printf(" %fsecs (Before Function Call)\n", (endTime - startTime));
printf(" %fsecs (Including Init)\n\n", (endTime - preInitTime));

// Free the 1D arrays
free(powerTracesOut1Dgpu);
free(stackTracesOut1Dgpu);
free(prestackTraces1D);

return (endTime - startTime);
}

关于为什么它在我的 GPU 上运行比我的 CPU 慢得多,我的第一个想法是可能是因为我在运行任何东西之前在显卡上传输了太多数据。或许更好的实现方式是将工作负载拆分为多次运行,这样代码就可以在处理更多数据的同时执行(我认为这是一回事)。然而,现在我想起来这几乎肯定是错误的,因为正如我所说的,我根据一个例子编写了这个程序,那个例子做了矩阵乘法,这个例子在 GPU 上的运行速度比我的 CPU 快得多。我真的不知道有什么区别。

最佳答案

问题是缓存;我经常从数组中读写。所以我做了一个版本,在写入数组之前尽可能多地写入局部变量,现在它在 GPU 上运行得更快。

关于c++ - 为什么我的程序在我的 CPU 设备上运行速度明显快于我的 GPU 设备?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/39968249/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com