
cuda - Accelerating k-means clustering on the GPU (CUDA)


I am fairly new to CUDA. For my first CUDA application I am trying to accelerate the k-means algorithm on a GPU (GTX 670).

Briefly, each thread works on a single point: it compares that point against all cluster centers and assigns the point to the center with the minimum distance (the kernel code, with comments, is shown below).

According to Nsight Visual Studio, I get 99.61% occupancy (1024 blocks of 1024 threads each), 99.34% streaming multiprocessor activity, 79.98% warp issue efficiency, no shared memory bank conflicts, 18.4 GFLOP/s of single-precision MUL and 55.2 GFLOP/s of single-precision ADD (the k-means kernel takes about 14.5 ms to complete with the given parameters).

According to Wikipedia, the peak performance of the GTX 670 is 2460 GFLOP/s, and I am nowhere near it. On top of that, some papers claim to reach more than half of peak performance. I cannot see how to optimize this kernel any further. Is there any optimization I can apply to it? Any advice or help is appreciated, and I can provide additional details on request.
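(Rough arithmetic on the numbers above, treating the two Nsight counters as simply additive: 18.4 GFLOP/s of MUL plus 55.2 GFLOP/s of ADD is about 73.6 GFLOP/s, i.e. roughly 3% of the quoted 2460 GFLOP/s peak.)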

Complete Code

Thanks in advance.

#define SIZE 1024*1024 //number of points
#define CENTERS 32 //number of cluster centroids
#define DIM 8 //dimension of each point and center
#define cudaTHREADSIZE 1024 //threads per block
#define cudaBLOCKSIZE SIZE/cudaTHREADSIZE //number of blocks for kernel

__global__ void kMeans(float *dp, float *dc, int *tag, int *membershipChangedPerBlock)
{
    //the total number of threads should equal the number of points,
    //because each thread works on a single point
    __shared__ unsigned char membershipChanged[cudaTHREADSIZE];
    __shared__ float dc_shared[CENTERS*DIM];

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int threadID = threadIdx.x;

    membershipChanged[threadIdx.x] = 0;

    //stage the centers in shared memory, because each and every thread reads them
    //(roughly +10% performance here)
    while(threadID < CENTERS*DIM){
        dc_shared[threadID] = dc[threadID];
        threadID += blockDim.x;
    }
    __syncthreads();

    while(tid < SIZE){
        int index, prevIndex;
        float dist, min_dist;

        index = 0;      //every point initially assigned to centroid 0
        prevIndex = 0;
        dist = 0;
        min_dist = 0;

        //squared Euclidean distance to center 0
        for(int dimIdx = 0; dimIdx < DIM; dimIdx++){
            min_dist += (dp[tid + dimIdx*SIZE] - dc_shared[dimIdx*CENTERS])*(dp[tid + dimIdx*SIZE] - dc_shared[dimIdx*CENTERS]);
        }

        //squared Euclidean distance to the other centers, keeping the minimum
        for(int centerIdx = 1; centerIdx < CENTERS; centerIdx++){
            dist = 0;
            for(int dimIdx = 0; dimIdx < DIM; dimIdx++){
                dist += (dp[tid + dimIdx*SIZE] - dc_shared[centerIdx + dimIdx*CENTERS])*(dp[tid + dimIdx*SIZE] - dc_shared[centerIdx + dimIdx*CENTERS]);
            }
            //if a shorter distance is found, switch to that centroid
            if(dist < min_dist){
                min_dist = dist;
                index = centerIdx;
            }
        }

        //if a point's cluster membership changes, flag it so total membership
        //changes can be computed later on
        if (tag[tid] != index) {
            membershipChanged[threadIdx.x] = 1;
        }
        tag[tid] = index;

        __syncthreads(); //sync before applying sum reduction to membership changes

        //sum reduction
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) {
                membershipChanged[threadIdx.x] += membershipChanged[threadIdx.x + s];
            }
            __syncthreads();
        }

        if (threadIdx.x == 0) {
            membershipChangedPerBlock[blockIdx.x] = membershipChanged[0];
        }
        tid += blockDim.x * gridDim.x;
    }
}
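For reference, here is a minimal host-side driver showing how a kernel with this signature could be launched with the 1024-blocks-of-1024-threads configuration described above. This is a sketch of my own, not the linked complete code; filling dp/dc with real data and full error checking are omitted.

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>
#include <numeric>

int main()
{
    float *dp, *dc;
    int *tag, *membershipChangedPerBlock;

    //device buffers: points (dimension-major, dp[tid + dim*SIZE]),
    //centers (dc[center + dim*CENTERS]), per-point tags, per-block change counts
    cudaMalloc(&dp, (size_t)SIZE * DIM * sizeof(float));
    cudaMalloc(&dc, (size_t)CENTERS * DIM * sizeof(float));
    cudaMalloc(&tag, SIZE * sizeof(int));
    cudaMalloc(&membershipChangedPerBlock, cudaBLOCKSIZE * sizeof(int));

    // ... copy the input points into dp and the initial centroids into dc ...
    cudaMemset(tag, 0, SIZE * sizeof(int));

    kMeans<<<cudaBLOCKSIZE, cudaTHREADSIZE>>>(dp, dc, tag, membershipChangedPerBlock);
    cudaDeviceSynchronize();

    //sum the per-block counters on the host to get the total number of points
    //whose membership changed in this iteration
    std::vector<int> changed(cudaBLOCKSIZE);
    cudaMemcpy(changed.data(), membershipChangedPerBlock,
               cudaBLOCKSIZE * sizeof(int), cudaMemcpyDeviceToHost);
    printf("membership changes: %d\n",
           (int)std::accumulate(changed.begin(), changed.end(), 0));

    cudaFree(dp); cudaFree(dc); cudaFree(tag); cudaFree(membershipChangedPerBlock);
    return 0;
}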

Best Answer

My suggestion is to compare your work with that of more experienced GPU developers. After watching this video I found a k-means implementation written by Bryan Catanzaro. You can find the source code here:

https://github.com/bryancatanzaro/kmeans

I am a beginner too, but IMHO it is better to use libraries such as Thrust. GPU programming is genuinely hard, and reaching maximum performance is difficult; Thrust will help you with that.
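As a small illustration of what Thrust buys you (my own sketch, not code from the linked repository): the hand-written sum reduction over membershipChangedPerBlock can be replaced by a single thrust::reduce call on the raw device pointer.

#include <thrust/device_ptr.h>
#include <thrust/reduce.h>

//sum the per-block membership-change counters written by the kMeans kernel,
//without writing a custom reduction kernel
int countMembershipChanges(int *membershipChangedPerBlock, int numBlocks)
{
    thrust::device_ptr<int> begin(membershipChangedPerBlock);
    return thrust::reduce(begin, begin + numBlocks); //defaults to a 0-initialized sum
}

The same pattern (device vectors plus transform/reduce calls) covers most of the reductions and assignments a k-means iteration needs.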

Regarding cuda - Accelerating k-means clustering on the GPU (CUDA), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29187479/
