gpt4 book ai didi

cuda - 分析我的 CUDA 内核的内存访问合并

转载 作者:行者123 更新时间:2023-12-04 14:31:02 31 4
gpt4 key购买 nike

我想通过 BS_x*BS_Y 线程将内容移动到共享内存来读取 (BS_X+1)*(BS_Y+1) 全局内存位置,并且我开发了以下代码。

int i       = threadIdx.x;
int j = threadIdx.y;
int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

int index1 = j*BLOCK_SIZE_Y+i;

int i1 = (index1)%(BLOCK_SIZE_X+1);
int j1 = (index1)/(BLOCK_SIZE_Y+1);

int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);

__shared__ double Ezx_h_shared_ext[BLOCK_SIZE_X+1][BLOCK_SIZE_Y+1];

Ezx_h_shared_ext[i1][j1]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];

if ((i2<(BLOCK_SIZE_X+1))&&(j2<(BLOCK_SIZE_Y+1)))
Ezx_h_shared_ext[i2][j2]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j2)*xdim+(blockIdx.x*BLOCK_SIZE_X+i2)];

在我的理解中,合并是顺序处理的连续内存读取的并行等价物。我现在如何检测全局内存访问是否已合并?我注意到有一个从 (i1,j1) 到 (i2,j2) 的索引跳转。
提前致谢。

最佳答案

我已经使用手写合并分析器评估了您的代码的内存访问。评估显示代码较少利用合并。这是您可能会发现有用的合并分析器:

#include <stdio.h>
#include <malloc.h>

typedef struct dim3_t{
int x;
int y;
} dim3;


// KERNEL LAUNCH PARAMETERS
#define GRIDDIMX 4
#define GRIDDIMY 4
#define BLOCKDIMX 16
#define BLOCKDIMY 16


// ARCHITECTURE DEPENDENT
// number of threads aggregated for coalescing
#define COALESCINGWIDTH 32
// number of bytes in one coalesced transaction
#define CACHEBLOCKSIZE 128
#define CACHE_BLOCK_ADDR(addr,size) (addr*size)&(~(CACHEBLOCKSIZE-1))


int main(){
// fixed dim3 variables
// grid and block size
dim3 blockDim,gridDim;
blockDim.x=BLOCKDIMX;
blockDim.y=BLOCKDIMY;
gridDim.x=GRIDDIMX;
gridDim.y=GRIDDIMY;

// counters
int unq_accesses=0;
int *unq_addr=(int*)malloc(sizeof(int)*COALESCINGWIDTH);
int total_unq_accesses=0;

// iter over total number of threads
// and count the number of memory requests (the coalesced requests)
int I, II, III;
for(I=0; I<GRIDDIMX*GRIDDIMY; I++){
dim3 blockIdx;
blockIdx.x = I%GRIDDIMX;
blockIdx.y = I/GRIDDIMX;
for(II=0; II<BLOCKDIMX*BLOCKDIMY; II++){
if(II%COALESCINGWIDTH==0){
// new coalescing bunch
total_unq_accesses+=unq_accesses;
unq_accesses=0;
}
dim3 threadIdx;
threadIdx.x=II%BLOCKDIMX;
threadIdx.y=II/BLOCKDIMX;

////////////////////////////////////////////////////////
// Change this section to evaluate different accesses //
////////////////////////////////////////////////////////
// do your indexing here
#define BLOCK_SIZE_X BLOCKDIMX
#define BLOCK_SIZE_Y BLOCKDIMY
#define xdim 32
int i = threadIdx.x;
int j = threadIdx.y;
int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

int index1 = j*BLOCK_SIZE_Y+i;

int i1 = (index1)%(BLOCK_SIZE_X+1);
int j1 = (index1)/(BLOCK_SIZE_Y+1);

int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);
// calculate the accessed location and offset here
// change the line "Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];" to
int addr = (blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1);
int size = sizeof(double);
//////////////////////////
// End of modifications //
//////////////////////////

printf("tid (%d,%d) from blockid (%d,%d) accessing to block %d\n",threadIdx.x,threadIdx.y,blockIdx.x,blockIdx.y,CACHE_BLOCK_ADDR(addr,size));
// check whether it can be merged with existing requests or not
short merged=0;
for(III=0; III<unq_accesses; III++){
if(CACHE_BLOCK_ADDR(addr,size)==CACHE_BLOCK_ADDR(unq_addr[III],size)){
merged=1;
break;
}
}
if(!merged){
// new cache block accessed over this coalescing width
unq_addr[unq_accesses]=CACHE_BLOCK_ADDR(addr,size);
unq_accesses++;
}
}
}
printf("%d threads make %d memory transactions\n",GRIDDIMX*GRIDDIMY*BLOCKDIMX*BLOCKDIMY, total_unq_accesses);
}

该代码将为网格的每个线程运行并计算合并请求的数量、内存访问合并的度量。

要使用分析器,请将代码的索引计算部分粘贴到指定区域,并将内存访问(数组)分解为“地址”和“大小”。我已经为你的代码做了这个,索引是:
int i       = threadIdx.x;
int j = threadIdx.y;
int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

int index1 = j*BLOCK_SIZE_Y+i;

int i1 = (index1)%(BLOCK_SIZE_X+1);
int j1 = (index1)/(BLOCK_SIZE_Y+1);

int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);

并且内存访问是:
Ezx_h_shared_ext[i1][j1]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];

分析器报告 4096 个线程访问 4064 个缓存块。为您的实际网格和块大小运行代码并分析合并行为。

关于cuda - 分析我的 CUDA 内核的内存访问合并,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/13771538/

31 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com