gpt4 book ai didi

c++ - C/C++ 和 CUDA 中的双线性插值

转载 作者:可可西里 更新时间:2023-11-01 15:06:42 32 4
gpt4 key购买 nike

我想在CPU上模拟CUDA双线性插值的行为,但发现返回值tex2D好像不适合bilinear formula .

我猜是从 float 转换插值系数至 9 -bit 定点格式与 8小数位 [1]导致不同的值。

根据换算公式[2, line 106] ,转换的结果将与输入 float 相同当系数为 1/2^n , 与 n=0,1,..., 8 ,但我仍然(并非总是)收到奇怪的值。

下面我报告一个奇怪值的例子。在这种情况下,奇怪的值总是在 时发生。 id = 2*n+1 ,谁能告诉我为什么?

源数组:

Src[0][0] =  38;  
Src[1][0] = 39;
Src[0][1] = 118;
Src[1][1] = 13;

纹理定义:
static texture<float4, 2, cudaReadModeElementType> texElnt;
texElnt.addressMode[0] = cudaAddressModeClamp;
texElnt.addressMode[1] = cudaAddressModeClamp;
texElnt.filterMode = cudaFilterModeLinear;
texElnt.normalized = false;

内核函数:
static __global__ void kernel_texElnt(float* pdata, int w, int h, int c, float stride/*0.03125f*/) {
const int gx = blockIdx.x*blockDim.x + threadIdx.x;
const int gy = blockIdx.y*blockDim.y + threadIdx.y;
const int gw = gridDim.x * blockDim.x;
const int gid = gy*gw + gx;
if (gx >= w || gy >= h) {
return;
}

float2 pnt;
pnt.x = (gx)*(stride)/*1/32*/;
pnt.y = 0.0625f/*1/16*/;

float4 result = tex2D( texElnt, pnt.x + 0.5, pnt.y + 0.5f);
pdata[gid*3 + 0] = pnt.x;
pdata[gid*3 + 1] = pnt.y;
pdata[gid*3 + 2] = result.x;

}

CUDA 双线性结果
id  pnt.x   pnt.y   tex2D
0 0.00000 0.0625 43.0000000
1 0.03125 0.0625 42.6171875
2 0.06250 0.0625 42.6484375
3 0.09375 0.0625 42.2656250
4 0.12500 0.0625 42.2968750
5 0.15625 0.0625 41.9140625
6 0.18750 0.0625 41.9453125
7 0.21875 0.0625 41.5625000
8 0.25000 0.0625 41.5937500
9 0.28125 0.0625 41.2109375
0 0.31250 0.0625 41.2421875
10 0.34375 0.0625 40.8593750
11 0.37500 0.0625 40.8906250
12 0.40625 0.0625 40.5078125
13 0.43750 0.0625 40.5390625
14 0.46875 0.0625 40.1562500
15 0.50000 0.0625 40.1875000
16 0.53125 0.0625 39.8046875
17 0.56250 0.0625 39.8359375
18 0.59375 0.0625 39.4531250
19 0.62500 0.0625 39.4843750
20 0.65625 0.0625 39.1015625
21 0.68750 0.0625 39.1328125
22 0.71875 0.0625 38.7500000
23 0.75000 0.0625 38.7812500
24 0.78125 0.0625 38.3984375
25 0.81250 0.0625 38.4296875
26 0.84375 0.0625 38.0468750
27 0.87500 0.0625 38.0781250
28 0.90625 0.0625 37.6953125
29 0.93750 0.0625 37.7265625
30 0.96875 0.0625 37.3437500
31 1.00000 0.0625 37.3750000

CPU 结果:
// convert coefficient ((1-α)*(1-β)), (α*(1-β)), ((1-α)*β), (α*β) to fixed point format  

id pnt.x pnt.y tex2D
0 0.00000 0.0625 43.00000000
1 0.03125 0.0625 43.23046875
2 0.06250 0.0625 42.64843750
3 0.09375 0.0625 42.87890625
4 0.12500 0.0625 42.29687500
5 0.15625 0.0625 42.52734375
6 0.18750 0.0625 41.94531250
7 0.21875 0.0625 42.17578125
8 0.25000 0.0625 41.59375000
9 0.28125 0.0625 41.82421875
0 0.31250 0.0625 41.24218750
10 0.34375 0.0625 41.47265625
11 0.37500 0.0625 40.89062500
12 0.40625 0.0625 41.12109375
13 0.43750 0.0625 40.53906250
14 0.46875 0.0625 40.76953125
15 0.50000 0.0625 40.18750000
16 0.53125 0.0625 40.41796875
17 0.56250 0.0625 39.83593750
18 0.59375 0.0625 40.06640625
19 0.62500 0.0625 39.48437500
20 0.65625 0.0625 39.71484375
21 0.68750 0.0625 39.13281250
22 0.71875 0.0625 39.36328125
23 0.75000 0.0625 38.78125000
24 0.78125 0.0625 39.01171875
25 0.81250 0.0625 38.42968750
26 0.84375 0.0625 38.66015625
27 0.87500 0.0625 38.07812500
28 0.90625 0.0625 38.30859375
29 0.93750 0.0625 37.72656250
30 0.96875 0.0625 37.95703125
31 1.00000 0.0625 37.37500000

我在 my github [3] 上留下一个简单的代码, 运行该程序后,您将在 D:\ 中得到两个文件.

编辑 2014/01/20

我以不同的增量运行程序,找到了 tex2D 的规范 “当 alpha 乘以 beta 小于 0.00390625 时,tex2D 的返回值与双线性插值公式不匹配”​​

最佳答案

已经为这个问题提供了令人满意的答案,所以现在我只想提供一个关于双线性插值的有用信息概要,它如何在 C++ 中实现以及它可以在 CUDA 中完成的不同方式。

双线性插值背后的数学

假设原函数T(x, y)在笛卡尔规则网格点采样 (i, j)0 <= i < M1 , 0 <= j < M2ij整数。对于 y 的每个值,可以先用0 <= a < 1表示任意点 i + a包含在i之间和 i + 1 .然后,沿 y = j 进行线性插值轴(平行于 x 轴)在该点可以执行获得

enter image description here

哪里r(x,y)是对 T(x,y) 的样本进行插值的函数.对于行 y = j + 1 也可以这样做。 , 获得

enter image description here

现在,对于每个 i + a ,沿 y 的插值可以在样本上执行轴 r(i+a,j)r(i+a,j+1) .因此,如果使用 0 <= b < 1表示任意点 j + b位于 j 之间和 j + 1 ,然后沿 x = i + a 进行线性插值轴(平行于 y 轴)可以计算出来,所以得到最终结果

enter image description here

注意i之间的关系, j , a , b , xy以下是

enter image description here

C/C++ 实现

让我强调一下,这个实现以及下面的 CUDA 实现假设,正如开头所做的那样,T 的样本位于点的笛卡尔规则网格上 (i, j)0 <= i < M1 , 0 <= j < M2ij整数(单位间距)。此外,该例程以单精度、复杂 ( float2 ) 算术提供,但可以轻松转换为其他感兴趣的算术。

void bilinear_interpolation_function_CPU(float2 * __restrict__ h_result, float2 * __restrict__ h_data, 
float * __restrict__ h_xout, float * __restrict__ h_yout,
const int M1, const int M2, const int N1, const int N2){

float2 result_temp1, result_temp2;
for(int k=0; k<N2; k++){
for(int l=0; l<N1; l++){

const int ind_x = floor(h_xout[k*N1+l]);
const float a = h_xout[k*N1+l]-ind_x;

const int ind_y = floor(h_yout[k*N1+l]);
const float b = h_yout[k*N1+l]-ind_y;

float2 h00, h01, h10, h11;
if (((ind_x) < M1)&&((ind_y) < M2)) h00 = h_data[ind_y*M1+ind_x]; else h00 = make_float2(0.f, 0.f);
if (((ind_x+1) < M1)&&((ind_y) < M2)) h10 = h_data[ind_y*M1+ind_x+1]; else h10 = make_float2(0.f, 0.f);
if (((ind_x) < M1)&&((ind_y+1) < M2)) h01 = h_data[(ind_y+1)*M1+ind_x]; else h01 = make_float2(0.f, 0.f);
if (((ind_x+1) < M1)&&((ind_y+1) < M2)) h11 = h_data[(ind_y+1)*M1+ind_x+1]; else h11 = make_float2(0.f, 0.f);

result_temp1.x = a * h10.x + (-h00.x * a + h00.x);
result_temp1.y = a * h10.y + (-h00.y * a + h00.y);

result_temp2.x = a * h11.x + (-h01.x * a + h01.x);
result_temp2.y = a * h11.y + (-h01.y * a + h01.y);

h_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
h_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);

}
}
}
if/else上面代码中的语句只是边界检查。如果样本落在 [0, M1-1] x [0, M2-1] 之外,则设置为 0 .

标准 CUDA 实现

这是跟踪上述 CPU 的“标准”CUDA 实现。不使用纹理内存。
__global__ void bilinear_interpolation_kernel_GPU(float2 * __restrict__ d_result, const float2 * __restrict__ d_data, 
const float * __restrict__ d_xout, const float * __restrict__ d_yout,
const int M1, const int M2, const int N1, const int N2)
{
const int l = threadIdx.x + blockDim.x * blockIdx.x;
const int k = threadIdx.y + blockDim.y * blockIdx.y;

if ((l<N1)&&(k<N2)) {

float2 result_temp1, result_temp2;

const int ind_x = floor(d_xout[k*N1+l]);
const float a = d_xout[k*N1+l]-ind_x;

const int ind_y = floor(d_yout[k*N1+l]);
const float b = d_yout[k*N1+l]-ind_y;

float2 d00, d01, d10, d11;
if (((ind_x) < M1)&&((ind_y) < M2)) d00 = d_data[ind_y*M1+ind_x]; else d00 = make_float2(0.f, 0.f);
if (((ind_x+1) < M1)&&((ind_y) < M2)) d10 = d_data[ind_y*M1+ind_x+1]; else d10 = make_float2(0.f, 0.f);
if (((ind_x) < M1)&&((ind_y+1) < M2)) d01 = d_data[(ind_y+1)*M1+ind_x]; else d01 = make_float2(0.f, 0.f);
if (((ind_x+1) < M1)&&((ind_y+1) < M2)) d11 = d_data[(ind_y+1)*M1+ind_x+1]; else d11 = make_float2(0.f, 0.f);

result_temp1.x = a * d10.x + (-d00.x * a + d00.x);
result_temp1.y = a * d10.y + (-d00.y * a + d00.y);

result_temp2.x = a * d11.x + (-d01.x * a + d01.x);
result_temp2.y = a * d11.y + (-d01.y * a + d01.y);

d_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
d_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);

}
}

带有纹理获取的 CUDA 实现

这与上面的实现相同,但全局内存现在由纹理缓存访问。例如, T[i,j]被访问为
tex2D(d_texture_fetch_float,ind_x,ind_y);

(其中,当然 ind_x = iind_y = jd_texture_fetch_float 被假定为全局范围变量)而不是
d_data[ind_y*M1+ind_x];

请注意,这里没有利用硬连线纹理过滤功能。下面的例程与上面的例程具有相同的精度,并且可能比旧的 CUDA 体系结构更快。
__global__ void bilinear_interpolation_kernel_GPU_texture_fetch(float2 * __restrict__ d_result, 
const float * __restrict__ d_xout, const float * __restrict__ d_yout,
const int M1, const int M2, const int N1, const int N2)
{
const int l = threadIdx.x + blockDim.x * blockIdx.x;
const int k = threadIdx.y + blockDim.y * blockIdx.y;

if ((l<N1)&&(k<N2)) {

float2 result_temp1, result_temp2;

const int ind_x = floor(d_xout[k*N1+l]);
const float a = d_xout[k*N1+l]-ind_x;

const int ind_y = floor(d_yout[k*N1+l]);
const float b = d_yout[k*N1+l]-ind_y;

const float2 d00 = tex2D(d_texture_fetch_float,ind_x,ind_y);
const float2 d10 = tex2D(d_texture_fetch_float,ind_x+1,ind_y);
const float2 d11 = tex2D(d_texture_fetch_float,ind_x+1,ind_y+1);
const float2 d01 = tex2D(d_texture_fetch_float,ind_x,ind_y+1);

result_temp1.x = a * d10.x + (-d00.x * a + d00.x);
result_temp1.y = a * d10.y + (-d00.y * a + d00.y);

result_temp2.x = a * d11.x + (-d01.x * a + d01.x);
result_temp2.y = a * d11.y + (-d01.y * a + d01.y);

d_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
d_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);

}
}

纹理绑定(bind)可以根据
void TextureBindingBilinearFetch(const float2 * __restrict__ data, const int M1, const int M2)
{
size_t pitch;
float* data_d;
gpuErrchk(cudaMallocPitch((void**)&data_d,&pitch, M1 * sizeof(float2), M2));
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();
gpuErrchk(cudaBindTexture2D(0,&d_texture_fetch_float,data_d,&desc,M1,M2,pitch));
d_texture_fetch_float.addressMode[0] = cudaAddressModeClamp;
d_texture_fetch_float.addressMode[1] = cudaAddressModeClamp;
gpuErrchk(cudaMemcpy2D(data_d,pitch,data,sizeof(float2)*M1,sizeof(float2)*M1,M2,cudaMemcpyHostToDevice));

}

请注意,现在我们不需要 if/else边界检查,因为纹理会自动将落在 [0, M1-1] x [0, M2-1] 之外的样本钳位为零。采样区域,感谢说明
    d_texture_fetch_float.addressMode[0] = cudaAddressModeClamp;
d_texture_fetch_float.addressMode[1] = cudaAddressModeClamp;

带有纹理插值的 CUDA 实现

这是最后一个实现,它使用了纹理过滤的硬连线功能。
__global__ void bilinear_interpolation_kernel_GPU_texture_interp(float2 * __restrict__ d_result, 
const float * __restrict__ d_xout, const float * __restrict__ d_yout,
const int M1, const int M2, const int N1, const int N2)
{
const int l = threadIdx.x + blockDim.x * blockIdx.x;
const int k = threadIdx.y + blockDim.y * blockIdx.y;

if ((l<N1)&&(k<N2)) { d_result[k*N1+l] = tex2D(d_texture_interp_float, d_xout[k*N1+l] + 0.5f, d_yout[k*N1+l] + 0.5f); }
}

注意这个特性实现的插值公式和上面推导的一样,但是现在

enter image description here

哪里 x_B = x - 0.5y_B = y - 0.5 .这解释了 0.5指令中的偏移量
tex2D(d_texture_interp_float, d_xout[k*N1+l] + 0.5f, d_yout[k*N1+l] + 0.5f)

在这种情况下,纹理绑定(bind)应该如下完成
void TextureBindingBilinearInterp(const float2 * __restrict__ data, const int M1, const int M2)
{
size_t pitch;
float* data_d;
gpuErrchk(cudaMallocPitch((void**)&data_d,&pitch, M1 * sizeof(float2), M2));
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();
gpuErrchk(cudaBindTexture2D(0,&d_texture_interp_float,data_d,&desc,M1,M2,pitch));
d_texture_interp_float.addressMode[0] = cudaAddressModeClamp;
d_texture_interp_float.addressMode[1] = cudaAddressModeClamp;
d_texture_interp_float.filterMode = cudaFilterModeLinear; // --- Enable linear filtering
d_texture_interp_float.normalized = false; // --- Texture coordinates will NOT be normalized
gpuErrchk(cudaMemcpy2D(data_d,pitch,data,sizeof(float2)*M1,sizeof(float2)*M1,M2,cudaMemcpyHostToDevice));

}

请注意,正如其他答案中已经提到的, ab存储在 9 -bit 定点格式与 8小数值位,因此这种方法将非常快,但不如上述方法准确。

关于c++ - C/C++ 和 CUDA 中的双线性插值,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/21128731/

32 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com