c - Debugging nested for loops that use a matrix algorithm and constants

Reposted. Author: 行者123. Updated: 2023-11-30 14:27:15
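This set of nested for loops works correctly for M=64 and N=64, but not when I set M=128 and N=64. I have another program that checks the result against the correct matrix multiplication values. Intuitively it seems like it should still work, but it gives me the wrong answer.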

for(int m=64;m<=M;m+=64){
    for(int n=64;n<=N;n+=64){
        for(int i = m-64; i < m; i+=16){

            float *A_column_start, *C_column_start;
            __m128 c_1, c_2, c_3, c_4, a_1, a_2, a_3, a_4, mul_1,
                   mul_2, mul_3, mul_4, b_1;
            int j, k;

            for(j = m-64; j < m; j++){

                //Load 16 contiguous column aligned elements from matrix C in
                //c_1-c_4 registers

                C_column_start = C+i+j*M;

                c_1 = _mm_loadu_ps(C_column_start);
                c_2 = _mm_loadu_ps(C_column_start+4);
                c_3 = _mm_loadu_ps(C_column_start+8);
                c_4 = _mm_loadu_ps(C_column_start+12);

                for (k=n-64; k < n; k+=2){

                    //Load 16 contiguous column aligned elements from matrix A to
                    //the a_1-a_4 registers

                    A_column_start = A+k*M;

                    a_1 = _mm_loadu_ps(A_column_start+i);
                    a_2 = _mm_loadu_ps(A_column_start+i+4);
                    a_3 = _mm_loadu_ps(A_column_start+i+8);
                    a_4 = _mm_loadu_ps(A_column_start+i+12);

                    //Load a value to register b_1 to act as a "B" (or "A^T")
                    //element to multiply against the A matrix

                    b_1 = _mm_load1_ps(A_column_start+j);

                    mul_1 = _mm_mul_ps(a_1, b_1);
                    mul_2 = _mm_mul_ps(a_2, b_1);
                    mul_3 = _mm_mul_ps(a_3, b_1);
                    mul_4 = _mm_mul_ps(a_4, b_1);

                    //Add together all values of the multiplied A and "B"
                    //(or "A^T") matrix elements

                    c_4 = _mm_add_ps(c_4, mul_4);
                    c_3 = _mm_add_ps(c_3, mul_3);
                    c_2 = _mm_add_ps(c_2, mul_2);
                    c_1 = _mm_add_ps(c_1, mul_1);

                    //Move over one column in A, and load the next 16 contiguous
                    //column aligned elements from matrix A to the a_1-a_4 registers

                    A_column_start+=M;

                    a_1 = _mm_loadu_ps(A_column_start+i);
                    a_2 = _mm_loadu_ps(A_column_start+i+4);
                    a_3 = _mm_loadu_ps(A_column_start+i+8);
                    a_4 = _mm_loadu_ps(A_column_start+i+12);

                    //Load a value to register b_1 to act as a "B" (or "A^T")
                    //element to multiply against the A matrix

                    b_1 = _mm_load1_ps(A_column_start+j);

                    mul_1 = _mm_mul_ps(a_1, b_1);
                    mul_2 = _mm_mul_ps(a_2, b_1);
                    mul_3 = _mm_mul_ps(a_3, b_1);
                    mul_4 = _mm_mul_ps(a_4, b_1);

                    //Add together all values of the multiplied A and "B"
                    //(or "A^T") matrix elements

                    c_4 = _mm_add_ps(c_4, mul_4);
                    c_3 = _mm_add_ps(c_3, mul_3);
                    c_2 = _mm_add_ps(c_2, mul_2);
                    c_1 = _mm_add_ps(c_1, mul_1);

                }
                //Store the added up C values back to memory

                _mm_storeu_ps(C_column_start, c_1);
                _mm_storeu_ps(C_column_start+4, c_2);
                _mm_storeu_ps(C_column_start+8, c_3);
                _mm_storeu_ps(C_column_start+12, c_4);

            }

        }
    }
}
}
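
The question mentions a separate program that checks the result but does not show it. A minimal scalar sketch of such a checker could look like the following. It assumes, based on the index expressions `C+i+j*M` and `A_column_start+j` in the loops, that A is an M x N column-major matrix, C is M x M column-major, and the kernel is meant to compute C += A * A^T. The function names `sgemm_aat_reference` and `count_mismatches` are hypothetical and not from the original post.

```c
#include <assert.h>

/* Scalar reference (assumption: the kernel computes C += A * A^T).
 * A is M x N and C is M x M, both column-major with leading dimension M. */
static void sgemm_aat_reference(int M, int N, const float *A, float *C)
{
    assert(M > 0 && N > 0);
    for (int j = 0; j < M; j++) {            /* column of C, row of A */
        for (int k = 0; k < N; k++) {        /* column of A */
            float b = A[j + k * M];          /* element of A^T: A(j, k) */
            for (int i = 0; i < M; i++) {    /* row of C, row of A */
                C[i + j * M] += A[i + k * M] * b;
            }
        }
    }
}

/* Returns how many of the M*M entries differ by more than tol. */
static int count_mismatches(int M, const float *got, const float *want, float tol)
{
    int bad = 0;
    for (int idx = 0; idx < M * M; idx++) {
        float d = got[idx] - want[idx];
        if (d < 0.0f) d = -d;
        if (d > tol) bad++;
    }
    return bad;
}
```

Running the SIMD loops and this reference on the same zero-initialized C and comparing them with `count_mismatches` shows not only whether the result is wrong but also how many entries are affected, which helps narrow down which loop bound is responsible.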

Best answer

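My guess is that the M you use in this line of your code

C_column_start = C+i+j*M;

should be m instead. The same may apply to other lines that use M. However, I don't fully understand your code, because you haven't explained what it is meant to do, and I'm not a math programmer.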
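
A guess like this can be tested without any SIMD by counting which entries of C the loop bounds ever reach. The sketch below assumes that C is M x M column-major with leading dimension M (so the stride in `j*M` is kept as it is) and that M and N are multiples of 64. `count_visited` is a hypothetical helper, and the `full_j` switch is an illustration of one possible change, not something proposed in the original post.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts the entries of the M x M matrix C that the loop nest touches.
 * full_j == 0: j runs over [m-64, m) as in the question.
 * full_j != 0: j runs over [0, M) (assumed alternative, for comparison).
 * The question's i loop steps by 16 and handles 16 rows per step, which
 * covers the same rows as stepping by 1 here. Returns -1 if allocation fails. */
static int count_visited(int M, int N, int full_j)
{
    assert(M > 0 && N > 0 && M % 64 == 0 && N % 64 == 0);
    unsigned char *seen = calloc((size_t)M * (size_t)M, 1);
    if (seen == NULL) return -1;

    for (int m = 64; m <= M; m += 64) {
        for (int n = 64; n <= N; n += 64) {
            for (int i = m - 64; i < m; i++) {
                int j_begin = full_j ? 0 : m - 64;
                int j_end   = full_j ? M : m;
                for (int j = j_begin; j < j_end; j++) {
                    seen[i + j * M] = 1;   /* same index as C+i+j*M */
                }
            }
        }
    }

    int count = 0;
    for (int idx = 0; idx < M * M; idx++) count += seen[idx];
    free(seen);
    return count;
}
```

Under these assumptions, `count_visited(64, 64, 0)` reaches all 4096 entries, while `count_visited(128, 64, 0)` reaches only 8192 of 16384, because i and j always stay inside the same 64-wide block and the off-diagonal blocks of C are never written. `count_visited(128, 64, 1)` reaches all 16384. That would be consistent with the code working for M=64 and failing for M=128, and it suggests looking at the range of j rather than at the stride.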

Regarding "c - Debugging nested for loops that use a matrix algorithm and constants", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/8077037/