gpt4 book ai didi

algorithm - 二进制矩阵乘法 bit twiddling hack

转载 作者:塔克拉玛干 更新时间:2023-11-03 02:35:52 29 4
gpt4 key购买 nike

摘要

您好,假设您有两个不同的独立 64 位二进制矩阵 AT(T 是另一个存储在转置矩阵中的矩阵形式,使用矩阵的转置版本允许在乘法期间对 T 的行而不是列进行操作,这对于二进制算术来说非常酷)并且你想要将这些矩阵相乘唯一的事情就是矩阵乘法结果被截断为 64 位,如果您在某些特定矩阵单元中产生大于 1 的值,则生成的矩阵单元将包含 1,否则为 0

例子

   A        T
00000001 01111101
01010100 01100101
10010111 00010100
10110000 00011000 <-- This matrix is transposed
11000100 00111110
10000011 10101111
11110101 11000100
10100000 01100010

二进制和传统乘法结果:

 Binary  Traditional
11000100 11000100
11111111 32212121
11111111 32213421
11111111 21112211
11101111 22101231
11001111 11001311
11111111 54213432
11001111 11001211

问题

如何以最有效的方式以上述方式将这些矩阵相乘?

附言

我试图利用二进制 (即 & 运算符)而不是对单独的位执行乘法,在这种情况下,我必须为乘法准备数据:

ulong u;

u = T & 0xFF;
u = (u << 00) + (u << 08) + (u << 16) + (u << 24)
+ (u << 32) + (u << 40) + (u << 48) + (u << 56);

现在通过对两个整数 Au 执行二进制 and 它将产生以下结果:

   A        u        R        C
00000001 01111101 00000001 1
01010100 01111101 01010100 3
10010111 01111101 00010101 3
10110000 01111101 00110000 2
11000100 01111101 01000100 2
10000011 01111101 00000001 1
11110101 01111101 01110101 5
10100000 01111101 00100000 1

在上面的示例中,R 包含 A 位乘以 u 位的结果,为了获得最终值,我们必须 对一行中的所有位求和。请注意,C 列包含的值等于在上面生成的 Traditional 矩阵乘法的第一列中找到的值。问题是,在这一步中,我必须对一个单独的位进行操作,我认为这是次优的方法,我已经通读了 http://graphics.stanford.edu/~seander/bithacks.html寻找一种方法来并行执行此操作但没有运气,如果有人对如何将 R 列中的值“展平”和“合并”到生成的 64 位矩阵中有任何想法,我会如果你给我几行,我将不胜感激,

谢谢,

编辑

非常感谢 David Eisenstat,最终算法如下所示:

var A = ...;
var T = ...; // T == transpose(t), t is original matrix, algorithm works with transposed matrix

var D = 0x8040201008040201UL;

U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D);

下面这段代码:

    public static void Main (string[] args){
ulong U;
var Random = new Xor128 ();

var timer = DateTime.Now;

var A = Random.As<IUniformRandom<UInt64>>().Evaluate();
var T = Random.As<IUniformRandom<UInt64>>().Evaluate();

var steps = 10000000;

for (var i = 0; i < steps; i++) {
ulong r = 0;

var d = 0x8040201008040201UL;

U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d);
}

Console.WriteLine (DateTime.Now - timer);


var m1 = new Int32[8,8];
var m2 = new Int32[8,8];
var m3 = new Int32[8,8];

for (int row = 0; row < 8; row++) {
for (int col = 0; col < 8; col++) {
m1 [row, col] = Random.As<IUniformRandom<Int32>> ().Evaluate(0, 1);
m2 [row, col] = Random.As<IUniformRandom<Int32>> ().Evaluate(0, 1);
m3 [row, col] = Random.As<IUniformRandom<Int32>> ().Evaluate(0, 1);
}
}

timer = DateTime.Now;

for (int i = 0; i < steps; i++) {
for (int row = 0; row < 8; row++) {
for (int col = 0; col < 8; col++) {
var sum = 0;

for (int temp = 0; temp < 8; temp++) {
sum += m1 [row, temp] * m2 [temp, row];
}

m3 [row, col] = sum;
}
}
}

Console.WriteLine (DateTime.Now - timer);
}

显示以下结果:

00:00:02.4035870
00:00:57.5147150

在 Mac OS X/Mono 下性能提高了 23 倍,谢谢大家

最佳答案

我不确定效率如何,但可以尝试一下。以下指令序列计算乘积 A * T' 的主对角线。将 T 和 D 都旋转 8 位并重复 7 次以上的迭代。

// uint64_t A, T;
uint64_t D = UINT64_C(0x8040201008040201);
uint64_t P = A & T;
// test whether each byte is nonzero
P |= P >> 1;
P |= P >> 2;
P |= P >> 4;
P &= UINT64_C(0x0101010101010101);
// fill each nonzero byte with ones
P *= 255; // or P = (P << 8) - P;
// leave only the current diagonal
P &= D;

关于algorithm - 二进制矩阵乘法 bit twiddling hack,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/18447321/

29 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com