c# - 缓存友好优化 : Object oriented matrix multiplication and in-function tiled matrix multiplication-6ren

c# - 缓存友好优化 : Object oriented matrix multiplication and in-function tiled matrix multiplication

转载作者：太空宇宙更新时间：2023-11-04 14:04:03

24

4

在使用 this implementation 编写了表示两个一维缓冲区中的整个矩阵的矩阵类之后，我已经达到了我项目中的矩阵混合部分，现在倾向于一些缓存友好的优化。偶然发现了两个选项(问题在本页的下半部分):

1)在乘法时间内选择分块/分块子矩阵。

在 C++ DLL 函数中完成，因此没有函数开销。
由于代码会更复杂，因此更难应用额外的优化。

2)从子矩阵类(较小的补丁)构建矩阵类，因此乘法通常在子矩阵类上完成。

面向对象的方法为子矩阵的额外优化留出了空间。
C# 的对象 header 和填充行为可以帮助克服关键步骤吗？
函数开销在调用多次而不是几次后可能会成为问题。

示例矩阵乘法:C=A.B

 A
 1 2 3 4   is used as  1 2    3 4  
 4 3 4 2               4 3    4 2  
 1 3 1 2                           
 1 1 1 2               1 3    1 2  
                       1 1    1 2


 B
 1 1 1 1  --->         1 1    1 1
 1 1 1 1               1 1    1 1
 1 1 1 1 
 1 1 1 1               1 1    1 1
                       1 1    1 1


 Multiplication:       1 2 * 1 1   +   3 4 * 1 1    ==>  upper-left tile of result
                       4 3   1 1       4 2   1 1  


                       same for the upper-right of result

                       1 3 * 1 1   +   1 2 * 1 1    ==> lower left tile of result
                       1 1   1 1       1 2   1 1

                       same for lower-right tile of result

        Multiplication is O(n³) but summation is O(n²).

问题:有没有人同时尝试过(函数式和面向对象)并进行了性能比较？现在，在没有任何这些针对缓存的优化的情况下，我的天真乘法需要:

 Matrix Size   Single Threaded Time    Multithreaded Time
 * 128x128   :     5    ms                 1ms-5ms(time sample error is bigger)
 * 256x256   :     25   ms                 7     ms
 * 512x512   :     140  ms                 35    ms
 * 1024x1024 :     1.3  s                  260   ms
 * 2048x2048 :     11.3 s                  2700  ms 
 * 4096x4096 :     88.1 s                  24    s
 * 8192x8192 :     710  s                  177   s

 Giga-multiplications of variables per second
               Single threaded         Multithreaded           Multi/single ratio               
 * 128x128   :     0.42                    2.0 - 0.4               ?
 * 256x256   :     0.67                    2.39                   3.67x  
 * 512x512   :     0.96                    3.84                   4.00x
 * 1024x1024 :     0.83                    3.47                   4.18x
 * 2048x2048 :     0.76                    3.18                   4.18x
 * 4096x4096 :     0.78                    2.86                   3.67x
 * 8192x8192 :     0.77                    3.09                   4.01x

(1.4GHz fx8150 的平均结果，使用 32 位 float 的 avx 优化代码)(Visual Studio C# 的 Parallel.For() 中 dll 函数中的 c++ avx-intrinsics)

以上哪种大小的矩阵可能会遭受缓存未命中、关键步幅和其他不良事件的影响？您知道如何获取使用内部函数的性能计数器吗？

谢谢你的时间。

编辑: DLL 内联优化:

 Matrix Size   Single Threaded Time    Multithreaded Time           Multi/Single radio
 * 128x128   :  1     ms(%400)          390us avrage in 10k iterations(6G mult /s)
 * 256x256   :  12    ms(%108 faster)   2     ms   (%250 faster)         6.0x
 * 512x512   :  73    ms(%92 faster)    15    ms   (%133 faster)         4.9x
 * 1024x1024 :  1060  ms(%22 faster)    176   ms   (%48 faster)          6.0x
 * 2048x2048 :  10070 ms(%12 faster)    2350  ms   (%15 faster)          4.3x
 * 4096x4096 :  82.0  s(%7 faster)      22    s    (%9 faster)           3.7x
 * 8192x8192 :  676   s(%5 faster)      174   s    (%2 faster)           4.1x

在内联之后，较小乘法的阴影性能变得可见。仍然存在 DLL-function-C# 开销。 1024x1024 的情况似乎是缓存未命中的起点。虽然工作量仅增加了 7 倍，但执行时间却增加了 15 倍。

编辑::本周将使用面向对象的方法尝试 Strassen 的 3 层深度算法。主矩阵将由 4 个子矩阵组成。然后他们将分别由4个子子组成。然后他们将分别由4个sub-sub-subs组成。这应该提供接近 (8/7)(8/7)(8/7)= +%50 的加速。如果有效，会将 DLL 函数转换为补丁优化函数，这将使用更多缓存。

最佳答案

将 Strassen 算法仅应用于一层(例如 256x256 中的四层作为 512x512)作为一种面向对象的方法(父类(super class)是 Strassen，子矩阵是 matrix类):

 Matrix Size   Single Threaded Time    Multithreaded Time           Multi/Single radio
 * 128x128   :  **%50 slowdown**        **slowdown**              
 * 256x256   :  **%30 slowdown**        **slowdown**         
 * 512x512   :  **%10 slowdown**        **slowdown**            
 * 1024x1024 :  540   ms(%96 faster)    130   ms   (%35 faster)     4.15     
 * 2048x2048 :  7500  ms(%34 faster)    1310  ms   (%79 faster)     5.72    
 * 4096x4096 :  70.2  s(%17 faster)     17    s    (%29 faster)     4.13
 * 6144x6144 :          x               68    s               
 * 8192x8192 :  outOfMemoryException    outOfMemoryException

DLL 函数和 C# 之间的开销仍然有效，因此小矩阵无法变得更快。但是当有加速时，它总是超过 8/7(%14)，因为使用更小的 block 更有利于缓存使用。

将编写一个基准类，反复测试 Stressen 算法与朴素算法的不同叶子大小，以找到临界大小。 (对于我的系统，它是 512x512)。

父类(super class)将递归构建子矩阵树，直到它达到 512x512 大小，并将对 512x512 节点使用朴素算法。然后在 DLL 函数中，一个修补/阻塞算法(将在下周添加)将使它更快一些。但我不知道如何选择合适的补丁大小，因为我不知道如何获得 cpu 的缓存行大小。将在递归 Strassen 完成后搜索它。

我对 Strassen 算法的实现需要五倍的内存(正在处理)。

编辑:一些递归已完成，在结果出现时更新表格。

 Matrix Size   Single Threaded Time    Multithreaded Time           
 * 2048x2048 :  x                       872     ms  average(double layer)    
 * 2560x2560 :  x                       1527    ms  average(double layer)

并行化树解析，减少内存占用并引入完全递归:

 Matrix Size   Single Threaded Time    Multithreaded Time                         m/s
 * 1024x1024 :  547   ms               123     ms  average(single layer)          4.45x
 * 2048x2048 :  3790  ms               790     ms  average(double layer)          4.79x
 * 4096x4096 :  26.4  s                5440    ms  average(triple layer)          4.85x
 * 8192x8192 :  185   s                38      s   average(quad layer)            4.87x
 * 8192x8192(4GHz):   x                15      s   average(quad layer)            4.87x

每秒乘法 (x10^9):

 Matrix Size   Single Threaded         Multithreaded                          
 * 1024x1024 :  1.71                   7.64     (single layer)          
 * 2048x2048 :  1.73                   8.31     (double layer)          
 * 4096x4096 :  1.74                   8.45     (triple layer)         
 * 8192x8192 :  1.74                   8.48     (quad layer)           
 * 8192x8192(4GHz):   x                21.49    (quad layer)           
 Strassen's cpu flops is multiplied by 7/8 for each layer.

刚发现使用 opencl 使用价格相似的 gpu 可以在 1 秒内完成 8kx8k。

关于c# - 缓存友好优化 : Object oriented matrix multiplication and in-function tiled matrix multiplication，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/17785736/

24

4

0

文章推荐： css - 在 float div 中居中固定 div

文章推荐： java - 如何将图标添加到抽屉导航中

jquery - 从文件输入中获取文件名或文本 multiple=multiple
我正在尝试设计我的输入:文件。以下 SO 问题让我完成了 95% 的任务。区别在于我使用的是 HTML5 multiple=multiple 属性。 How to style "input file"
Python Pandas : Store multiple time series of variable length for multiple attributes for multiple users
我一直在进行一项实验，其中多个调查参与者使用可穿戴技术聆听多首音乐来跟踪多条信息，两个例子是 BPM(心率)和 T(体温)。目标是衡量每首音乐(以用户反馈为特征)对人类情感的影响。目前，所有数据都
PHP : multiple inputs into other multiple inputs
我使用 jquery 添加/删除输入我使用append为日期/收入添加多个Tr 我还使用另一个附加来添加多个 td 以获取同一日期 Tr 中的收入我添加多个日期输入，并在此表中添加多个收入输入我
安卓工作室 : Multiple APKs for Multiple Modules
在 Android 中，有一种方法可以为项目中的所有模块生成签名的 APK。例如。我有以下项目 Project -- Library Module -- Module 1 -- Modul
mysql - 数据库设计: How to query multiple to multiple table
我有一个用于网站展示的系统。展览数据可能来自差异表中的多个数据。喜欢这个设计: Table [ExhibitionType] used for differentiate category. Ta
ios - 什么是最佳实践 : Multiple UIViews or Multiple UIViewControllers
关闭。这个问题是opinion-based .它目前不接受答案。想要改进这个问题？更新问题，以便 editing this post 可以用事实和引用来回答它. 关闭 8 年前。 Improve
iOS : Handle multiple uilocalnotification with multiple same viewcontroller
我正在使用 UILocalnotification...收到通知时，当应用程序处于事件模式时我打开 viewcontroller...但是如果同时收到多个通知...我如何打开多个 viewcontro
PHP/MySQL : UPDATE multiples values with multiples WHERE
我遇到的问题是一个策略浏览器游戏，它有 7 种类型的值。问题如下: 我在 $_POST 中获得了 7 个不同的值，包括从索引 unit_1 到索引 unit_7。这 7 个值是 0 到 20 之间的整
python : How to search for multiple paterns in multiple files
这个问题已经有答案了: Search Large Text File for Thousands of strings (3 个回答) 已关闭10 年前。我想在多个文件上“grep”多个正则表达式。
javascript - 大型数据集的单个 + SQL 查询搜索
我正在尝试根据进行搜索我通过运行代码从 select 中获取值: for($i=0;$i= '$age_from' AND users.user_age = '$age_from' AND u

首页

博学

6Ren·AI

商城

c# - 缓存友好优化 : Object oriented matrix multiplication and in-function tiled matrix multiplication