
performance - Haskell math performance on multiply-add operation

Reposted. Author: 行者123. Updated: 2023-12-03 10:29:20

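I'm writing a game in Haskell, and my current pass at the UI involves a lot of procedural generation of geometry. I am currently focused on measuring the performance of one particular operation (C-ish pseudocode):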

Vec4f multiplier, addend;
Vec4f vecList[];
for (int i = 0; i < count; i++)
    vecList[i] = vecList[i] * multiplier + addend;

That is, a bog-standard multiply-add of four floats, the kind of thing that is ripe for SIMD optimization.

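The result is going to an OpenGL vertex buffer, so it has to be dumped into a flat C array eventually. For the same reason, the calculations should probably be done on the C 'float' type.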
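As a minimal sketch of what "a flat C array of C floats" can look like from the Haskell side, the following uses only the `Foreign` modules from `base` to apply the multiply-add in place to a buffer of `CFloat`. The names `multAddInPlace` and `multAddList` are hypothetical, introduced here for illustration; this is a plain scalar loop, not a tuned or SIMD implementation.

```haskell
import Foreign.C.Types (CFloat)
import Foreign.Marshal.Array (peekArray, withArrayLen)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekElemOff, pokeElemOff)

-- | Apply v * m + a to every group of four floats in a flat buffer,
-- overwriting the buffer. 'count' is the number of 4-float vectors,
-- not the number of floats. (Hypothetical helper, not from the post.)
multAddInPlace :: (CFloat, CFloat, CFloat, CFloat)  -- multiplier lanes
               -> (CFloat, CFloat, CFloat, CFloat)  -- addend lanes
               -> Int
               -> Ptr CFloat
               -> IO ()
multAddInPlace (mx, my, mz, mw) (ax, ay, az, aw) count buf = go 0
  where
    go i
      | i >= count = return ()
      | otherwise  = do
          let base = i * 4
          lane base       mx ax
          lane (base + 1) my ay
          lane (base + 2) mz az
          lane (base + 3) mw aw
          go (i + 1)

    -- Read one float, apply the multiply-add, write it back.
    lane j m a = do
      v <- peekElemOff buf j
      pokeElemOff buf j (v * m + a)

-- | Convenience wrapper for experiments: copy a list into a temporary
-- C array, run one pass over it, and read the result back.
multAddList :: (CFloat, CFloat, CFloat, CFloat)
            -> (CFloat, CFloat, CFloat, CFloat)
            -> [CFloat]
            -> IO [CFloat]
multAddList m a xs =
  withArrayLen xs $ \n buf -> do
    multAddInPlace m a (n `div` 4) buf
    peekArray n buf
```

A pointer of this kind is the sort of thing that could be handed to a vertex-buffer upload call; the list wrapper exists only so the loop can be tried from GHCi.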

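I've looked for either a library or a native idiomatic solution to do this sort of thing quickly in Haskell, but every solution I've come up with seems to hover around 2% of the performance (that is, 50x slower) of C compiled by GCC with the right flags. Granted, I started with Haskell only a couple of weeks ago, so my experience is limited, which is why I'm coming to you. Can any of you offer suggestions for a faster Haskell implementation, or pointers to documentation on how to write high-performance Haskell code?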

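First, the most recent Haskell solution (it clocks in at about 12 seconds). I tried the bang-patterns stuff from this SO post, but AFAICT it didn't make a difference. Replacing 'multAdd' with '(\i v -> v * 4)' brought the execution time down to 1.9 seconds, so the bitwise stuff (and the consequent challenges to automatic optimization) doesn't seem to be too much at fault.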
{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=native #-}

import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign.C.Types
import Data.Bits

repCount = 10000
arraySize = 20000

a = fromList $ [0.2::CFloat, 0.1, 0.6, 1.0]
m = fromList $ [0.99::CFloat, 0.7, 0.8, 0.6]

multAdd :: Int -> CFloat -> CFloat
multAdd !i !v = v * (m ! (i .&. 3)) + (a ! (i .&. 3))

multList :: Int -> Vector CFloat -> Vector CFloat
multList !count !src
    | count <= 0 = src
    | otherwise  = multList (count-1) $ V.imap multAdd src

main = do
    print $ Data.Vector.Storable.sum $ multList repCount $
        Data.Vector.Storable.replicate (arraySize*4) (0::CFloat)
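To make the indexing in `multAdd` above concrete: for a non-negative flat index `i`, `i .&. 3` is the same as `i mod 4`, so it selects which of the four lanes of the multiplier and addend applies to that element. The sketch below shows that mapping, plus a list-based reference version of one pass that can be used to check results. `laneOf` and `multAddRef` are hypothetical names, and the reference version makes no attempt to be fast; it assumes the multiplier and addend lists are non-empty (four elements each in the question).

```haskell
import Data.Bits ((.&.))

-- | Lane (0..3) of the flat element at index i, as computed in multAdd.
laneOf :: Int -> Int
laneOf i = i .&. 3

-- | Reference version of a single pass over a flat list of floats.
-- The multiplier and addend lists are cycled, so with four elements
-- each they line up with the lanes chosen by 'laneOf'.
multAddRef :: [Float] -> [Float] -> [Float] -> [Float]
multAddRef ms as vs =
  zipWith3 (\m a v -> v * m + a) (cycle ms) (cycle as) vs
```

For example, `map laneOf [0..9]` evaluates to `[0,1,2,3,0,1,2,3,0,1]`.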

Here's what I have in C. The code has a few #ifdefs that prevent it from being compiled as-is; scroll down for the test driver.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef float v4fs __attribute__ ((vector_size (16)));
typedef struct { float x, y, z, w; } Vector4;

void setv4(v4fs *v, float x, float y, float z, float w) {
    float *a = (float*) v;
    a[0] = x;
    a[1] = y;
    a[2] = z;
    a[3] = w;
}

float sumv4(v4fs *v) {
    float *a = (float*) v;
    return a[0] + a[1] + a[2] + a[3];
}

void vecmult(v4fs *MAYBE_RESTRICT s, v4fs *MAYBE_RESTRICT d, v4fs a, v4fs m) {
    for (int j = 0; j < N; j++) {
        d[j] = s[j] * m + a;
    }
}

void scamult(float *MAYBE_RESTRICT s, float *MAYBE_RESTRICT d,
             Vector4 a, Vector4 m) {
    for (int j = 0; j < (N*4); j+=4) {
        d[j+0] = s[j+0] * m.x + a.x;
        d[j+1] = s[j+1] * m.y + a.y;
        d[j+2] = s[j+2] * m.z + a.z;
        d[j+3] = s[j+3] * m.w + a.w;
    }
}

int main () {
    v4fs a, m;
    v4fs *s, *d;

    setv4(&a, 0.2, 0.1, 0.6, 1.0);
    setv4(&m, 0.99, 0.7, 0.8, 0.6);

    s = calloc(N, sizeof(v4fs));
    d = s;

    double start = clock();
    for (int i = 0; i < M; i++) {

#ifdef COPY
        d = malloc(N * sizeof(v4fs));
#endif

#ifdef VECTOR
        vecmult(s, d, a, m);
#else
        Vector4 aa = *(Vector4*)(&a);
        Vector4 mm = *(Vector4*)(&m);
        scamult((float*)s, (float*)d, aa, mm);
#endif

#ifdef COPY
        free(s);
        s = d;
#endif
    }
    double end = clock();

    float sum = 0;
    for (int j = 0; j < N; j++) {
        sum += sumv4(s+j);
    }
    printf("%-50s %2.5f %f\n\n", NAME,
           (end - start) / (double) CLOCKS_PER_SEC, sum);
}

This script compiles and runs the tests with a number of gcc flag combinations. The best performance on my system came from cmath-64-native-O3-restrict-vector-nocopy, which took 0.22 seconds.
import System.Exit (ExitCode(..))  -- provides ExitSuccess
import System.Process

cBase = ("cmath", "gcc mult.c -ggdb --std=c99 -DM=10000 -DN=20000")
cOptions = [
            [("32", "-m32"), ("64", "-m64")],
            [("generic", ""), ("native", "-march=native -msse4")],
            [("O1", "-O1"), ("O2", "-O2"), ("O3", "-O3")],
            [("restrict", "-DMAYBE_RESTRICT=__restrict__"),
             ("norestrict", "-DMAYBE_RESTRICT=")],
            [("vector", "-DVECTOR"), ("scalar", "")],
            [("copy", "-DCOPY"), ("nocopy", "")]
           ]

-- Fold over the Cartesian product of the double list. Probably a Prelude function
-- or two that does this, but hey. The 'perm' referred to permutations until I realized
-- that this wasn't actually doing permutations.
permfold :: (a -> a -> a) -> a -> [[a]] -> [a]
permfold f z [] = [z]
permfold f z (x:xs) = concat $ map (\a -> (permfold f (f z a) xs)) x

-- Append one option's name and flag to the name and command built so far.
prepCmd :: (String, String) -> (String, String) -> (String, String)
prepCmd (name, cmd) (namea, cmda) =
    (name ++ "-" ++ namea, cmd ++ " " ++ cmda)

-- Compile one configuration and, if that succeeds, run the resulting binary.
runCCmd name compileCmd = do
    res <- system (compileCmd ++ " -DNAME=\\\"" ++ name ++ "\\\" -o " ++ name)
    if res == ExitSuccess
        then do system ("./" ++ name)
                return ()
        else putStrLn $ name ++ " did not compile"

main = do
    mapM_ (uncurry runCCmd) $ permfold prepCmd cBase cOptions
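On the comment in the script that there is "probably a Prelude function or two that does this": one possible formulation, offered as a sketch rather than as what the author had in mind, is to enumerate the Cartesian product with `sequence` and then `foldl` over each combination. `cartesianFold` is a hypothetical name; `permfold` is repeated from the script only so the two can be compared side by side.

```haskell
-- | Fold f over every combination that takes one element from each
-- inner list, in the order in which 'sequence' enumerates them.
cartesianFold :: (a -> a -> a) -> a -> [[a]] -> [a]
cartesianFold f z = map (foldl f z) . sequence

-- | The definition from the driver script, repeated for comparison.
permfold :: (a -> a -> a) -> a -> [[a]] -> [a]
permfold _ z []     = [z]
permfold f z (x:xs) = concatMap (\a -> permfold f (f z a) xs) x
```

With the option lists from the script this yields 2 * 2 * 3 * 2 * 2 * 2 = 96 configurations, one per combination of flags.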

Best answer

Roman Leshchinskiy responds:

Actually, the core looks mostly ok to me. Using unsafeIndex instead of (!) makes the program more than twice as fast (see my answer above). The program below is much faster, though (and cleaner, IMO). I suspect the remaining difference between this and the C program is due to GHC's general suckiness when it comes to floating point. The HEAD produces the best results with the NCG and -msse2



First, define a new Vec4 data type:
{-# LANGUAGE BangPatterns #-}

import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign
import Foreign.C.Types

-- Define a 4 element vector type
data Vec4 = Vec4 {-# UNPACK #-} !CFloat
                 {-# UNPACK #-} !CFloat
                 {-# UNPACK #-} !CFloat
                 {-# UNPACK #-} !CFloat

Make sure we can store it in an array:
instance Storable Vec4 where
  sizeOf _ = sizeOf (undefined :: CFloat) * 4
  alignment _ = alignment (undefined :: CFloat)

  {-# INLINE peek #-}
  peek p = do
             a <- peekElemOff q 0
             b <- peekElemOff q 1
             c <- peekElemOff q 2
             d <- peekElemOff q 3
             return (Vec4 a b c d)
    where
      q = castPtr p
  {-# INLINE poke #-}
  poke p (Vec4 a b c d) = do
             pokeElemOff q 0 a
             pokeElemOff q 1 b
             pokeElemOff q 2 c
             pokeElemOff q 3 d
    where
      q = castPtr p
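The point of a Storable instance like the one above is that an array of Vec4 is laid out in memory as consecutive floats, which is the flat C array the question needs. The sketch below checks that using only `Foreign` from `base`. It repeats the type and instance so it stands alone, leaves out the UNPACK and INLINE pragmas for brevity, and derives Eq and Show purely for the demonstration; `flatten` and `roundTrip` are hypothetical helper names.

```haskell
import Foreign
import Foreign.C.Types (CFloat)

data Vec4 = Vec4 !CFloat !CFloat !CFloat !CFloat
  deriving (Eq, Show)

instance Storable Vec4 where
  sizeOf _    = sizeOf (undefined :: CFloat) * 4
  alignment _ = alignment (undefined :: CFloat)

  peek p = do
    let q = castPtr p :: Ptr CFloat
    a <- peekElemOff q 0
    b <- peekElemOff q 1
    c <- peekElemOff q 2
    d <- peekElemOff q 3
    return (Vec4 a b c d)

  poke p (Vec4 a b c d) = do
    let q = castPtr p :: Ptr CFloat
    pokeElemOff q 0 a
    pokeElemOff q 1 b
    pokeElemOff q 2 c
    pokeElemOff q 3 d

-- | Marshal a list of Vec4 into a temporary C array and read the same
-- memory back as raw floats, four per vector.
flatten :: [Vec4] -> IO [CFloat]
flatten vs =
  withArrayLen vs $ \n p ->
    peekArray (n * 4) (castPtr p :: Ptr CFloat)

-- | Marshal a list of Vec4 out to a C array and back in again.
roundTrip :: [Vec4] -> IO [Vec4]
roundTrip vs = withArrayLen vs $ \n p -> peekArray n p
```

Here `flatten [Vec4 1 2 3 4, Vec4 5 6 7 8]` reads back the eight floats 1 through 8 in order.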

Values and methods on this type:
a = Vec4 0.2 0.1 0.6 1.0
m = Vec4 0.99 0.7 0.8 0.6

add :: Vec4 -> Vec4 -> Vec4
{-# INLINE add #-}
add (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a+a') (b+b') (c+c') (d+d')

mult :: Vec4 -> Vec4 -> Vec4
{-# INLINE mult #-}
mult (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a*a') (b*b') (c*c') (d*d')

vsum :: Vec4 -> CFloat
{-# INLINE vsum #-}
vsum (Vec4 a b c d) = a+b+c+d

multList :: Int -> Vector Vec4 -> Vector Vec4
multList !count !src
    | count <= 0 = src
    | otherwise  = multList (count-1) $ V.map (\v -> add (mult v m) a) src

main = do
    print $ Data.Vector.Storable.sum
          $ Data.Vector.Storable.map vsum
          $ multList repCount
          $ Data.Vector.Storable.replicate arraySize (Vec4 0 0 0 0)

repCount, arraySize :: Int
repCount = 10000
arraySize = 20000

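With ghc 6.12.1, -O2 -fasm:

  • 1.752

With ghc HEAD (June 26), -O2 -fasm -msse2:

  • 1.708

This looks like the most idiomatic way to write a Vec4 array, and it gets the best performance (11x faster than your original). (This might also become a benchmark for GHC's LLVM backend.)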

Regarding performance - Haskell math performance on multiply-add operation, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/3115540/
