performance - 分析 Haskell 程序-6ren

performance - 分析 Haskell 程序

转载作者：行者123 更新时间：2023-12-03 11:25:17

我有一段代码使用 sequence 从概率分布中重复采样。从道德上讲，它做了这样的事情:

sampleMean :: MonadRandom m => Int -> m Float -> m Float
sampleMean n dist = do
  xs <- sequence (replicate n dist)
  return (sum xs)

除了它有点复杂。我感兴趣的实际代码是 this Github repo 的函数 likelihoodWeighting 。

我注意到运行时间与 n 呈非线性关系。特别是，一旦 n 超过某个值，它就会达到内存限制，并且运行时间会爆炸。我不确定，但我认为这是因为 sequence 正在建立一个长长的 thunk 列表，这些 thunk 直到调用 sum 才会被评估。

一旦我超过了大约 100,000 个样本，程序就会慢下来。我想对此进行优化(我的感觉是 1000 万个样本不应该成为问题)所以我决定对其进行分析 - 但我在理解分析器的输出时遇到了一些麻烦。

分析

我在 main.hs 文件中创建了一个简短的可执行文件，它使用 100,000 个样本运行我的函数。这是做的输出

$ ghc -O2 -rtsopts main.hs
$ ./main +RTS -s

我注意到的第一件事 - 它分配了近 1.5 GB 的堆，并将 60% 的时间用于垃圾收集。这通常表明过于懒惰吗？

 1,377,538,232 bytes allocated in the heap
 1,195,050,032 bytes copied during GC
   169,411,368 bytes maximum residency (12 sample(s))
     7,360,232 bytes maximum slop
           423 MB total memory in use (0 MB lost due to fragmentation)

Generation 0:  2574 collections,     0 parallel,  2.40s,  2.43s elapsed
Generation 1:    12 collections,     0 parallel,  1.07s,  1.28s elapsed

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    1.92s  (  1.94s elapsed)
GC    time    3.47s  (  3.70s elapsed)
RP    time    0.00s  (  0.00s elapsed)
PROF  time    0.23s  (  0.23s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    5.63s  (  5.87s elapsed)

%GC time      61.8%  (63.1% elapsed)

Alloc rate    716,368,278 bytes per MUT second

Productivity  34.2% of total user, 32.7% of total elapsed

以下是来自的结果

$ ./main +RTS -p

第一次运行时，发现有一个函数被重复调用，结果我可以记住它，这使速度提高了 2 倍。然而，它并没有解决空间泄漏问题。

COST CENTRE           MODULE                no. entries  %time %alloc   %time %alloc

MAIN                  MAIN                    1        0   0.0    0.0   100.0  100.0
 main                 Main                  434        4   0.0    0.0   100.0  100.0
  likelihoodWeighting AI.Probability.Bayes  445        1   0.0    0.3   100.0  100.0
   distributionLW     AI.Probability.Bayes  448        1   0.0    2.6     0.0    2.6
   getSampleLW        AI.Probability.Bayes  446   100000  20.0   50.4   100.0   97.1
    bnProb            AI.Probability.Bayes  458   400000   0.0    0.0     0.0    0.0
    bnCond            AI.Probability.Bayes  457   400000   6.7    0.8     6.7    0.8
    bnVals            AI.Probability.Bayes  455   400000  20.0    6.3    26.7    7.1
     bnParents        AI.Probability.Bayes  456   400000   6.7    0.8     6.7    0.8
    bnSubRef          AI.Probability.Bayes  454   800000  13.3   13.5    13.3   13.5
    weightedSample    AI.Probability.Bayes  447   100000  26.7   23.9    33.3   25.3
     bnProb           AI.Probability.Bayes  453   100000   0.0    0.0     0.0    0.0
     bnCond           AI.Probability.Bayes  452   100000   0.0    0.2     0.0    0.2
     bnVals           AI.Probability.Bayes  450   100000   0.0    0.3     6.7    0.5
      bnParents       AI.Probability.Bayes  451   100000   6.7    0.2     6.7    0.2
     bnSubRef         AI.Probability.Bayes  449   200000   0.0    0.7     0.0    0.7

这是一个堆配置文件。我不知道为什么它声称运行时间是 1.8 秒——这次运行大约需要 6 秒。

enter image description here

谁能帮我解释分析器的输出 - 即确定瓶颈在哪里，并就如何加快速度提供建议？

最佳答案

通过合并 JohnL's suggestion 已经实现了巨大的改进。使用 foldM在 likelihoodWeighting .这将这里的内存使用量减少了大约十倍，并将 GC 时间显着降低到几乎或实际上可以忽略不计。

使用当前源运行分析会产生

probabilityIO              AI.Util.Util          26.1   42.4    413 290400000
weightedSample.go          AI.Probability.Bayes  16.1   19.1    255 131200080
bnParents                  AI.Probability.Bayes  10.8    1.2    171   8000384
bnVals                     AI.Probability.Bayes  10.4    7.8    164  53603072
bnCond                     AI.Probability.Bayes   7.9    1.2    125   8000384
ndSubRef                   AI.Util.Array          4.8    9.2     76  63204112
bnSubRef                   AI.Probability.Bayes   4.7    8.1     75  55203072
likelihoodWeighting.func   AI.Probability.Bayes   3.3    2.8     53  19195128
%!                         AI.Util.Util           3.3    0.5     53   3200000
bnProb                     AI.Probability.Bayes   2.5    0.0     40        16
bnProb.p                   AI.Probability.Bayes   2.5    3.5     40  24001152
likelihoodWeighting        AI.Probability.Bayes   2.5    2.9     39  20000264
likelihoodWeighting.func.x AI.Probability.Bayes   2.3    0.2     37   1600000

-s 报告的内存使用量为 13MB , ~5MB 最大驻留。这已经不算太糟糕了。

尽管如此，我们仍有一些可以改进的地方。首先，一个相对较小的事情，在宏伟的计划中， AI.UTIl.Array.ndSubRef :

ndSubRef :: [Int] -> Int
ndSubRef ns = sum $ zipWith (*) (reverse ns) (map (2^) [0..])

反转列表，映射 (2^)在另一个列表上效率低下，更好的是

ndSubRef = L.foldl' (\a d -> 2*a + d) 0

它不需要将整个列表保留在内存中(可能不是什么大问题，因为列表会很短)，因为它可以反转它，并且不需要分配第二个列表。分配的减少是显着的，大约 10%，而且这部分运行得更快，

ndSubRef                   AI.Util.Array          1.7    1.3     24   8000384

在修改运行的profile中，但由于只占用了整体时间的一小部分，所以整体影响很小。 weightedSample 中可能有更大的鱼要炸和 likelihoodWeighting .

让我们在 weightedSample 中添加一点严格性。看看这如何改变事情:

weightedSample :: Ord e => BayesNet e -> [(e,Bool)] -> IO (Map e Bool, Prob)
weightedSample bn fixed =
    go 1.0 (M.fromList fixed) (bnVars bn)
    where
        go w assignment []     = return (assignment, w)
        go w assignment (v:vs) = if v `elem` vars
            then
                let w' = w * bnProb bn assignment (v, fixed %! v)
                in go w' assignment vs
            else do
                let p = bnProb bn assignment (v,True)
                x <- probabilityIO p
                go w (M.insert v x assignment) vs

        vars = map fst fixed

go的权重参数从不强制，赋值参数也不是，因此它们可以建立 thunk。让我们启用 {-# LANGUAGE BangPatterns #-}并强制更新立即生效，同时评估 p在传递给 probabilityIO 之前:

go w assignment (v:vs) = if v `elem` vars
    then
        let !w' = w * bnProb bn assignment (v, fixed %! v)
        in go w' assignment vs
    else do
        let !p = bnProb bn assignment (v,True)
        x <- probabilityIO p
        let !assignment' = M.insert v x assignment
        go w assignment' vs

这带来了分配的进一步减少 (~9%) 和小幅加速 (~%13%)，但总内存使用量和最大驻留没有太大变化。

我看不出还有什么明显的变化，所以让我们看看 likelihoodWeighting :

func m _ = do
    (a, w) <- weightedSample bn fixed
    let x = a ! e
    return $! x `seq` w `seq` M.adjust (+w) x m

在最后一行，首先， w已在 weightedSample 中评估现在，所以我们不需要 seq它在这里，关键 x需要评估更新后的 map ，所以 seq这也没有必要。那条线上的坏事是 M.adjust . adjust无法强制更新函数的结果，因此会在 map 的值中构建 thunk。您可以通过查找修改后的值并强制执行对 thunk 的评估，但是 Data.Map这里提供了一种更方便的方法，因为保证 map 更新的键存在， insertWith' :

func !m _ = do
    (a, w) <- weightedSample bn fixed
    let x = a ! e
    return (M.insertWith' (+) x w m)

(注意:GHC 在 m 上使用 bang-pattern 比在 return $! ... 上优化更好)。这略微减少了总分配并且不会显着改变运行时间，但对使用的总内存和最大驻留有很大影响:

 934,566,488 bytes allocated in the heap
   1,441,744 bytes copied during GC
      68,112 bytes maximum residency (1 sample(s))
      23,272 bytes maximum slop
           1 MB total memory in use (0 MB lost due to fragmentation)

最大的运行时间改进是避免 randomIO。 , 使用的 StdGen很慢。

我很惊讶 bn* 花了多少时间功能采取，但看不到任何明显的低效率。

关于performance - 分析 Haskell 程序，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/11594481/

文章推荐： javascript - prop `store.subscribe` 标记为必填

文章推荐： linq - Linq to NHibernate 是否不再活跃？

performance - "performant"软件究竟是什么意思？
关闭。这个问题是opinion-based .它目前不接受答案。想改善这个问题吗？更新问题，以便可以通过 editing this post 用事实和引文回答问题. 8年前关闭。 Improve t
performance - 灿灿授权: Performance Issue
暂时忘记能力的定义，只关注能力的“检查”(使用“授权!”)，我看到 CanCan 添加了大约 400 毫秒，用于简单地检查用户是否具有特定的能力主题/模型。这是预期的吗(我假设不是)？或者，有没有可
performance - Swift 显式与推断类型 : Performance
我正在阅读有关 Swift 的教程 ( http://www.raywenderlich.com/74438/swift-tutorial-a-quick-start )，它预定义为不显式设置类型，因
performance - 编码优先级 : Performance, 可维护性、可重用性？
这主要是由于对 SQL 问题的回答。由于性能原因，有意省略了 UDF 和子查询。我没有包括可靠性并不是说它应该被视为理所当然，但代码必须工作。性能永远是第一位的吗？提供了许多以性能为主要优先事项的答
performance - Scala递归与循环: performance and runtime considerations
我已经编写了一个简单的测试平台来测量三种阶乘实现的性能:基于循环的，非尾递归的和尾递归的。 Surprisingly to me the worst performant was the loop o
performance - ui-performance 插件无法在开发模式下工作 (Grails)
我已将 ui-performance 插件应用到我的应用程序中。不幸的是，在开发模式下运行应用程序时它似乎不起作用。例如，我的 javascript 导入是用“vnull”版本呈现的。例如不会
performance - 编译 F# 引用 : performance?
我有一个我操作的 F# 引用(我在各处添加对象池以回收经常创建和删除的短期对象)。我想运行结果报价；现在我使用了 F# PowerPack，它提供了将引用转换为表达式树和委托(delegate)的方法
performance - Spark独立: SparklyR : Performance issues
我正在尝试在 Spark 服务器上运行 SparklyR 库中的机器学习算法。 1 个簇 8 核 24G内存 Ubuntu 16.04 星火2.2 独立配置 1名师傅/2名 worker 每个执行器的
performance - 架构和索引以及主键 : Differences in lookup performance?
我有一个数据库(准确地说是在 postgres 上运行)，具有以下结构: user1 (schema) | - cars (table) - airplanes (table, again) .
performance - iOS/核心动画 : Performance tuning
我的应用程序在我的 iPad 上运行。但它的表现非常糟糕——我的速度低于 15fps。谁能帮我优化一下？它基本上是一个轮子(派生自 UIView)，包含 12 个按钮(派生自 UIControl)。
performance - coursera progfun1 : scala union performance
在完成“Scala 中的函数式编程原则”@coursera 类(class)第 3 周的作业时，我发现当我实现视频类(class)中所示的函数联合时: override def union(tha
performance - Symfony2 依赖注入(inject) : performances impact
我正在重构我的一个 Controller 以使其成为一项服务，我想知道不将整个服务容器注入(inject)我的 Controller 是否会对性能产生影响。这样效率更高吗: innova.path.
performance - facelet tag performance
我有一个要显示的内容很大的文件。例如在显示用户配置文件时，中的每个 EL 表达式需要一个 userId 作为 bean 的参数，该参数取自 session 上下文。我在 xhtml 文件中将这个 u
performance - OpenGL/DirectX : How does Mipmapping improve performance?
我非常了解 mipmapping。我不明白(在硬件/驱动程序级别)是 mipmapping 如何提高应用程序的性能(至少这是经常声称的)。在执行片段着色器之前，驱动程序不知道要访问哪个 mipmap
performance - Scala 惰性值 : performance penalty? 线程安全？
这个问题在这里已经有了答案: 10年前关闭。 Possible Duplicate: What's the (hidden) cost of lazy val? (Scala) Scala 允许定义惰
java - build().perform() 和 Perform() 之间有什么区别
一些文章建议现在 build() 包含在 perform() 本身中，而其他人则建议当要链接多个操作时使用 build().perform()一起。最佳答案 build() 包含在 perform(
performance - postgres 函数 : when does IMMUTABLE hurt performance?
Postgres docs说 For best optimization results, you should label your functions with the strictest vol
performance - 零成本抽象 : performance of for-loop vs. 迭代器
阅读Zero-cost abstractions看着 Introduction to rust: a low-level language with high-level abstractions我尝
performance - MQ : CPU Performance 上的 SSL
我想在 MQ 服务器上部署 SSL，但我想知道我当前的 CPU 容量是否支持 SSL。 (我没有预算增加 CPU 内核和 MQ PVU 的数量) 我的规范: Windows 2003 服务器 SP2，
performance - Chrome Performance Profiler 中的“Timings”选项卡丢失
因此，我在 Chrome 开发者工具的性能选项卡内的时间部分成功地监控了我的 React Native 应用程序的性能。突然在应用程序的特定重新加载时，Timings 标签丢失。我已尝试重置

行者123

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

performance - 分析 Haskell 程序