c - Why is there overhead when calling Haskell functions from C?

Reposted by: 太空狗. Updated: 2023-10-29 16:31:03

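I've noticed that calling Haskell functions from C carries a large overhead, much larger than that of a native C function call. To distill the issue to its essence, I wrote a program that does nothing but initialize the Haskell runtime, run a loop that calls an empty function 100,000,000 times, and return.

With the function inlined, the program takes 0.003 s. Calling an empty function written in C takes 0.18 s. Calling an empty function written in Haskell takes 15.5 s. (Strangely, if I compile the empty Haskell file separately before linking, it takes a few seconds longer. Sub-question: why is that?)

So there appears to be a slowdown of roughly 100x between calling a C function and calling a Haskell function. What is the reason for this, and is there a way to mitigate the slowdown?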

Code

EDIT: I've discovered a version of this test in the NoFib benchmark suite, callback002. There's a nice blog post by Edward Z. Yang mentioning this test in the context of the GHC scheduler. I'm still trying to grok this blog post along with Zeta's very nice answer. I'm not yet convinced that there's not a way to do this faster!

To compile the "slow" Haskell version, run

ghc -no-hs-main -O2 -optc-O3 test.c Test.hs -o test

To compile the "fast" C version, run

ghc -no-hs-main -O2 -optc-O3 test.c test2.c TestDummy.hs -o test

test.c:

#include "HsFFI.h"
extern void __stginit_Test(void);

extern void test();

int main(int argc, char *argv[]) {
  hs_init(&argc, &argv);
  hs_add_root(__stginit_Test);
  int i;
  for (i = 0; i < 100000000; i++) {
    test();
  }
  hs_exit();
  return 0;
}

test2.c:

void test() {
}

Test.hs:

{-# LANGUAGE ForeignFunctionInterface #-}

module Test where

foreign export ccall test :: ()

test :: ()
test = ()

TestDummy.hs:

module Test where

Best answer

TL;DR: Reason: RTS and STG calls. Solution: don't call trivial Haskell functions from C.

What is the reason for this…?

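Disclaimer: I have never called Haskell from C. I'm familiar with both C and Haskell, but I rarely interweave the two unless I'm writing a wrapper. Now that I've lost my credibility, let's start this adventure of benchmarking, disassembling and other nifty horrors.

Benchmarking with gprof

One easy way to check what is eating up all your resources is to use gprof. We'll change your compile line slightly so that both the compiler and the linker use -pg (note: I've renamed test.c to main.c and test2.c to test.c):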

$ ghc -no-hs-main -O2 -optc-O3 -optc-pg -optl-pg -fforce-recomp \
    main.c Test.hs -o test
$ ./test
$ gprof ./test

This gives us the following (flat) profile:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 16.85      2.15     2.15                             scheduleWaitThread
 11.78      3.65     1.50                             createStrictIOThread
  7.66      4.62     0.98                             createThread
  6.68      5.47     0.85                             allocate
  5.66      6.19     0.72                             traverseWeakPtrList
  5.34      6.87     0.68                             isAlive
  4.12      7.40     0.53                             newBoundTask
  3.06      7.79     0.39                             stg_ap_p_fast
  2.36      8.09     0.30                             stg_ap_v_info
  1.96      8.34     0.25                             stg_ap_0_fast
  1.85      8.57     0.24                             rts_checkSchedStatus
  1.81      8.80     0.23                             stg_PAP_apply
  1.73      9.02     0.22                             rts_apply
  1.73      9.24     0.22                             stg_enter_info
  1.65      9.45     0.21                             stg_stop_thread_info
  1.61      9.66     0.21                             test
  1.49      9.85     0.19                             stg_returnToStackTop
  1.49     10.04     0.19                             move_STACK
  1.49     10.23     0.19                             stg_ap_v_fast
  1.41     10.41     0.18                             rts_lock
  1.18     10.56     0.15                             boundTaskExiting
  1.10     10.70     0.14                             StgRun
  0.98     10.82     0.13                             rts_evalIO
  0.94     10.94     0.12                             stg_upd_frame_info
  0.79     11.04     0.10                             blockedThrowTo
  0.67     11.13     0.09                             StgReturn
  0.63     11.21     0.08                             createIOThread
  0.63     11.29     0.08                             stg_bh_upd_frame_info
  0.63     11.37     0.08                             c5KU_info
  0.55     11.44     0.07                             stg_stk_save_n
  0.51     11.50     0.07                             threadPaused
  0.47     11.56     0.06                             dirty_TSO
  0.47     11.62     0.06                             ghczmprim_GHCziCString_unpackCStringzh_info
  0.47     11.68     0.06                             stopHeapProfTimer
  0.39     11.73     0.05                             stg_threadFinished
  0.39     11.78     0.05                             allocGroup
  0.39     11.83     0.05                             base_GHCziTopHandler_runNonIO1_info
  0.39     11.88     0.05                             stg_catchzh
  0.35     11.93     0.05                             freeMyTask
  0.35     11.97     0.05                             rts_eval_
  0.31     12.01     0.04                             awakenBlockedExceptionQueue
  0.31     12.05     0.04                             stg_ap_2_upd_info
  0.27     12.09     0.04                             s5q4_info
  0.24     12.12     0.03                             markStableTables
  0.24     12.15     0.03                             rts_getSchedStatus
  0.24     12.18     0.03                             s5q3_info
  0.24     12.21     0.03                             scavenge_stack
  0.24     12.24     0.03                             stg_ap_7_upd_info
  0.24     12.27     0.03                             stg_ap_n_fast
  0.24     12.30     0.03                             stg_gc_noregs
  0.20     12.32     0.03                             base_GHCziTopHandler_runIO1_info
  0.20     12.35     0.03                             stat_exit
  0.16     12.37     0.02                             GarbageCollect
  0.16     12.39     0.02                             dirty_STACK
  0.16     12.41     0.02                             freeGcThreads
  0.16     12.43     0.02                             rts_mkString
  0.16     12.45     0.02                             scavenge_capability_mut_lists
  0.16     12.47     0.02                             startProfTimer
  0.16     12.49     0.02                             stg_PAP_info
  0.16     12.51     0.02                             stg_ap_stk_p
  0.16     12.53     0.02                             stg_catch_info
  0.16     12.55     0.02                             stg_killMyself
  0.16     12.57     0.02                             stg_marked_upd_frame_info
  0.12     12.58     0.02                             interruptAllCapabilities
  0.12     12.60     0.02                             scheduleThreadOn
  0.12     12.61     0.02                             waitForReturnCapability
  0.08     12.62     0.01                             exitStorage
  0.08     12.63     0.01                             freeWSDeque
  0.08     12.64     0.01                             gcStableTables
  0.08     12.65     0.01                             resetTerminalSettings
  0.08     12.66     0.01                             resizeNurseriesEach
  0.08     12.67     0.01                             scavenge_loop
  0.08     12.68     0.01                             split_free_block
  0.08     12.69     0.01                             startHeapProfTimer
  0.08     12.70     0.01                             stg_MVAR_TSO_QUEUE_info
  0.08     12.71     0.01                             stg_forceIO_info
  0.08     12.72     0.01                             zero_static_object_list
  0.04     12.73     0.01                             frame_dummy
  0.04     12.73     0.01                             rts_evalLazyIO_
  0.00     12.73     0.00        1     0.00     0.00  stginit_export_Test_zdfstableZZC0ZZCmainZZCTestZZCtest

Woah, that's a bunch of functions getting called. How does this compare to your C version?

$ ghc -no-hs-main -O2 -optc-O3 -optc-pg -optl-pg -fforce-recomp \
    main.c TestDummy.hs test.c -o test_c
$ ./test_c
$ gprof ./test_c

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 75.00      0.05     0.05                             test
 25.00      0.06     0.02                             frame_dummy

That's a lot simpler. But why?

What's happening behind?

Maybe you've wondered why test even showed up in the previous profile. Well, gprof itself adds some overhead, as can be seen with objdump:

$ objdump -D ./test_c | grep -A5 "<test>:"

0000000000405630 <test>:
  405630:   55                      push   %rbp
  405631:   48 89 e5                mov    %rsp,%rbp
  405634:   e8 f7 d4 ff ff          callq  402b30 <mcount@plt>
  405639:   5d                      pop    %rbp
  40563a:   c3                      retq

The call to mcount was added by gcc. So for the next part, you want to drop the -pg options. Let's first check the disassembled test routine in C:

$ ghc -no-hs-main -O2 -optc-O3 -fforce-recomp \
    main.c TestDummy.hs test.c -o test_c
$ objdump -D ./test_c | grep -A2 "<test>:"

0000000000405510 <test>:
  405510:   f3 c3                   repz retq

The repz retq is actually some optimisation magic, but in this case you can think of it as a (mostly) no-op return.

How does this compare to the Haskell version?

$ ghc -no-hs-main -O2 -optc-O3 -fforce-recomp \
    main.c Test.hs -o test_hs
$ objdump -D ./Test.o | grep -A18 "<test>:"

0000000000405520 <test>:
  405520:   48 83 ec 18             sub    $0x18,%rsp
  405524:   e8 f7 3a 05 00          callq  459020 <rts_lock>
  405529:   ba 58 24 6b 00          mov    $0x6b2458,%edx
  40552e:   be 80 28 6b 00          mov    $0x6b2880,%esi
  405533:   48 89 c7                mov    %rax,%rdi
  405536:   48 89 04 24             mov    %rax,(%rsp)
  40553a:   e8 51 36 05 00          callq  458b90 <rts_apply>
  40553f:   48 8d 54 24 08          lea    0x8(%rsp),%rdx
  405544:   48 89 c6                mov    %rax,%rsi
  405547:   48 89 e7                mov    %rsp,%rdi
  40554a:   e8 01 39 05 00          callq  458e50 <rts_evalIO>
  40554f:   48 8b 34 24             mov    (%rsp),%rsi
  405553:   bf 64 57 48 00          mov    $0x485764,%edi
  405558:   e8 23 3a 05 00          callq  458f80 <rts_checkSchedStatus>
  40555d:   48 8b 3c 24             mov    (%rsp),%rdi
  405561:   e8 0a 3b 05 00          callq  459070 <rts_unlock>
  405566:   48 83 c4 18             add    $0x18,%rsp
  40556a:   c3                      retq   
  40556b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  405570:   d8 ce                   fmul   %st(6),%st

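This looks rather different. Indeed, the RTS functions seem suspicious. Let's have a look at them:

rts_checkSchedStatus just checks whether the status is OK and exits otherwise. The Success path doesn't have much overhead, so this function isn't really the culprit.
rts_lock and rts_unlock basically claim and release a capability (a virtual CPU). They call newBoundTask and boundTaskExiting, which take some time (see the profile above).
rts_apply calls allocate, one of the most-used functions in the whole program (which isn't surprising, since Haskell is garbage collected).
rts_evalIO finally creates the actual thread and waits for its completion. So we can estimate that rts_evalIO alone accounts for about 27%.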
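The order of RTS calls in the disassembled wrapper can be modeled in plain C. This is only a sketch: both types and every `mock_` function are stand-ins invented here to record call order, and the argument shapes are read off the register usage in the listing, not copied from the RTS headers.

```c
#include <string.h>

/* Mock types: stand-ins for whatever the real RTS uses. */
typedef struct { int id; } Capability;
typedef void *HaskellObj;

static char call_trace[128];   /* records the order of mock calls */
static Capability mock_cap;    /* the single "capability" handed out */
static int mock_closure;       /* stands in for a Haskell heap object */

static void record(const char *name) {
    strncat(call_trace, name, sizeof call_trace - strlen(call_trace) - 1);
    strncat(call_trace, " ", sizeof call_trace - strlen(call_trace) - 1);
}

/* Mocks of the five calls named in the disassembly (hypothetical bodies). */
static Capability *mock_rts_lock(void) { record("rts_lock"); return &mock_cap; }

static HaskellObj mock_rts_apply(Capability *cap, HaskellObj f, HaskellObj arg) {
    (void)cap; (void)f; (void)arg;
    record("rts_apply");
    return &mock_closure;
}

static void mock_rts_evalIO(Capability **cap, HaskellObj p, HaskellObj *ret) {
    (void)cap;
    record("rts_evalIO");
    *ret = p;
}

static void mock_rts_checkSchedStatus(const char *site, Capability *cap) {
    (void)site; (void)cap;
    record("rts_checkSchedStatus");
}

static void mock_rts_unlock(Capability *cap) { (void)cap; record("rts_unlock"); }

/* Mirrors the call order of the disassembled `test` wrapper. */
void test_wrapper_model(void) {
    HaskellObj ret;
    call_trace[0] = '\0';
    Capability *cap = mock_rts_lock();
    HaskellObj thunk = mock_rts_apply(cap, &mock_closure, &mock_closure);
    mock_rts_evalIO(&cap, thunk, &ret);
    mock_rts_checkSchedStatus("test", cap);
    mock_rts_unlock(cap);
}

/* Returns the recorded call order as a space-separated string. */
const char *wrapper_trace(void) { return call_trace; }
```

Every call to `test` from C walks this whole sequence, even though the Haskell value it evaluates is just `()`.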

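So we've found all the functions that have been eating up the time. The STG and the RTS are responsible for the roughly 150 ns of overhead per call.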

…and is there a way to mitigate this slowdown?

Well, your test is basically a no-op. You call it 100000000 times, with a total runtime of 15 s. Compared to the C version, that is an overhead of about 149 ns per call.
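As a quick check of that figure, the arithmetic can be written out in C. The helper name `per_call_ns` is made up for this sketch; the inputs are the 15 s and 0.18 s totals and the 100,000,000 calls quoted in this thread.

```c
/* Average cost of one call in nanoseconds, given the total run time in
 * seconds and the number of calls made. Hypothetical helper. */
double per_call_ns(double total_seconds, long calls) {
    return total_seconds * 1e9 / (double)calls;
}

/* Extra cost per call of the slow variant relative to the fast one. */
double overhead_ns(double slow_seconds, double fast_seconds, long calls) {
    return per_call_ns(slow_seconds, calls) - per_call_ns(fast_seconds, calls);
}
```

With these inputs, `per_call_ns(15.0, 100000000L)` is 150 ns, `per_call_ns(0.18, 100000000L)` is 1.8 ns, and the difference is roughly 148 ns, in line with the figure above.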

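The solution is fairly simple: don't use Haskell functions in the C world for trivial tasks. Use the right tool for the right situation. After all, you wouldn't use the GMP library to add two numbers that are guaranteed to be less than 10.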

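Apart from this paradigmatic solution: no. The assembly shown above is what GHC generates, and at the moment there is no way to produce a variant without the RTS calls.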
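To put "trivial" into numbers, one can ask how much useful work a single call must do before a fixed crossing cost becomes negligible. The following is a minimal sketch, assuming the roughly 150 ns per-call cost measured above stays constant; `min_work_ns` is a hypothetical helper, not part of any library.

```c
/* Minimum useful work per call (in ns) so that a fixed per-call overhead
 * makes up at most `max_fraction` of the total time per call.
 * Derived from overhead / (overhead + work) <= max_fraction.
 * Returns -1.0 if max_fraction is not strictly between 0 and 1. */
double min_work_ns(double overhead_ns, double max_fraction) {
    if (max_fraction <= 0.0 || max_fraction >= 1.0) {
        return -1.0;
    }
    return overhead_ns * (1.0 - max_fraction) / max_fraction;
}
```

Under that assumption, `min_work_ns(150.0, 0.01)` comes out at about 14850 ns: a Haskell function called from C would need to do roughly 15 microseconds of work per call for the crossing cost to stay at 1% of the total.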

Regarding "c - Why is there overhead when calling Haskell functions from C?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31711599/
