haskell - GHC 7.10 generates slower code than older versions

Reposted by: 行者123 · Updated: 2023-12-03 12:52:10

I have noticed that the latest version of GHC (7.10.3) generates significantly slower code than older versions. My current version:

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.3

I also have two older versions installed on my local machine.

My test code is taken from here (the collatz1.hs code):

import Data.Word
import Data.List
import System.Environment

collatzNext :: Word32 -> Word32
collatzNext a = (if even a then a else 3*a+1) `div` 2

-- new code
collatzLen :: Word32 -> Int
collatzLen a0 = lenIterWhile collatzNext (/= 1) a0

lenIterWhile :: (a -> a) -> (a -> Bool) -> a -> Int
lenIterWhile next notDone start = len start 0 where
    len n m = if notDone n
                then len (next n) (m+1)
                else m
-- End of new code

main = do
    [a0] <- getArgs
    let max_a0 = (read a0)::Word32
    print $ maximum $ map (\a0 -> (collatzLen a0, a0)) [1..max_a0]

Compiling with GHC 7.4, 7.6 and 7.10 gives the following timings:

$ ~/Tools/ghc-7.4.2/bin/ghc -O2 Test.hs 
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...

$ time ./Test 1000000
(329,837799)

real    0m1.879s
user    0m1.876s
sys     0m0.000s

$ ~/Tools/ghc-7.6.1/bin/ghc -O2 Test.hs 
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...

$ time ./Test 1000000
(329,837799)

real    0m1.901s
user    0m1.896s
sys     0m0.000s

$ ~/Tools/ghc/bin/ghc -O2 Test.hs 
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...

$ time ./Test 1000000
(329,837799)

real    0m10.562s
user    0m10.528s
sys     0m0.036s

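There is no doubt that the latest GHC produces worse code than the two older versions. I could not reproduce the efficiency reported in the blog post, but that may be because I don't have LLVM and am not using the exact version the author used. Still, I believe the conclusion is clear.

My question is: in general, why can this happen? Somehow GHC has become worse than it was. And specifically, if I want to investigate, where should I start?

Best answer

Here is a comparison of the two profiles (diff Test-GHC-7-8-4.prof Test-GHC-7-10-3.prof)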

1c1                               
<       Fri Mar 11 19:58 2016 Time and Allocation Profiling Report  (Final)
---                               
>       Fri Mar 11 19:59 2016 Time and Allocation Profiling Report  (Final)
5,6c5,6                               
<       total time  =        2.40 secs   (2400 ticks @ 1000 us, 1 processor)
<       total alloc = 256,066,744 bytes  (excludes profiling overheads)
---                               
>       total time  =       10.89 secs   (10895 ticks @ 1000 us, 1 processor)
>       total alloc = 15,713,590,808 bytes  (excludes profiling overheads)
10,13c10,12                               
< lenIterWhile.len Main     93.8   0.0                    
< collatzMax       Main      2.2   93.7
< collatzNext      Main      2.0    0.0
< lenIterWhile     Main      1.5    6.2
---                                
> collatzNext      Main     79.6   89.4
> lenIterWhile.len Main     18.9    8.8
> collatzMax       Main      0.8    1.5

Something really strange is going on. In GHC 7.8.4, lenIterWhile.len took most of the time; now collatzNext is the culprit. Let's have a look at the dumped core:

-- GHC 7.8.4
Rec {
Main.$wlen [Occ=LoopBreaker]
  :: GHC.Prim.Word# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType <S,1*U><L,U>]
Main.$wlen =
  \ (ww_s4Mn :: GHC.Prim.Word#) (ww1_s4Mr :: GHC.Prim.Int#) ->
    case ww_s4Mn of wild_XQ {
      __DEFAULT ->
        case GHC.Prim.remWord# wild_XQ (__word 2) of _ [Occ=Dead] {
          __DEFAULT ->
            Main.$wlen
              (GHC.Prim.quotWord#
                 (GHC.Prim.narrow32Word#
                    (GHC.Prim.plusWord#
                       (GHC.Prim.narrow32Word# (GHC.Prim.timesWord# (__word 3) wild_XQ))
                       (__word 1)))
                 (__word 2))
              (GHC.Prim.+# ww1_s4Mr 1);
          __word 0 ->
            Main.$wlen
              (GHC.Prim.quotWord# wild_XQ (__word 2)) (GHC.Prim.+# ww1_s4Mr 1)
        };
      __word 1 -> ww1_s4Mr
    }
end Rec }

That seems more or less reasonable. Now for GHC 7.10.3:

Rec {
$wlen_r6Sy :: GHC.Prim.Word# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Str=DmdType <S,U><L,U>]
$wlen_r6Sy =
  \ (ww_s60s :: GHC.Prim.Word#) (ww1_s60w :: GHC.Prim.Int#) ->
    case ww_s60s of wild_X1Z {
      __DEFAULT ->
        case even @ Word32 GHC.Word.$fIntegralWord32 (GHC.Word.W32# wild_X1Z) of _ [Occ=Dead] {
          False ->
            $wlen_r6Sy
              (GHC.Prim.quotWord#
                 (GHC.Prim.narrow32Word#
                    (GHC.Prim.plusWord#
                       (GHC.Prim.narrow32Word# (GHC.Prim.timesWord# (__word 3) wild_X1Z))
                       (__word 1)))
                 (__word 2))
              (GHC.Prim.+# ww1_s60w 1);
          True ->
            $wlen_r6Sy
              (GHC.Prim.quotWord# wild_X1Z (__word 2)) (GHC.Prim.+# ww1_s60w 1)
        };
      __word 1 -> ww1_s60w
    }
end Rec }
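
Core listings like the two above can be reproduced by asking GHC to dump its optimised Core. The following is only a minimal sketch, assuming a single-file program: it sets the flags through an OPTIONS_GHC pragma (they can equally be passed on the command line), and the go loop is my own stand-in for lenIterWhile, not the code from the question.

```haskell
{-# OPTIONS_GHC -O2 -ddump-simpl -ddump-to-file #-}
-- -ddump-simpl emits the Core produced by the simplifier;
-- -ddump-to-file writes it to a .dump-simpl file instead of stdout.

import Data.Word (Word32)

-- Same step function as in the question; search the dump for `even`
-- to see whether it was inlined or is still called through a dictionary.
collatzNext :: Word32 -> Word32
collatzNext a = (if even a then a else 3*a+1) `div` 2

-- Counts the steps needed to reach 1 (illustrative replacement for
-- lenIterWhile; the accumulator is forced so no thunks pile up).
collatzLen :: Word32 -> Int
collatzLen = go 0
  where
    go m n
      | n == 1    = m
      | otherwise = let m' = m + 1 in m' `seq` go m' (collatzNext n)

main :: IO ()
main = print (collatzLen 27)
```

Comparing the dump files produced by two compiler versions is then a matter of running diff on them, just as with the .prof files.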

All right, the 7.10.3 core looks the same as the 7.8.4 one, except for the call to even. Let's replace even with one of the inlined Integral operations, e.g. x `rem` 2 == 0:

import Data.Word
import Data.List
import System.Environment

collatzNext :: Word32 -> Word32
collatzNext a = (if a `rem` 2 == 0 then a else 3*a+1) `div` 2

-- rest of code the same

Let's compile it again with profiling and check:

$ stack --resolver=ghc-7.10 ghc -- Test.hs -O2 -fforce-recomp -prof -fprof-auto -auto-all
$ ./Test +RTS -s -p -RTS 1000000
(329,837799)
     416,119,240 bytes allocated in the heap
          69,760 bytes copied during GC
          59,368 bytes maximum residency (2 sample(s))
          21,912 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       800 colls,     0 par    0.000s   0.002s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.000s   0.000s     0.0002s    0.0003s

  INIT    time    0.000s  (  0.019s elapsed)
  MUT     time    2.500s  (  2.546s elapsed)
  GC      time    0.000s  (  0.003s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    2.500s  (  2.567s elapsed)

  %GC     time       0.0%  (0.1% elapsed)

  Alloc rate    166,447,696 bytes per MUT second

  Productivity 100.0% of total user, 97.4% of total elapsed

$ cat Test.prof
        Fri Mar 11 20:22 2016 Time and Allocation Profiling Report  (Final)

           Test.exe +RTS -s -p -RTS 1000000

        total time  =        2.54 secs   (2535 ticks @ 1000 us, 1 processor)
        total alloc = 256,066,984 bytes  (excludes profiling overheads)

COST CENTRE      MODULE  %time %alloc

lenIterWhile.len Main     94.4    0.0
main             Main      1.9   93.7
collatzNext      Main      1.8    0.0
lenIterWhile     Main      1.3    6.2

                                                                   individual     inherited
COST CENTRE           MODULE                     no.     entries  %time %alloc   %time %alloc

MAIN                  MAIN                        44           0    0.0    0.0   100.0  100.0
 main                 Main                        89           0    1.9   93.7   100.0  100.0
  main.\              Main                        92     1000000    0.4    0.0    98.1    6.2
   collatzLen         Main                        93     1000000    0.2    0.0    97.8    6.2
    lenIterWhile      Main                        94     1000000    1.3    6.2    97.5    6.2
     lenIterWhile.len Main                        95    88826840   94.4    0.0    96.2    0.0
      collatzNext     Main                        96    87826840    1.8    0.0     1.8    0.0
  main.max_a0         Main                        90           1    0.0    0.0     0.0    0.0
 CAF                  GHC.IO.Encoding.CodePage    73           0    0.0    0.0     0.0    0.0
 CAF                  System.Environment          64           0    0.0    0.0     0.0    0.0
 CAF                  GHC.IO.Handle.Text          62           0    0.0    0.0     0.0    0.0
 CAF                  GHC.IO.Encoding             61           0    0.0    0.0     0.0    0.0

That seems to have fixed it. So the problem is that GHC 7.8 inlines even, whereas GHC 7.10 doesn't. This is due to the added {-# SPECIALISE even :: x -> Bool #-} rules for x = Int and Integer, which prevent inlining.

As the discussion on the issue documents, making even and odd {-# INLINEABLE ... #-} would fix the problem. Note that the specialization itself was added for performance reasons.
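
To illustrate what such a pragma looks like on an ordinary function, here is a hedged sketch. The name myEven is hypothetical and merely stands in for base's even; this is not the actual change made to base. GHC accepts both spellings, INLINABLE and INLINEABLE.

```haskell
import Data.Word (Word32)

-- Hypothetical stand-in for base's `even` (not a library function).
-- INLINABLE keeps the definition available to callers, so GHC may
-- specialise or inline it at the call site, including at types such
-- as Word32 for which no SPECIALISE rule exists.
myEven :: Integral a => a -> Bool
myEven n = n `rem` 2 == 0
{-# INLINABLE myEven #-}

collatzNext :: Word32 -> Word32
collatzNext a = (if myEven a then a else 3*a+1) `div` 2

-- Counts the steps needed to reach 1 (illustrative loop, with the
-- accumulator forced on every iteration).
collatzLen :: Word32 -> Int
collatzLen = go 0
  where
    go m n
      | n == 1    = m
      | otherwise = let m' = m + 1 in m' `seq` go m' (collatzNext n)

main :: IO ()
main = print $ maximum $ map (\a0 -> (collatzLen a0, a0)) [1 .. 1000 :: Word32]
```

In a single-module program like this one GHC can usually see the definition anyway; the pragma matters mainly when the function lives in another module, as even does in base.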

Regarding haskell - GHC 7.10 generates slower code than older versions, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35941674/
