gpt4 book ai didi

performance - 如何优化可以完全严格的循环

转载 作者:行者123 更新时间:2023-12-04 14:31:16 24 4
gpt4 key购买 nike

我正在尝试为 Project Euler Problem #145 编写一个蛮力解决方案,并且我无法让我的解决方案在不到 1 分 30 秒内运行。

(我知道有各种捷径,甚至是纸笔解决方案;出于这个问题的目的,我不考虑这些)。

在我迄今为止提出的最佳版本中,分析显示大部分时间都花在 foldDigits 上。 .这个函数根本不需要偷懒,在我看来应该优化为一个简单的循环。如您所见,我试图使程序的各个部分变得严格。

所以我的问题是:在不改变整体算法的情况下,有什么方法可以将该程序的执行时间降低到亚分钟标记?

(或者如果没有,有没有办法看到 foldDigits 的代码尽可能优化?)

-- ghc -O3 -threaded Euler-145.hs && Euler-145.exe +RTS -N4

{-# LANGUAGE BangPatterns #-}

import Control.Parallel.Strategies

foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f !acc !n
| n < 10 = i
| otherwise = foldDigits f i d
where (d, m) = n `quotRem` 10
!i = f acc m

reverseNumber :: Int -> Int
reverseNumber !n
= foldDigits accumulate 0 n
where accumulate !v !d = v * 10 + d

allDigitsOdd :: Int -> Bool
allDigitsOdd n
= foldDigits andOdd True n
where andOdd !a d = a && isOdd d
isOdd !x = x `rem` 2 /= 0

isReversible :: Int -> Bool
isReversible n
= notDivisibleByTen n && allDigitsOdd (n + rn)
where rn = reverseNumber n
notDivisibleByTen !x = x `rem` 10 /= 0

countRange acc start end
| start > end = acc
| otherwise = countRange (acc + v) (start + 1) end
where v = if isReversible start then 1 else 0

main
= print $ sum $ parMap rseq cr ranges
where max = 1000000000
qmax = max `div` 4
ranges = [(1, qmax), (qmax, qmax * 2), (qmax * 2, qmax * 3), (qmax * 3, max)]
cr (s, e) = countRange 0 s e

最佳答案

就目前而言,ghc-7.6.1 为 foldDigits 生成的核心(与 -O2 )是

Rec {
$wfoldDigits_r2cK
:: forall a_aha.
(a_aha -> GHC.Types.Int -> a_aha)
-> a_aha -> GHC.Prim.Int# -> a_aha
[GblId, Arity=3, Caf=NoCafRefs, Str=DmdType C(C(S))SL]
$wfoldDigits_r2cK =
\ (@ a_aha)
(w_s284 :: a_aha -> GHC.Types.Int -> a_aha)
(w1_s285 :: a_aha)
(ww_s288 :: GHC.Prim.Int#) ->
case w1_s285 of acc_Xhi { __DEFAULT ->
let {
ds_sNo [Dmd=Just D(D(T)S)] :: (GHC.Types.Int, GHC.Types.Int)
[LclId, Str=DmdType]
ds_sNo =
case GHC.Prim.quotRemInt# ww_s288 10
of _ { (# ipv_aJA, ipv1_aJB #) ->
(GHC.Types.I# ipv_aJA, GHC.Types.I# ipv1_aJB)
} } in
case w_s284 acc_Xhi (case ds_sNo of _ { (d_arS, m_Xsi) -> m_Xsi })
of i_ahg { __DEFAULT ->
case GHC.Prim.<# ww_s288 10 of _ {
GHC.Types.False ->
case ds_sNo of _ { (d_Xsi, m_Xs5) ->
case d_Xsi of _ { GHC.Types.I# ww1_X28L ->
$wfoldDigits_r2cK @ a_aha w_s284 i_ahg ww1_X28L
}
};
GHC.Types.True -> i_ahg
}
}
}
end Rec }

如您所见,它重新装箱了 quotRem 的结果称呼。问题是 f 没有属性在这里可用,作为递归函数, foldDigits不能内联。

使用手动 worker 包装器转换使函数参数静态,
foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f = go
where
go !acc 0 = acc
go acc n = case n `quotRem` 10 of
(q,r) -> go (f acc r) q
foldDigits变为可内联的,您将获得专门的版本,供您对未装箱数据进行操作,但没有顶级 foldDigits ,例如
Rec {
$wgo_r2di :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType LL]
$wgo_r2di =
\ (ww_s28F :: GHC.Prim.Int#) (ww1_s28J :: GHC.Prim.Int#) ->
case ww1_s28J of ds_XJh {
__DEFAULT ->
case GHC.Prim.quotRemInt# ds_XJh 10
of _ { (# ipv_aJK, ipv1_aJL #) ->
$wgo_r2di (GHC.Prim.+# (GHC.Prim.*# ww_s28F 10) ipv1_aJL) ipv_aJK
};
0 -> ww_s28F
}
end Rec }

并且对计算时间的影响是有形的,对于原始的,我得到了

$ ./eul145 +RTS -s -N2
608720
1,814,289,579,592 bytes allocated in the heap
196,407,088 bytes copied during GC
47,184 bytes maximum residency (2 sample(s))
30,640 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)

Tot time (elapsed) Avg pause Max pause
Gen 0 1827331 colls, 1827331 par 23.77s 11.86s 0.0000s 0.0041s
Gen 1 2 colls, 1 par 0.00s 0.00s 0.0001s 0.0001s

Parallel GC work balance: 54.94% (serial 0%, perfect 100%)

TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

SPARKS: 4 (3 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)

INIT time 0.00s ( 0.00s elapsed)
MUT time 620.52s (313.51s elapsed)
GC time 23.77s ( 11.86s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 644.29s (325.37s elapsed)

Alloc rate 2,923,834,808 bytes per MUT second

(我使用 -N2 因为我的 i5 只有两个物理内核),vs.

$ ./eul145 +RTS -s -N2
608720
16,000,063,624 bytes allocated in the heap
403,384 bytes copied during GC
47,184 bytes maximum residency (2 sample(s))
30,640 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)

Tot time (elapsed) Avg pause Max pause
Gen 0 15852 colls, 15852 par 0.34s 0.17s 0.0000s 0.0037s
Gen 1 2 colls, 1 par 0.00s 0.00s 0.0001s 0.0001s

Parallel GC work balance: 43.86% (serial 0%, perfect 100%)

TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

SPARKS: 4 (3 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)

INIT time 0.00s ( 0.00s elapsed)
MUT time 314.85s (160.08s elapsed)
GC time 0.34s ( 0.17s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 315.20s (160.25s elapsed)

Alloc rate 50,817,657 bytes per MUT second

Productivity 99.9% of total user, 196.5% of total elapsed

与修改。运行时间大约减半,分配减少了 100 倍。

关于performance - 如何优化可以完全严格的循环,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/13252992/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com