
multithreading - Haskell/GHC per-thread memory cost

Reposted. Author: IT王子. Updated: 2023-10-28 23:29:41

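I'm trying to understand how expensive a (green) thread in Haskell (GHC 7.10.1 on OS X 10.10.5) really is. I'm aware that it is very cheap compared to a real OS thread, both in memory usage and in CPU.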

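Right, so I started by writing a super simple program that forks n (green) threads (using the excellent async library) and then just lets each thread sleep for m seconds.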

Well, that's easy enough:

$ cat PerTheadMem.hs 
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (mapConcurrently)
import System.Environment (getArgs)

main = do
  args <- getArgs
  let (numThreads, sleep) = case args of
        numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
        _ -> error "wrong args"
  mapConcurrently (\_ -> threadDelay (sleep*1000*1000)) [1..numThreads]

First off, let's compile and run it:

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.1
$ ghc -rtsopts -O3 -prof -auto-all -caf-all PerTheadMem.hs
$ time ./PerTheadMem 100000 10 +RTS -sstderr

That should fork 100k threads, wait 10 s in each of them, and then print some statistics:

$ time ./PerTheadMem 100000 10 +RTS -sstderr
340,942,368 bytes allocated in the heap
880,767,000 bytes copied during GC
164,702,328 bytes maximum residency (11 sample(s))
21,736,080 bytes maximum slop
350 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       648 colls,     0 par    0.373s   0.415s     0.0006s    0.0223s
  Gen  1        11 colls,     0 par    0.298s   0.431s     0.0392s    0.1535s

INIT time 0.000s ( 0.000s elapsed)
MUT time 79.062s ( 92.803s elapsed)
GC time 0.670s ( 0.846s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.065s ( 0.091s elapsed)
Total time 79.798s ( 93.740s elapsed)

%GC time 0.8% (0.9% elapsed)

Alloc rate 4,312,344 bytes per MUT second

Productivity 99.2% of total user, 84.4% of total elapsed


real 1m33.757s
user 1m19.799s
sys 0m2.260s

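That took quite a while (1m33.757s) given that each thread is only supposed to wait for 10 s, but we built it non-threaded, so fair enough for now. All in all we used 350 MB, which is not too bad: about 3.5 KB per thread, which seems plausible given that the initial stack size (-ki) is 1 KB.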
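
The per-thread figure is simple division, and it can be checked against the numbers printed above. The sketch below is only an illustration: the helper names `mbToBytes` and `perThread` are made up for this example, and it assumes that "MB" in the RTS summary means 1024 * 1024 bytes.

```haskell
-- Rough per-thread cost derived from the +RTS -sstderr summary above.
-- Assumption: "MB" in "total memory in use" means 1024 * 1024 bytes.
mbToBytes :: Int -> Int
mbToBytes mb = mb * 1024 * 1024

-- Integer division of a byte count by the number of green threads.
perThread :: Int -> Int -> Int
perThread bytes numThreads = bytes `div` numThreads

main :: IO ()
main = do
  let threads = 100000
  -- 350 MB total memory in use
  putStrLn ("total memory per thread:  " ++ show (perThread (mbToBytes 350) threads) ++ " bytes")
  -- 164,702,328 bytes maximum residency
  putStrLn ("max residency per thread: " ++ show (perThread 164702328 threads) ++ " bytes")
```

The total works out to roughly 3.6 KB per thread and the maximum residency to roughly 1.6 KB per thread, so the live data per thread is not far above the 1 KB initial stack.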

OK, but now let's compile in threaded mode and see if we can get any faster:

$ ghc -rtsopts -O3 -prof -auto-all -caf-all -threaded PerTheadMem.hs
$ time ./PerTheadMem 100000 10 +RTS -sstderr
3,996,165,664 bytes allocated in the heap
2,294,502,968 bytes copied during GC
3,443,038,400 bytes maximum residency (20 sample(s))
14,842,600 bytes maximum slop
3657 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      6435 colls,     0 par    0.860s   1.022s     0.0002s    0.0028s
  Gen  1        20 colls,     0 par    2.206s   2.740s     0.1370s    0.3874s

TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT time 0.000s ( 0.001s elapsed)
MUT time 0.879s ( 8.534s elapsed)
GC time 3.066s ( 3.762s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.074s ( 0.247s elapsed)
Total time 4.021s ( 12.545s elapsed)

Alloc rate 4,544,893,364 bytes per MUT second

Productivity 23.7% of total user, 7.6% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

real 0m12.565s
user 0m4.021s
sys 0m1.154s

Wow, much faster: just 12 s now, way better. In Activity Monitor I saw that it used roughly 4 OS threads for the 100k green threads, which makes sense.
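
The TASKS line in the output above says "using -N1", so the threaded build still runs Haskell code on a single capability; the other OS threads are RTS workers. As a small sketch (the helper `describeRts` is a name invented for this example), a program can report which runtime it was linked against and how many capabilities it has:

```haskell
import Control.Concurrent (getNumCapabilities, rtsSupportsBoundThreads)

-- Format a short description of the runtime configuration.
describeRts :: Bool -> Int -> String
describeRts threaded caps =
  (if threaded then "threaded" else "non-threaded")
    ++ " RTS, " ++ show caps ++ " capability(ies)"

main :: IO ()
main = do
  -- rtsSupportsBoundThreads is True only when linked with -threaded.
  caps <- getNumCapabilities
  putStrLn (describeRts rtsSupportsBoundThreads caps)
```

Printing this at startup makes it harder to mix up the two builds when comparing measurements.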

But 3657 MB of total memory! That is about 10x more than the non-threaded version used...

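Up until now I hadn't done any profiling with -prof or -hy. To investigate further, I did some heap profiling (-hy) in separate runs. The memory usage didn't change in either case, and the heap profile graphs look interestingly different (left: non-threaded, right: threaded), but I can't find the reason for the 10x difference. [Figure: heap profile diffs]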

Diffing the profiling output (.prof files), I can't find any real difference either. [Figure: prof diffs]

Therefore my question: where does the 10x difference in memory usage come from?

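EDIT: Just to mention it: the same difference applies when the program isn't even compiled with profiling support. Running time ./PerTheadMem 100000 10 +RTS -sstderr after building with ghc -rtsopts -threaded -fforce-recomp PerTheadMem.hs reports 3559 MB, and after building with ghc -rtsopts -fforce-recomp PerTheadMem.hs it reports 395 MB.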

EDIT 2: On Linux (GHC 7.10.2 on Linux 3.13.0-32-generic #57-Ubuntu SMP, x86_64) the same happens: non-threaded uses 460 MB in 1m28.538s and threaded uses 3483 MB in 12.604s. /usr/bin/time -v ... reports Maximum resident set size (kbytes): 413684 and Maximum resident set size (kbytes): 1645384, respectively.

EDIT 3: I also changed the program to use forkIO directly:

import Control.Concurrent (threadDelay, forkIO)
import Control.Concurrent.MVar
import Control.Monad (mapM_)
import System.Environment (getArgs)

main = do
  args <- getArgs
  let (numThreads, sleep) = case args of
        numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
        _ -> error "wrong args"
  mvar <- newEmptyMVar
  mapM_ (\_ -> forkIO $ threadDelay (sleep*1000*1000) >> putMVar mvar ())
        [1..numThreads]
  mapM_ (\_ -> takeMVar mvar) [1..numThreads]

It doesn't change anything: non-threaded: 152 MB, threaded: 3308 MB.

Best answer

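IMHO, the culprit is threadDelay. threadDelay uses a lot of memory. Here is a program equivalent to yours that behaves better with memory. It ensures that all the threads are running concurrently by giving each one a long-running computation.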

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import System.Environment (getArgs)

uBound = 38
lBound = 34

-- Deliberately slow Fibonacci-style recursion to keep every thread busy.
doSomething :: Integer -> Integer
doSomething 0 = 1
doSomething 1 = 1
doSomething n
  | n < uBound && n > 0 =
      let a = doSomething (n-1)
          b = doSomething (n-2)
      in a `seq` b `seq` (a + b)
  | otherwise = doSomething (n `mod` uBound)

-- Worker: compute a value for an input between lBound and uBound, then send it.
e :: Chan Integer -> Int -> IO ()
e chan i = do
  let y = doSomething . fromIntegral $ lBound + (fromIntegral i `mod` (uBound - lBound))
  y `seq` writeChan chan y

main = do
  args <- getArgs
  let (numThreads, sleep) = case args of
        numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
        _ -> error "wrong args"
      dld = (sleep*1000*1000)  -- unused here: this version never calls threadDelay
  chan <- newChan
  mapM_ (\i -> forkIO $ e chan i) [1..numThreads]
  putStrLn "All threads created"
  mapM_ (\_ -> readChan chan >>= putStrLn . show) [1..numThreads]
  putStrLn "All read"

Here are the timing statistics:

$ ghc -rtsopts -O -threaded test.hs
$ ./test 200 10 +RTS -sstderr -N4

133,541,985,480 bytes allocated in the heap
176,531,576 bytes copied during GC
356,384 bytes maximum residency (16 sample(s))
94,256 bytes maximum slop
4 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     64246 colls, 64246 par    1.185s   0.901s     0.0000s    0.0274s
  Gen  1        16 colls,    15 par    0.004s   0.002s     0.0001s    0.0002s

Parallel GC work balance: 65.96% (serial 0%, perfect 100%)

TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT time 0.000s ( 0.003s elapsed)
MUT time 63.747s ( 16.333s elapsed)
GC time 1.189s ( 0.903s elapsed)
EXIT time 0.001s ( 0.000s elapsed)
Total time 64.938s ( 17.239s elapsed)

Alloc rate 2,094,861,384 bytes per MUT second

Productivity 98.2% of total user, 369.8% of total elapsed

gc_alloc_block_sync: 98548
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 2

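Max residency is around 1.5 KB per thread. I played a bit with the number of threads and the running length of the computation. Since threads start doing work immediately after forkIO, creating 100000 threads actually takes a very long time. But the results held for 1000 threads.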

Here is another program where threadDelay has been "factored out". This one doesn't use any CPU and can easily be run with 100000 threads:

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import System.Environment (getArgs)

-- Worker: wait for the start signal, then report completion.
e :: MVar () -> MVar () -> IO ()
e start end = do
  takeMVar start
  putMVar end ()

main = do
  args <- getArgs
  let (numThreads, sleep) = case args of
        numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
        _ -> error "wrong args"
  starts <- mapM (const newEmptyMVar) [1..numThreads]
  ends <- mapM (const newEmptyMVar) [1..numThreads]
  mapM_ (\(start, end) -> forkIO $ e start end) (zip starts ends)
  mapM_ (\start -> putMVar start ()) starts
  putStrLn "All threads created"
  threadDelay (sleep * 1000 * 1000)  -- the only threadDelay, in the main thread
  mapM_ (\end -> takeMVar end) ends
  putStrLn "All done"

Results:

129,270,632 bytes allocated in the heap
404,154,872 bytes copied during GC
77,844,160 bytes maximum residency (10 sample(s))
10,929,688 bytes maximum slop
165 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       128 colls,   128 par    0.178s   0.079s     0.0006s    0.0152s
  Gen  1        10 colls,     9 par    0.367s   0.137s     0.0137s    0.0325s

Parallel GC work balance: 50.09% (serial 0%, perfect 100%)

TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT time 0.000s ( 0.001s elapsed)
MUT time 0.189s ( 10.094s elapsed)
GC time 0.545s ( 0.217s elapsed)
EXIT time 0.001s ( 0.002s elapsed)
Total time 0.735s ( 10.313s elapsed)

Alloc rate 685,509,460 bytes per MUT second

Productivity 25.9% of total user, 1.8% of total elapsed

On my i5, it takes less than one second to create the 100000 threads and fill each "start" MVar. The peak residency is around 778 bytes per thread, not bad at all!


Checking the implementation of threadDelay, we see that it really is different for the threaded and non-threaded cases:

https://hackage.haskell.org/package/base-4.8.1.0/docs/src/GHC.Conc.IO.html#threadDelay

Then here: https://hackage.haskell.org/package/base-4.8.1.0/docs/src/GHC.Event.TimerManager.html
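
To make the difference concrete, here is a rough paraphrase of what the threaded path appears to do, written against the public GHC.Event API. This is a sketch, not the library's actual code: `delayViaTimerManager` and `millisToMicros` are names invented for this example, and it assumes a POSIX platform where GHC.Event is available. The point is that every sleeping thread allocates its own MVar and its own timeout registration in the timer manager.

```haskell
import Control.Concurrent (rtsSupportsBoundThreads, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (mask_, onException)
import GHC.Event (getSystemTimerManager, registerTimeout, unregisterTimeout)

-- Convert milliseconds to the microseconds expected by the delay functions.
millisToMicros :: Int -> Int
millisToMicros ms = ms * 1000

-- Hypothetical stand-in for the threaded threadDelay path: register a
-- timeout whose callback fills an MVar, then block on that MVar.
delayViaTimerManager :: Int -> IO ()
delayViaTimerManager usecs
  | rtsSupportsBoundThreads = mask_ $ do
      mgr  <- getSystemTimerManager
      done <- newEmptyMVar                          -- one MVar per sleeping thread
      key  <- registerTimeout mgr usecs (putMVar done ())
      takeMVar done `onException` unregisterTimeout mgr key
  | otherwise = threadDelay usecs                   -- the timer manager needs -threaded

main :: IO ()
main = do
  delayViaTimerManager (millisToMicros 100)
  putStrLn "woke up"
```

With 100000 threads sleeping at once, that is 100000 MVars, callbacks and timer-queue entries alive at the same time, in addition to the thread stacks themselves. Whether this alone accounts for the whole 10x difference is not something this sketch demonstrates.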

That looks innocent enough. But an older version of base has an arcane spelling of (memory) doom for those who invoke threadDelay:

https://hackage.haskell.org/package/base-4.4.0.0/docs/src/GHC-Event-Manager.html#line-121

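Whether there is still an issue is hard to say. However, one can always hope that a "real life" concurrent program doesn't need to have too many threads waiting on threadDelay at the same time. I for one will keep an eye on my usage of threadDelay from now on.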
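
If many threads really do need to wait for the same amount of time, one possible way to avoid a timer per thread is to let them all block on a shared MVar and sleep once in a coordinating thread. The sketch below is an assumption-laden illustration, not a measured fix: `sleepAll` is a name invented here, and it relies on readMVar leaving the value in place so that every blocked reader wakes up.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, readMVar, takeMVar)
import Control.Monad (replicateM, replicateM_)

-- Fork n workers that block on one shared gate, sleep once in the calling
-- thread, open the gate, and return how many workers reported back.
sleepAll :: Int -> Int -> IO Int
sleepAll n usecs = do
  gate <- newEmptyMVar            -- stays empty until the delay has elapsed
  done <- newEmptyMVar
  replicateM_ n . forkIO $ do
    readMVar gate                 -- blocks without registering a timer
    putMVar done ()
  threadDelay usecs               -- the only threadDelay in the program
  putMVar gate ()                 -- readMVar does not empty the gate, so all waiters proceed
  acks <- replicateM n (takeMVar done)
  return (length acks)

main :: IO ()
main = do
  woken <- sleepAll 1000 (1 * 1000 * 1000)
  putStrLn (show woken ++ " workers woke up after a single delay")
```

This only fits the case where every thread waits for the same deadline; threads with different deadlines would need one gate per deadline or a different design.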

This post is based on a similar question found on Stack Overflow, "multithreading - Haskell/GHC per-thread memory cost": https://stackoverflow.com/questions/33149324/
