gpt4 book ai didi

haskell - Haskell 中的堆溢出

转载 作者:行者123 更新时间:2023-12-02 02:47:46 30 4
gpt4 key购买 nike

我收到 Heap exhausted在足够大的数据集上运行以下简短的 Haskell 程序时的消息。例如,程序在大约 900k 行的 20 Mb 输入文件上失败(堆溢出)。堆大小(通过 -with-rtsopts)设置为 1 Gb。如果 longestCommonSubstrB 则运行正常被定义为更简单的东西,例如commonPrefix .我需要处理大约 100 Mb 的文件。

我使用以下命令行(GHC 7.8.3)编译了程序:

ghc -Wall -O2 -prof -fprof-auto "-with-rtsopts=-M512M -p -s -h -i0.1" SampleB.hs

我会很感激任何帮助让这个东西在合理的空间中运行(按照输入文件大小的顺序),但我特别感激寻找瓶颈在哪里以及在哪里以及如何强制严格的思考过程。

我的猜测是,以某种方式强制 longestCommonSubstrB严格评估的功能可以解决问题,但我不知道该怎么做。

{-# LANGUAGE BangPatterns #-}
module Main where
import System.Environment (getArgs)
import qualified Data.ByteString.Lazy.Char8 as B
import Data.List (maximumBy, sort)
import Data.Function (on)
import Data.Char (isSpace)

-- | Returns a list of lexicon items, i.e. [[w1,w2,w3]]
readLexicon :: FilePath -> IO [[B.ByteString]]
readLexicon filename = do
text <- B.readFile filename
return $ map (B.split '\t' . stripR) . B.lines $ text
where
stripR = B.reverse . B.dropWhile isSpace . B.reverse

transformOne :: [B.ByteString] -> B.ByteString
transformOne (w1:w2:w3:[]) =
B.intercalate (B.pack "|") [w1, longestCommonSubstrB w2 w1, w3]
transformOne a = error $ "transformOne: unexpected tuple " ++ show a

longestCommonSubstrB :: B.ByteString -> B.ByteString -> B.ByteString
longestCommonSubstrB xs ys = maximumBy (compare `on` B.length) . concat $
[f xs' ys | xs' <- B.tails xs] ++
[f xs ys' | ys' <- tail $ B.tails ys]
where f xs' ys' = scanl g B.empty $ B.zip xs' ys'
g z (x, y) = if x == y
then z `B.snoc` x
else B.empty

main :: IO ()
main = do
(input:output:_) <- getArgs
lexicon <- readLexicon input
let flattened = B.unlines . sort . map transformOne $ lexicon
B.writeFile output flattened

这是测试数据集的配置文件输出(100k 行,堆大小设置为 1 GB,即 generateSample.exe 100000,生成的文件大小为 2.38 MB):

随时间变化的堆配置文件:

Memory usage over time

执行统计:
   3,505,737,588 bytes allocated in the heap
785,283,180 bytes copied during GC
62,390,372 bytes maximum residency (44 sample(s))
216,592 bytes maximum slop
96 MB total memory in use (0 MB lost due to fragmentation)

Tot time (elapsed) Avg pause Max pause
Gen 0 6697 colls, 0 par 1.05s 1.03s 0.0002s 0.0013s
Gen 1 44 colls, 0 par 4.14s 3.99s 0.0906s 0.1935s

INIT time 0.00s ( 0.00s elapsed)
MUT time 7.80s ( 9.17s elapsed)
GC time 3.75s ( 3.67s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 1.44s ( 1.35s elapsed)
EXIT time 0.02s ( 0.00s elapsed)
Total time 13.02s ( 12.85s elapsed)

%GC time 28.8% (28.6% elapsed)

Alloc rate 449,633,678 bytes per MUT second

Productivity 60.1% of total user, 60.9% of total elapsed

时间和分配分析报告:
       SampleB.exe +RTS -M1G -p -s -h -i0.1 -RTS sample.txt sample_out.txt

total time = 3.97 secs (3967 ticks @ 1000 us, 1 processor)
total alloc = 2,321,595,564 bytes (excludes profiling overheads)

COST CENTRE MODULE %time %alloc

longestCommonSubstrB Main 43.3 33.1
longestCommonSubstrB.f Main 21.5 43.6
main.flattened Main 17.5 5.1
main Main 6.6 5.8
longestCommonSubstrB.g Main 5.0 5.8
readLexicon Main 2.5 2.8
transformOne Main 1.8 1.7
readLexicon.stripR Main 1.8 1.9


individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc

MAIN MAIN 45 0 0.1 0.0 100.0 100.0
main Main 91 0 6.6 5.8 99.9 100.0
main.flattened Main 93 1 17.5 5.1 89.1 89.4
transformOne Main 95 100000 1.8 1.7 71.6 84.3
longestCommonSubstrB Main 100 100000 43.3 33.1 69.8 82.5
longestCommonSubstrB.f Main 101 1400000 21.5 43.6 26.5 49.5
longestCommonSubstrB.g Main 104 4200000 5.0 5.8 5.0 5.8
readLexicon Main 92 1 2.5 2.8 4.2 4.8
readLexicon.stripR Main 98 0 1.8 1.9 1.8 1.9
CAF GHC.IO.Encoding.CodePage 80 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 74 0 0.0 0.0 0.0 0.0
CAF GHC.IO.FD 70 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 66 0 0.0 0.0 0.0 0.0
CAF System.Environment 65 0 0.0 0.0 0.0 0.0
CAF Data.ByteString.Lazy.Char8 54 0 0.0 0.0 0.0 0.0
CAF Main 52 0 0.0 0.0 0.0 0.0
transformOne Main 99 0 0.0 0.0 0.0 0.0
readLexicon Main 96 0 0.0 0.0 0.0 0.0
readLexicon.stripR Main 97 1 0.0 0.0 0.0 0.0
main Main 90 1 0.0 0.0 0.0 0.0

更新:以下程序可用于生成样本数据。它需要一个参数,即生成的数据集中的行数。生成的数据会保存到 sample.txt文件。当我用它生成 900k 行数据集时(通过运行 generateSample.exe 900000 ),生成的数据集使上述程序因堆溢出而失败(堆大小设置为 1 GB)。生成的数据集约为 20 MB。

module Main where 
import System.Environment (getArgs)
import Data.List (intercalate, permutations)

generate :: Int -> [(String,String,String)]
generate n = take n $ zip3 (f "banana") (f "ruanaba") (f "kikiriki")
where
f = cycle . permutations

main :: IO ()
main = do
(n:_) <- getArgs
let flattened = unlines . map f $ generate (read n :: Int)
writeFile "sample.txt" flattened
where
f (w1,w2,w3) = intercalate "\t" [w1, w2, w3]

最佳答案

在我看来,您已经实现了一个天真的最长公共(public)子字符串,具有可怕的空间复杂度(至少 O(n^2))。严格与它无关。

你会想要实现一个动态编程算法。您可以在 string-similarity 中找到灵感包,或在 lcsDiff 的内脏中发挥作用包裹。

关于haskell - Haskell 中的堆溢出,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/32520595/

30 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com