
haskell - How to parse a large XML file in Haskell with limited resources?

Reposted. Author: 行者123. Updated: 2023-12-04 21:13:34

I want to extract information from a large XML file (about 20 GB) in Haskell. Because the file is so large, I used the SAX parsing functions from hexpat.
Here is the simple code I tested:

import qualified Data.ByteString.Lazy as L
import Data.List (foldl')
import Data.Text (Text)
import Text.XML.Expat.SAX as Sax

-- processEvent is not shown in the question
parse :: FilePath -> IO ()
parse path = do
  inputText <- L.readFile path
  let saxEvents = Sax.parse defaultParseOptions inputText :: [SAXEvent Text Text]
  let txt = foldl' processEvent "" saxEvents
  putStrLn txt
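After enabling profiling in Cabal, it reports that parse.saxEvents accounts for 85% of the allocated memory. I also tried foldr, with the same result.
If processEvent becomes complex enough, the program crashes with a stack space overflow error.
What am I doing wrong?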

Best answer

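You don't say what processEvent is like. In principle, there should be nothing wrong with using a lazy ByteString for a strict left fold over lazily generated input, so I'm not sure what is going wrong in your case. But one ought to use streaming-appropriate types when dealing with gigantic files!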
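
Since processEvent isn't shown, the following is only a guess at one common cause: foldl' forces the accumulator only to weak head normal form, so a lazy accumulator such as a String or a tuple can still pile up thunks. Here is a minimal sketch, using only base, of a strict left fold over a lazily generated event list. The Event type, Stats, step, events and summarize are hypothetical stand-ins for illustration, not part of hexpat:

```haskell
module Main (main) where

import Data.List (foldl')

-- Hypothetical stand-in for hexpat's SAXEvent, so the sketch needs only base.
data Event
  = Start String
  | End String
  | Chars String

-- Accumulator with strict fields: forcing a Stats value to weak head
-- normal form (which foldl' does at every step) also forces both counters.
data Stats = Stats
  { elements :: !Int
  , textLen  :: !Int
  } deriving (Show, Eq)

-- One fold step: count start tags and the total length of character data.
step :: Stats -> Event -> Stats
step (Stats e t) (Start _) = Stats (e + 1) t
step (Stats e t) (Chars s) = Stats e (t + length s)
step acc         (End _)   = acc

-- A lazily generated event list; the fold consumes it incrementally.
events :: Int -> [Event]
events n = concatMap (\i -> [Start "item", Chars (show i), End "item"]) [1 .. n]

summarize :: Int -> Stats
summarize = foldl' step (Stats 0 0) . events

main :: IO ()
main = print (summarize 1000000)
```

With hexpat the same shape would apply with SAXEvent in place of Event. Whether this resolves the memory profile described in the question is untested here.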

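In fact, hexpat does have a "streaming" interface (just like xml-conduit). It uses the not-too-well-known List library and the rather ugly List class it defines. In principle the ListT type from the List package should work fine. I gave up on it quickly because of the lack of combinators, and instead wrote an appropriate instance of the ugly List class for a wrapped version of Pipes.ListT, which I then used to export ordinary Pipes.Producer functions such as parseProducer. The trivial manipulations needed for this are appended below as PipesSax.hs.
Once we have parseProducer, we can convert a ByteString or Text Producer into a Producer of SaxEvents with Text or ByteString components. Here are some simple operations. I was using a 238M "input.xml"; the programs never need more than 6 MB of memory, judging from top.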

-- Sax.hs: most of the IO actions use the registryIds pipe, defined at the bottom, which is tailored to a giant bit of XML, of which this is a valid 1000 fragment: http://sprunge.us/WaQK

{-#LANGUAGE OverloadedStrings #-}
import PipesSax ( parseProducer )
import Data.ByteString ( ByteString )
import Text.XML.Expat.SAX
import Pipes -- cabal install pipes pipes-bytestring
import Pipes.ByteString (toHandle, fromHandle, stdin, stdout )
import qualified Pipes.Prelude as P
import qualified System.IO as IO
import qualified Data.ByteString.Char8 as Char8

sax :: MonadIO m => Producer ByteString m ()
    -> Producer (SAXEvent ByteString ByteString) m ()
sax = parseProducer defaultParseOptions

-- stream xml from stdin, yielding hexpat tagstream to stdout;
main0 :: IO ()
main0 = runEffect $ sax stdin >-> P.print

-- stream the extracted 'IDs' from stdin to stdout
main1 :: IO ()
main1 = runEffect $ sax stdin >-> registryIds >-> stdout

-- write all IDs to a file
main2 =
  IO.withFile "input.xml" IO.ReadMode $ \inp ->
    IO.withFile "output.txt" IO.WriteMode $ \out ->
      runEffect $ sax (fromHandle inp) >-> registryIds >-> toHandle out

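-- folds:
-- print number of IDs; registryIds also yields a "\n" after each ID,
-- so the separators are filtered out before counting
main3 = IO.withFile "input.xml" IO.ReadMode $ \inp -> do
  n <- P.length $ sax (fromHandle inp) >-> registryIds >-> P.filter (/= "\n")
  print n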

-- sum the meaningful part of the IDs - a dumb fold for illustration
main4 = IO.withFile "input.xml" IO.ReadMode $ \inp -> do
    let pipeline = sax (fromHandle inp) >-> registryIds >-> P.map readIntId
    n <- P.fold (+) 0 id pipeline
    print n
  where
    readIntId :: ByteString -> Integer
    readIntId = maybe 0 (fromIntegral.fst) . Char8.readInt . Char8.drop 2

-- my xml has tags with attributes that appear via hexpat thus:
-- StartElement "FacilitySite" [("registryId","110007915364")]
-- and the like. This is just an arbitrary demo stream manipulation.
registryIds :: Monad m => Pipe (SAXEvent ByteString ByteString) ByteString m ()
registryIds = do
  e <- await -- we look for a 'SAXEvent'
  case e of -- if it matches, we yield, else we go to the next event
    StartElement "FacilitySite" [("registryId",a)] -> do
      yield a
      yield "\n"
      registryIds
    _ -> registryIds
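
Note that the pattern in registryIds only matches a FacilitySite tag whose attribute list is exactly the single registryId pair. If the tags in your file carry other attributes as well, a variant that looks the attribute up by name may be more useful. The sketch below assumes the pipes, hexpat and bytestring packages are installed; registryIdsAny is a hypothetical name, and the events in main are made-up sample data:

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import Data.ByteString (ByteString)
import Pipes
import qualified Pipes.Prelude as P
import Text.XML.Expat.SAX (SAXEvent (..))

-- Hypothetical variant of registryIds: yields the "registryId" attribute
-- of every FacilitySite start tag, whatever other attributes are present.
-- All other events are skipped.
registryIdsAny :: Monad m => Pipe (SAXEvent ByteString ByteString) ByteString m r
registryIdsAny = for cat $ \e -> case e of
  StartElement "FacilitySite" attrs
    | Just a <- lookup "registryId" attrs -> yield a >> yield "\n"
  _ -> return ()

-- Run the pipe over a small in-memory event list (made-up sample data).
main :: IO ()
main = do
  let evs = [ StartElement "FacilitySite" [("name", "x"), ("registryId", "110007915364")]
            , EndElement "FacilitySite"
            , StartElement "Other" [("registryId", "1")]
            ]
  print (P.toList (each evs >-> registryIdsAny))
```

It can replace registryIds in any of the pipelines above, for example `sax stdin >-> registryIdsAny >-> stdout`.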

-- 'Library': PipesSax.hs

This just newtypes Pipes.ListT to get the appropriate instances. We don't export anything to do with List or ListT, but just use the standard Pipes.Producer concept.
{-#LANGUAGE TypeFamilies, GeneralizedNewtypeDeriving #-}
module PipesSax (parseProducerLocations, parseProducer) where
import Data.ByteString (ByteString)
import Text.XML.Expat.SAX
import Data.List.Class
import Control.Monad
import Control.Applicative
import Pipes
import qualified Pipes.Internal as I

parseProducer
  :: (Monad m, GenericXMLString tag, GenericXMLString text)
  => ParseOptions tag text
  -> Producer ByteString m ()
  -> Producer (SAXEvent tag text) m ()
parseProducer opt = enumerate . enumerate_
                  . parseG opt
                  . Select_ . Select

parseProducerLocations
  :: (Monad m, GenericXMLString tag, GenericXMLString text)
  => ParseOptions tag text
  -> Producer ByteString m ()
  -> Producer (SAXEvent tag text, XMLParseLocation) m ()
parseProducerLocations opt =
  enumerate . enumerate_ . parseLocationsG opt . Select_ . Select

newtype ListT_ m a = Select_ { enumerate_ :: ListT m a }
  deriving (Functor, Monad, MonadPlus, MonadIO
           , Applicative, Alternative, Monoid, MonadTrans)

instance Monad m => List (ListT_ m) where
  type ItemM (ListT_ m) = m
  joinL = Select_ . Select . I.M . liftM (enumerate . enumerate_)
  runList = liftM emend . next . enumerate . enumerate_
    where
      emend (Right (a,q)) = Cons a (Select_ (Select q))
      emend _ = Nil

The original question, "haskell - How to parse a large XML file in Haskell with limited resources?", is on Stack Overflow: https://stackoverflow.com/questions/29450397/
