parsing - How do I parse a 7GB file with Data.ByteString?

Reposted by: 行者123 Updated: 2023-12-02 20:58:17

I have to parse a file, and of course I have to read it first. Here is my program:

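import qualified Data.ByteString.Char8 as B
import System.Environment (getArgs)

main :: IO ()
main = do
  args      <- getArgs
  let path  =  args !! 0
  -- Strict readFile: reads the whole file into a single ByteString
  content   <- B.readFile path
  let lines = B.lines content
  foobar lines

-- Placeholder for the actual per-line processing
foobar :: [B.ByteString] -> IO ()
foobar _ = return ()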

However, after compiling it with

> ghc --make -O2 tmp.hs

and running it on a 7GB file, execution fails with the following error.

> ./tmp  big_big_file.dat
> tmp: {handle: big_big_file.dat}: hGet: illegal ByteString size (-1501792951): illegal operation

Thanks for your replies!

Best answer

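The length of a ByteString is an Int. If Int is 32 bits, the size of a 7GB file is outside the range of Int, so the buffer request is made for the wrong size, and that size can easily come out negative.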

The code for readFile converts the file size to an Int for the buffer request:

readFile :: FilePath -> IO ByteString
readFile f = bracket (openBinaryFile f ReadMode) hClose
    (\h -> hFileSize h >>= hGet h . fromIntegral)

If that conversion overflows, the most likely outcomes are an "illegal ByteString size" error or a segmentation fault.
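To see how a negative size like the one in the error message can arise, here is a minimal sketch of the wrap-around. The file size 7088141641 is an assumption: the question does not state the exact size, and this is simply one value close to 7GB that wraps to the reported number. `wrapToInt32` is a hypothetical helper, and `Int32` stands in for `Int` on a 32-bit GHC.

```haskell
import Data.Int (Int32)

-- Hypothetical helper: mimics what `fromIntegral` does when the Integer
-- returned by hFileSize is narrowed to a 32-bit Int (wraps modulo 2^32).
wrapToInt32 :: Integer -> Int32
wrapToInt32 = fromIntegral

main :: IO ()
main = do
  -- Assumed file size in bytes (not given in the question).
  let assumedSize = 7088141641 :: Integer
  -- 7088141641 - 2 * 2^32 = -1501792951
  print (wrapToInt32 assumedSize)
  -- Largest size a 32-bit Int can represent, for comparison.
  print (maxBound :: Int32)
```

Under that assumption the first line printed is -1501792951, the same value hGet complains about, and the second is 2147483647, roughly 2GB.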

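If at all possible, use lazy ByteStrings to handle files that large. In your case you pretty much have to make it possible, because with 32-bit Ints a 7GB ByteString cannot be created at all.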

If you need the lines as strict ByteStrings for processing, and none of the lines is excessively long, you can get there by going through lazy ByteStrings:

import qualified Data.ByteString.Lazy.Char8 as LC
import qualified Data.ByteString.Char8 as C

main = do
    ...
    content <- LC.readFile path
    let llns = LC.lines content
        slns = map (C.concat . LC.toChunks) llns
    foobar slns
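For reference, the same idea can be written as a complete program. This is only a sketch: `strictLines` is a hypothetical name, the argument handling is copied from the question, and `LC.toStrict` (available in bytestring 0.10 and later) is swapped in for `C.concat . LC.toChunks`, which does the same job.

```haskell
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Lazy.Char8 as LC
import System.Environment (getArgs)

-- Hypothetical helper: split lazily into lines, then copy each line
-- into its own strict ByteString. Only one line at a time has to fit
-- into a strict buffer, so no line may be longer than an Int allows.
strictLines :: LC.ByteString -> [C.ByteString]
strictLines = map LC.toStrict . LC.lines

-- Placeholder for the actual per-line processing, as in the question.
foobar :: [C.ByteString] -> IO ()
foobar _ = return ()

main :: IO ()
main = do
  args <- getArgs
  case args of
    [path] -> do
      content <- LC.readFile path   -- lazy: read chunk by chunk on demand
      foobar (strictLines content)
    _ -> putStrLn "usage: tmp FILE"
```

Whether this stays within a small memory footprint depends on how `foobar` consumes the list; holding on to all the lines at once would still need the whole file in memory.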

But if you can change your processing so that it works on lazy ByteStrings directly, that will probably be better overall.
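As an illustration of processing the lazy lines directly, the sketch below counts the lines and records the longest line length in one pass. `summarize` is a hypothetical stand-in for the real processing, which the question does not show; the counters are `Int64` so that they do not depend on the width of `Int`.

```haskell
import qualified Data.ByteString.Lazy.Char8 as LC
import Data.Int (Int64)
import Data.List (foldl')
import System.Environment (getArgs)

-- Hypothetical processing step: (number of lines, longest line length).
-- foldl' plus seq keeps both counters evaluated, so no thunks pile up
-- while the file is consumed chunk by chunk.
summarize :: LC.ByteString -> (Int64, Int64)
summarize = foldl' step (0, 0) . LC.lines
  where
    step (n, m) l =
      let n' = n + 1
          m' = max m (LC.length l)
      in n' `seq` m' `seq` (n', m')

main :: IO ()
main = do
  args <- getArgs
  case args of
    [path] -> do
      content <- LC.readFile path
      print (summarize content)
    _ -> putStrLn "usage: tmp FILE"
```

Because nothing keeps a reference to the start of `content`, chunks that have already been folded over can be garbage collected, so a program of this shape should run in roughly constant memory regardless of the file size.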

Regarding "parsing - How do I parse a 7GB file with Data.ByteString?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/10012106/
