
haskell - Building a histogram in Haskell is many times slower than in Python

Reposted. Author: 行者123. Last updated: 2023-12-03 15:27:14

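I'm planning to test naive Bayes classification. Part of that is building a histogram of the training data. The problem is that I'm using a large amount of training data: the haskell-cafe mailing list archive going back a few years, a folder containing more than 20k files.

Building the histogram takes a little over 2 minutes in Python and a little over 8 minutes in Haskell. I'm using Data.Map (insertWith'), enumerators, and Text. What else can I do to speed up the program?

Haskell: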

import qualified Data.Text as T
import qualified Data.Text.IO as TI
import System.Directory
import Control.Applicative
import Control.Monad (filterM, foldM)
import System.FilePath.Posix ((</>))
import qualified Data.Map as M
import Data.Map (Map)
import Data.List (foldl')
import Control.Exception.Base (bracket)
import System.IO (Handle, openFile, hClose, hSetEncoding, IOMode(ReadMode), latin1)
import qualified Data.Enumerator as E
import Data.Enumerator (($$), (>==>), (<==<), (==<<), (>>==), ($=), (=$))
import qualified Data.Enumerator.List as EL
import qualified Data.Enumerator.Text as ET



-- Open a file as Latin-1, run the action on its handle, and always close it.
withFile' :: (Handle -> IO c) -> FilePath -> IO c
withFile' f fp = do
  bracket
    (do
      h <- openFile fp ReadMode
      hSetEncoding h latin1
      return h)
    hClose
    f

-- Fold every regular file in directory c into a single word-count map.
buildClassHistogram c = do
  files <- filterM doesFileExist =<< map (c </>) <$> getDirectoryContents c
  foldM fileHistogram M.empty files

-- Add the word counts of one file to the map m, line by line.
fileHistogram m file = withFile' (\h -> E.run_ $ enumHist h) file
  where
    enumHist h = ET.enumHandle h $$ EL.fold (\m' l -> foldl' (\m'' w -> M.insertWith' (const (+1)) w 1 m'') m' $ T.words l) m

Python:
from os import listdir

root = "."  # placeholder: directory containing the training files
histogram = {}

for filename in listdir(root):
    filepath = root + "/" + filename
    # print(filepath)
    with open(filepath, "r", encoding="latin-1") as fp:
        for word in fp.read().split():
            if word in histogram:
                histogram[word] = histogram[word] + 1
            else:
                histogram[word] = 1

Edit: added imports.

Best answer

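You could try the imperative hash maps from the hashtables package: http://hackage.haskell.org/package/hashtables
I remember once getting a moderate speedup compared to Data.Map. I wouldn't expect anything spectacular, though.

Update

I simplified your Python code so that I could test it on a single large file (100 million lines):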

import sys
histogram={}
for word in sys.stdin.readlines():
    if word in histogram:
        histogram[word] = histogram[word]+1
    else:
        histogram[word] = 1
# Python 2 print statement; under Python 3 this would be print(histogram.get("the"))
print histogram.get("the")

It took 6.06 seconds.
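One detail of the loop above may be worth checking before comparing results across languages: `readlines()` keeps the trailing newline on every line, so the dictionary keys are strings such as `"the\n"`, and a lookup of `"the"` generally misses. The following is a minimal Python 3 sketch of that behaviour on a tiny in-memory input; `count_lines` and its `strip_newline` parameter are hypothetical names introduced here for illustration and are not part of the original benchmark.

```python
import io


def count_lines(stream, strip_newline=True):
    """Count how often each line occurs in a text stream.

    Hypothetical helper, not from the original answer. With
    strip_newline=False the keys keep their trailing newline,
    as they do in the readlines() loop above.
    """
    histogram = {}
    for line in stream:
        key = line.rstrip("\n") if strip_newline else line
        histogram[key] = histogram.get(key, 0) + 1
    return histogram


sample = io.StringIO("the\ncat\nthe\n")

raw = count_lines(sample, strip_newline=False)
sample.seek(0)  # rewind so the same input can be counted again
stripped = count_lines(sample)

print(raw.get("the"))       # None: the stored key is "the\n"
print(stripped.get("the"))  # 2
```

Stripping the newline adds a little work per line, so a timing taken with it is not directly comparable to the 6.06 seconds reported above.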

A Haskell translation using hashtables:
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as T
import qualified Data.HashTable.IO as HT
main = do
  ls <- T.lines `fmap` T.getContents
  h <- HT.new :: IO (HT.BasicHashTable T.ByteString Int)
  flip mapM_ ls $ \w -> do
    r <- HT.lookup h w
    case r of
      Nothing -> HT.insert h w (1::Int)
      Just c  -> HT.insert h w (c+1)
  HT.lookup h "the" >>= print

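Run with a large allocation area: histogram +RTS -A500M. It took 9.3 seconds, with GC at 2.4%. That is still about 1.5 times slower than Python, but not too bad.

According to the GHC user guide, you can change the RTS options at compile time: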

GHC lets you change the default RTS options for a program at compile time, using the -with-rtsopts flag (Section 4.12.6, “Options affecting linking”). A common use for this is to give your program a default heap and/or stack size that is greater than the default. For example, to set -H128m -K64m, link with -with-rtsopts="-H128m -K64m".

Regarding "haskell - Building a histogram in Haskell is many times slower than in Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/9772098/
