gpt4 book ai didi

haskell - 字节串直方图

转载 作者:行者123 更新时间:2023-12-02 17:29:01 28 4
gpt4 key购买 nike

给定一个(严格的)ByteString,计算它包含的每个可能字节的最有效方法是什么?

我发现 sort 应该作为计数排序来实现 - 但似乎没有办法访问关联的计数。我还看到有一个 count 函数,它计算给定字节出现的次数。这给了我以下选项:

  • map (\b -> count b str) [0x00 .. 0xFF]
  • map 长度。团体 。排序
  • 带有 fold* 和字节频率 IntMap 的东西。

哪一个可能会给我带来最好的性能?

最佳答案

dflemstr的基本思想当然是正确的,但由于您想要最佳性能,因此需要对 ByteString 以及计数数组使用未经检查的访问,例如

import Control.Monad.ST
import Data.Array.ST
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.Unboxed

import Data.Word

import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.ByteString.Unsafe

histogram :: ByteString -> UArray Word8 Int
histogram bs = runSTUArray $ do
hist <- newArray (0, 255) 0
let l = BS.length bs
update b = do
o <- unsafeRead hist b
unsafeWrite hist b (o+1)
loop i
| i < 0 = return hist
| otherwise = do
update $ fromIntegral (bs `unsafeIndex` i)
loop (i-1)
loop (l-1)

根据criterion(构建 200000 长 ByteString 的直方图),这会产生相当大的差异:

warming up
estimating clock resolution...
mean is 1.667687 us (320001 iterations)
found 3078 outliers among 319999 samples (1.0%)
1947 (0.6%) high severe
estimating cost of a clock call...
mean is 40.43765 ns (14 iterations)

benchmarking dflemstr
mean: 21.42852 ms, lb 21.05213 ms, ub 21.77954 ms, ci 0.950
std dev: 1.873897 ms, lb 1.719565 ms, ub 2.038779 ms, ci 0.950
variance introduced by outliers: 74.820%
variance is severely inflated by outliers

benchmarking unsafeIndex
mean: 312.6447 us, lb 304.3425 us, ub 321.0795 us, ci 0.950
std dev: 42.86886 us, lb 39.64363 us, ub 46.52899 us, ci 0.950
variance introduced by outliers: 88.342%
variance is severely inflated by outliers

(我将 dflemstr 的代码更改为也使用 runSTUArray 并返回 UArray Word8 Int 以获得 uiform 返回值,这不会对运行时间产生太大影响,不过。)

关于haskell - 字节串直方图,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/17124687/

28 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com