
string - Size explosion: file vs. string

Reposted. Author: 行者123. Updated: 2023-12-05 06:41:33

I have a 261 MB text file (xdebug output), and when I read it in, it takes up an additional 2 GB of dynamic space.

(defun stream->string (tmp-stream)
  (do ((line (read-line tmp-stream nil nil)
             (read-line tmp-stream nil nil))
       (lines nil))
      ((not line) (progn
                    (FORMAT T "COLLECTED~%")
                    (FORMAT nil "~{~a~^~%~}" (reverse lines))))
    (push line lines)))


(defparameter *test* nil)

(progn
  (setf *test* nil)
  (sb-ext:gc :full t)
  (room)
  (FORMAT T "----~%")
  (with-open-file (stream "/home/.../debugFiles/xdebug_1.xt")
    (room)
    (FORMAT T "----~%")
    (setf *test* (stream->string stream))
    (sb-ext:gc :full t)
    (room)
    (FORMAT T "----~%"))
  (sb-ext:gc :full t)
  (room))

Output:

Dynamic space usage is:   84,598,224 bytes.
Read-only space usage is: 5,856 bytes.
Static space usage is: 4,160 bytes.
Control stack usage is: 8,408 bytes.
Binding stack usage is: 1,072 bytes.
Control and binding stack usage is for the current thread only.
Garbage collection is currently enabled.

Breakdown for dynamic space:
20,841,808 bytes for 20,691 code objects.
15,989,600 bytes for 999,350 cons objects.
14,532,960 bytes for 118,880 simple-vector objects.
13,951,792 bytes for 168,301 instance objects.
5,994,864 bytes for 41,648 simple-character-string objects.
13,287,200 bytes for 215,901 other objects.
84,598,224 bytes for 1,564,771 dynamic objects (space total.)
----
Dynamic space usage is: 85,346,752 bytes.
Read-only space usage is: 5,856 bytes.
Static space usage is: 4,160 bytes.
Control stack usage is: 8,536 bytes.
Binding stack usage is: 1,072 bytes.
Control and binding stack usage is for the current thread only.
Garbage collection is currently enabled.

Breakdown for dynamic space:
20,842,928 bytes for 20,692 code objects.
16,125,008 bytes for 1,007,813 cons objects.
14,698,784 bytes for 120,834 simple-vector objects.
14,239,440 bytes for 171,411 instance objects.
6,014,144 bytes for 41,776 simple-character-string objects.
13,426,448 bytes for 219,723 other objects.
85,346,752 bytes for 1,582,249 dynamic objects (space total.)
----
COLLECTED
Dynamic space usage is: 2,557,851,296 bytes.
Read-only space usage is: 5,856 bytes.
Static space usage is: 4,160 bytes.
Control stack usage is: 8,536 bytes.
Binding stack usage is: 1,072 bytes.
Control and binding stack usage is for the current thread only.
Garbage collection is currently enabled.

Breakdown for dynamic space:
2,466,544,480 bytes for 817,255 simple-character-string objects.
91,306,816 bytes for 2,303,370 other objects.
2,557,851,296 bytes for 3,120,625 dynamic objects (space total.)
----
Dynamic space usage is: 1,131,069,056 bytes.
Read-only space usage is: 5,856 bytes.
Static space usage is: 4,160 bytes.
Control stack usage is: 8,360 bytes.
Binding stack usage is: 1,072 bytes.
Control and binding stack usage is for the current thread only.
Garbage collection is currently enabled.

Breakdown for dynamic space:
1,053,183,424 bytes for 41,547 simple-character-string objects.
77,885,632 bytes for 1,510,521 other objects.
1,131,069,056 bytes for 1,552,068 dynamic objects (space total.)

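I could understand a tripling in size (although even that would surprise me):

  1. the collection of lines
  2. the string object created by format
  3. the string held in *test*

But a tenfold increase is far too much.

How can this happen?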

Best answer

As Rainer pointed out, your problem is that SBCL represents a string as a vector of UTF-32 code points, which means every character takes 32 bits.
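The 32-bit claim can be checked roughly against the (room) figures in the question. The sketch below only does arithmetic on numbers copied from that output; the names *strings-before*, *strings-after*, *bytes-per-char* and estimated-characters are made up for illustration, and the 4 bytes per character is the assumption being tested.

```lisp
;; Figures copied from the (room) output in the question.
(defparameter *strings-before* 5994864
  "Bytes of simple-character-string objects before reading the file.")
(defparameter *strings-after* 1053183424
  "Bytes of simple-character-string objects after the final full GC.")

;; Assumption under test: each character occupies 4 bytes (32 bits).
(defparameter *bytes-per-char* 4)

(defun estimated-characters (before after bytes-per-char)
  "Estimate how many characters the retained string data holds."
  (floor (- after before) bytes-per-char))

(format t "~:d extra bytes, about ~:d characters~%"
        (- *strings-after* *strings-before*)
        (estimated-characters *strings-before* *strings-after* *bytes-per-char*))
```

The estimate comes to about 261.8 million characters, which lines up with a file of roughly 261 MB at one byte per character on disk. The larger figure reported before the last GC would be consistent with the individual lines and the joined string still being alive at the same time.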

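Ideally, the right way to handle files is to process them line by line rather than pulling them into memory all at once. If that is not an option for you, and you are sure that every character in your file is a base-char, i.e. an ASCII character, you can pass :element-type 'base-char to with-open-file and coerce the result of read-line to simple-base-string. That might look like: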

(defun file->lines (path)
  "Read PATH into a list of simple-base-string lines (all chars must be base-char)."
  (with-open-file (stream path :element-type 'base-char)
    (do ((line (read-line stream nil nil)
               (read-line stream nil nil))
         (lines nil))
        ((not line) (nreverse lines))
      (push (coerce line 'simple-base-string) lines))))
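For comparison, here is a minimal sketch of the line-by-line approach mentioned above, in which no line is kept once it has been handled. The function count-lines-and-chars is a made-up example; counting stands in for whatever per-line work is actually needed.

```lisp
(defun count-lines-and-chars (stream)
  "Return two values: the number of lines and of characters read from STREAM.
Each line becomes garbage as soon as it has been counted."
  (loop for line = (read-line stream nil nil)
        while line
        count t into line-count
        sum (length line) into char-count
        finally (return (values line-count char-count))))

;; Usage with an in-memory stream; a file stream from WITH-OPEN-FILE
;; would be passed in the same way.
(with-input-from-string (in (format nil "first~%second line~%third"))
  (multiple-value-bind (lines chars) (count-lines-and-chars in)
    (format t "~d lines, ~d characters~%" lines chars)))
```

Because only one line is referenced at a time, memory use should stay close to the size of the longest line rather than the size of the file.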

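Also note that if your file has many lines, the overhead of storing them in a linked list can be significant. If you can predict the number of lines in the file, preallocating one large vector and storing the lines in it may perform better, for example: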

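(defun file->lines (path number-of-lines)
  "Read PATH into a preallocated vector with room for NUMBER-OF-LINES lines."
  (with-open-file (stream path :element-type 'base-char)
    (do ((line (read-line stream nil nil)
               (read-line stream nil nil))
         (lines (make-array number-of-lines :fill-pointer 0)))
        ((not line) lines)
      ;; VECTOR-PUSH never grows the vector; it returns NIL once it is full.
      (vector-push (coerce line 'simple-base-string) lines))))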

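But make sure your number-of-lines is an overestimate. Growing the vector would mean slow reallocation and copying, which is why I wrote vector-push rather than vector-push-extend. Note that vector-push does not grow the vector: once it is full, vector-push returns NIL and any further lines are not stored.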
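The difference between the two operators is easy to see on a very small vector. The following is only an illustration; the variable names *fixed* and *growable* and the sample strings are made up.

```lisp
;; A deliberately tiny capacity so the overflow is easy to see.
(defparameter *fixed* (make-array 2 :fill-pointer 0))

(vector-push "line 1" *fixed*)   ; stored at index 0
(vector-push "line 2" *fixed*)   ; stored at index 1
(format t "third push returned ~s, length is ~d~%"
        (vector-push "line 3" *fixed*)   ; vector is full: returns NIL
        (length *fixed*))

;; VECTOR-PUSH-EXTEND grows the vector instead, but needs an adjustable one.
(defparameter *growable* (make-array 2 :fill-pointer 0 :adjustable t))

(dolist (line '("line 1" "line 2" "line 3"))
  (vector-push-extend line *growable*))
(format t "growable vector holds ~d lines~%" (length *growable*))
```

With vector-push an underestimate silently loses lines, so checking its return value is a cheap safeguard. vector-push-extend keeps every line at the cost of occasional reallocation.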

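If you cannot predict the number of lines, you are probably best off reading into a list and then coercing it to a vector at the end, for example: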

(defun file->lines (path)
  "Read PATH into a list of lines, then convert the list to a vector."
  (with-open-file (stream path :element-type 'base-char)
    (do ((line (read-line stream nil nil)
               (read-line stream nil nil))
         (lines nil))
        ((not line) (coerce (nreverse lines) 'vector))
      (push (coerce line 'simple-base-string) lines))))
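All of these variants rely on every character being a base-char. If that cannot be guaranteed, a guarded coercion along the following lines may be safer. The helper name maybe-base-string is made up for this sketch, and which characters count as base-char is implementation-dependent.

```lisp
(defun maybe-base-string (line)
  "Return LINE as a SIMPLE-BASE-STRING when every character is a BASE-CHAR;
otherwise return LINE unchanged."
  (if (every (lambda (ch) (typep ch 'base-char)) line)
      (coerce line 'simple-base-string)
      line))

;; Plain ASCII text qualifies, so the result is a simple-base-string.
(format t "~s~%" (typep (maybe-base-string "plain ascii") 'simple-base-string))
```

Lines that contain other characters then stay in the wider representation instead of causing an error, so the memory saving applies only to the lines that qualify.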

Regarding string - Size explosion: file vs. string, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40158743/
