I have never designed a custom file format before, but the project I am working on requires an entirely new, custom binary file format. I don't know the best practices for this (such as using identification bytes, a.k.a. "magic numbers") or how to implement them in Python. Here are the basic requirements:
- I have a dictionary that must be stored as metadata about the file (it will probably have to be serialized).
- The main body of the file will contain random bytes because they are the result of encryption.
Whenever such a file is provided, I have to read the metadata and get back my original Python dictionary, then retrieve the body (the random bytes) for decryption. Kindly provide a basic implementation, or at least an idea, for reading and writing such a file in Python, along with best practices for designing a custom file format.
Currently, as a temporary solution, I am serializing the dictionary using ormsgpack and prepending it to the output file, then using a custom delimiter b"\xFF\xFF\xFF\xFF" to separate the serialized metadata from the main body.
|‾‾‾‾‾‾‾‾‾‾‾|
| metadata  |
|___________|
|           |
| delimiter |
|___________|
|           |
|   body    |
|___________|
However, this might be an issue: if this particular byte sequence occurs somewhere in the serialized metadata, the reader will stop at that point, the metadata will be truncated, and deserialization will fail.
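To make the failure concrete, here is a small sketch. The metadata bytes are a hand-encoded MessagePack map (so no third-party package is needed), and the `nonce` key and the body contents are made up for illustration.

```python
DELIMITER = b"\xff\xff\xff\xff"

# Hand-encoded MessagePack for {"nonce": b"\xff\xff\xff\xff"}:
# 0x81 = map with one entry, 0xa5 = 5-character string,
# 0xc4 0x04 = binary value of 4 bytes. The key name is hypothetical.
serialized_metadata = b"\x81\xa5nonce\xc4\x04" + b"\xff\xff\xff\xff"
body = b"encrypted bytes"  # placeholder for the real ciphertext

file_contents = serialized_metadata + DELIMITER + body

# Naive parsing: cut at the first occurrence of the delimiter.
parsed_metadata, _, parsed_body = file_contents.partition(DELIMITER)

print(parsed_metadata == serialized_metadata)  # False: metadata was cut short
print(parsed_body == body)                     # False: body now starts with stray bytes
```

The split lands inside the binary value, so the metadata loses its last four bytes and the real delimiter ends up at the front of the body.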
Recommended answer
Using msgpack is a good idea.
Right after serializing, take the length of the serialized output and prepend it as a fixed-size field:
|‾‾‾‾‾‾‾‾‾‾‾|
|  length   |
| (8 bytes) |
|___________|
|           |
| metadata  |
|___________|
|           |
|   body    |
|___________|
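That's the way most protocols work. Decoding is then straightforward: read the 8-byte length, read exactly that many bytes of metadata, and treat everything after that as the body. No delimiter is needed, so the contents of the metadata and the body cannot confuse the parser.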
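A minimal sketch of this layout, using only the standard library. Several details are assumptions rather than part of the answer: `json` stands in for the serializer (ormsgpack's `packb`/`unpackb` could be swapped in), the length is stored as a big-endian unsigned 64-bit integer, and a hypothetical `MAGIC` prefix is added because the question asks about identification bytes.

```python
import io
import json
import struct

MAGIC = b"MYFMT\x00\x01"      # hypothetical identification bytes; choose your own
LENGTH = struct.Struct(">Q")  # 8-byte unsigned length, big-endian (assumed byte order)


def write_container(stream, metadata: dict, body: bytes) -> None:
    # json is a stand-in serializer; any function that returns bytes works here.
    packed = json.dumps(metadata).encode("utf-8")
    stream.write(MAGIC)
    stream.write(LENGTH.pack(len(packed)))
    stream.write(packed)
    stream.write(body)


def read_exact(stream, n: int) -> bytes:
    # Fail loudly instead of silently accepting a truncated file.
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("file is truncated")
    return data


def read_container(stream):
    if read_exact(stream, len(MAGIC)) != MAGIC:
        raise ValueError("not a file in this format")
    (meta_len,) = LENGTH.unpack(read_exact(stream, LENGTH.size))
    metadata = json.loads(read_exact(stream, meta_len).decode("utf-8"))
    body = stream.read()  # everything after the metadata is the body
    return metadata, body


# Usage with an in-memory stream; a file opened in "wb"/"rb" mode works the same way.
buffer = io.BytesIO()
write_container(buffer, {"cipher": "aes-gcm"}, b"\xff\xff\xff\xff ciphertext")
buffer.seek(0)
print(read_container(buffer))
```

Because the reader never searches for a marker, a body (or metadata) that happens to contain b"\xFF\xFF\xFF\xFF" is handled like any other bytes.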
If the metadata is potentially too big to hold in memory, you can write a placeholder length of 0, write the metadata in pieces, and then seek back to overwrite the placeholder with the real length.
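The seek-back approach could look roughly like this. The function name `write_streamed` and the idea of receiving the metadata as an iterable of byte chunks are assumptions made for the sketch, as is the big-endian byte order; the stream must be seekable (a regular file, not a pipe).

```python
import io
import struct

LENGTH = struct.Struct(">Q")  # 8-byte unsigned length, big-endian (assumed byte order)


def write_streamed(stream, metadata_chunks, body_chunks) -> int:
    """Write length + metadata + body without holding the metadata in memory."""
    length_pos = stream.tell()
    stream.write(LENGTH.pack(0))  # placeholder, patched below

    written = 0
    for chunk in metadata_chunks:
        stream.write(chunk)
        written += len(chunk)

    # Go back, overwrite the placeholder, then return to the end of the metadata.
    end_of_metadata = stream.tell()
    stream.seek(length_pos)
    stream.write(LENGTH.pack(written))
    stream.seek(end_of_metadata)

    for chunk in body_chunks:
        stream.write(chunk)
    return written


buffer = io.BytesIO()
write_streamed(buffer, [b"abc", b"de"], [b"\xff\xff", b"\xff\xff"])
print(buffer.getvalue())
```

One consequence of this design: if the writer crashes before the length is patched, the file is left with a length of 0, so a reader may want to treat that value as a sign of an incomplete file.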
A different option would be to escape the delimiter sequence, but that's more complex and won't be as useful in this scenario.
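For comparison, here is one possible escaping scheme (simple byte stuffing). The escape byte `0x1B` and the two substitution codes are arbitrary choices for this sketch, not something the answer prescribes. The idea is to rewrite the metadata so that it contains no 0xFF byte at all, which means the delimiter cannot appear inside it.

```python
DELIMITER = b"\xff\xff\xff\xff"
ESC = b"\x1b"  # hypothetical escape byte


def escape(data: bytes) -> bytes:
    # Escape the escape byte first, then every 0xFF, so the result contains no 0xFF.
    return data.replace(ESC, ESC + b"\x01").replace(b"\xff", ESC + b"\x02")


def unescape(data: bytes) -> bytes:
    # Undo the substitutions in the opposite order.
    return data.replace(ESC + b"\x02", b"\xff").replace(ESC + b"\x01", ESC)


metadata = b"\x81\xa5nonce\xc4\x04\xff\xff\xff\xff"  # contains the delimiter bytes
body = b"\xff\xff\xff\xff encrypted"                 # the body may contain them too

blob = escape(metadata) + DELIMITER + body

# The first delimiter in the blob is now guaranteed to be the real one.
escaped, _, parsed_body = blob.partition(DELIMITER)
print(unescape(escaped) == metadata, parsed_body == body)
```

This works, but every reader and writer has to agree on the escaping rules and the metadata grows whenever it contains the escaped bytes, which is why the length prefix is the simpler choice here.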