
go - Lower bound of the second UTF-8 byte in Go

Reposted. Author: IT王子. Updated: 2023-10-29 01:26:21

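I have recently been reading the Go source code for UTF-8 decoding. Apparently, when decoding UTF-8 bytes, a first byte with the value 224 (0xE0) maps to an accept range of [0xA0, 0xBF] for the second byte. https://github.com/golang/go/blob/master/src/unicode/utf8/utf8.go#L81 https://github.com/golang/go/blob/master/src/unicode/utf8/utf8.go#L94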

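If I understand the UTF-8 specification ( https://www.rfc-editor.org/rfc/rfc3629 ) correctly, the minimum value of every continuation byte is 0x80, or binary 1000 0000. Why is the minimum for the byte that follows a lead byte of 0xE0 higher, namely 0xA0 instead of 0x80?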
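The behaviour described in the question can be observed directly with the standard library. The following is a minimal sketch; the helper name validSeq is made up for illustration, and only utf8.Valid comes from the unicode/utf8 package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// validSeq (hypothetical helper) reports whether the three bytes form
// valid UTF-8 according to the standard library.
func validSeq(b0, b1, b2 byte) bool {
	return utf8.Valid([]byte{b0, b1, b2})
}

func main() {
	fmt.Println(validSeq(0xE0, 0x80, 0x80)) // false: second byte below 0xA0
	fmt.Println(validSeq(0xE0, 0x9F, 0xBF)) // false: second byte below 0xA0
	fmt.Println(validSeq(0xE0, 0xA0, 0x80)) // true: encodes U+0800
	fmt.Println(validSeq(0xE1, 0x80, 0x80)) // true: after 0xE1, 0x80 is accepted
}
```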

Best answer

The reason is to prevent so-called overlong sequences. Quoting the RFC:

Implementations of the decoding algorithm above MUST protect against decoding invalid sequences. For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding invalid sequences may have security consequences or cause other problems.

[...]

A particularly subtle form of this attack can be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters. For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00, but erroneously allow the illegal two-octet sequence C0 80 and interpret it as a NUL character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F. This last exploit has actually been used in a widespread virus attacking Web servers in 2001; thus, the security threat is very real.

Also note the syntax rules in Section 4, which explicitly allow only the bytes A0-BF after E0:

UTF8-2      = %xC2-DF UTF8-tail
UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
              %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
              %xF4 %x80-8F 2( UTF8-tail )
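The grammar above can be read as a table of allowed second-byte ranges per lead byte. The sketch below encodes that table and checks it against utf8.Valid for every lead byte and second byte combination. The names secondByteRange, seqLen, and agreesWithStdlib are invented for this example and do not appear in the Go source, which uses its own internal lookup tables.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// secondByteRange returns the range allowed for the byte that follows
// lead, per the RFC 3629 Section 4 grammar. ok is false when lead does
// not start a multi-byte sequence.
func secondByteRange(lead byte) (lo, hi byte, ok bool) {
	switch {
	case lead == 0xE0:
		return 0xA0, 0xBF, true // excludes overlong three-byte forms
	case lead == 0xED:
		return 0x80, 0x9F, true // excludes surrogates U+D800..U+DFFF
	case lead == 0xF0:
		return 0x90, 0xBF, true // excludes overlong four-byte forms
	case lead == 0xF4:
		return 0x80, 0x8F, true // excludes values above U+10FFFF
	case lead >= 0xC2 && lead <= 0xF4:
		return 0x80, 0xBF, true // ordinary UTF8-tail range
	}
	return 0, 0, false
}

// seqLen returns the sequence length implied by a lead byte in C2..F4.
func seqLen(lead byte) int {
	switch {
	case lead >= 0xF0:
		return 4
	case lead >= 0xE0:
		return 3
	default:
		return 2
	}
}

// agreesWithStdlib tries every lead/second byte pair, fills the rest of
// the sequence with 0x80 tail bytes, and compares with utf8.Valid.
func agreesWithStdlib() bool {
	for lead := 0xC2; lead <= 0xF4; lead++ {
		lo, hi, _ := secondByteRange(byte(lead))
		for second := 0; second <= 0xFF; second++ {
			seq := []byte{byte(lead), byte(second), 0x80, 0x80}[:seqLen(byte(lead))]
			want := byte(second) >= lo && byte(second) <= hi
			if utf8.Valid(seq) != want {
				return false
			}
		}
	}
	return true
}

func main() {
	lo, hi, _ := secondByteRange(0xE0)
	fmt.Printf("after E0: %X-%X\n", lo, hi) // after E0: A0-BF
	fmt.Println(agreesWithStdlib())         // true
}
```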

This question and answer, "go - Lower bound of the second UTF-8 byte in Go", correspond to a similar question found on Stack Overflow: https://stackoverflow.com/questions/47769542/
