
java - Wrong bytes from UTF-16 encoding

Reposted. Author: 搜寻专家. Updated: 2023-10-31 19:51:44

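I have the character '😭'. Its Unicode code point is U+1F62D, which is 11111011000101101 in binary. Now I want to convert this character to a byte array. My steps:

1) Since the binary representation is longer than 2 bytes, I use 4 bytes

XXXXXXXX XXXXXXX1 11110110 00101101

2) Now I replace every "X" with "0"

00000000 00000001 11110110 00101101

3) Each byte as a signed decimal value

00000000(0) 00000001(1) 11110110(-10) 00101101(45)

Here is my code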

    @Test
    public void testUtf16With4Bytes() throws Exception {
        assertThat(
            new String(
                new byte[]{0, 1, -10, 45},
                StandardCharsets.UTF_16BE
            ),
            is("😭")
        );
    }

Here is the output

    java.lang.AssertionError:
    Expected: is "😭"
    but: was ""

What am I missing?

Best answer

You missed that UTF-16 stores some characters as surrogate pairs:

In UTF-16, characters in ranges U+0000—U+D7FF and U+E000—U+FFFD are stored as a single 16 bits unit. Non-BMP characters (range U+10000—U+10FFFF) are stored as “surrogate pairs”, two 16 bits units: an high surrogate (in range U+D800—U+DBFF) followed by a low surrogate (in range U+DC00—U+DFFF). A lone surrogate character is invalid in UTF-16, surrogate characters are always written as pairs (high followed by low).
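The rule quoted above comes down to a little arithmetic: subtract 0x10000 from the code point, then put the top 10 bits of the result into the high surrogate and the bottom 10 bits into the low surrogate. Below is a minimal sketch of that calculation. The class name `SurrogatePairSketch` and the method `toSurrogatePair` are made up for illustration; `Character.highSurrogate` and `Character.lowSurrogate` are the JDK helpers (Java 7+) used as a cross-check.

```java
public class SurrogatePairSketch {

    // Hypothetical helper: splits a supplementary code point
    // (U+10000..U+10FFFF) into its high and low surrogate.
    static char[] toSurrogatePair(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(0x1F62D);
        // prints "U+D83D U+DE2D"
        System.out.printf("U+%04X U+%04X%n", (int) pair[0], (int) pair[1]);

        // Cross-check against the JDK's own helpers; both print "true".
        System.out.println(pair[0] == Character.highSurrogate(0x1F62D));
        System.out.println(pair[1] == Character.lowSurrogate(0x1F62D));
    }
}
```

In real code there is usually no need to do this by hand: `Character.toChars(codePoint)` returns the same `char[]`.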

The 😭 character is U+1F62D, so it falls in the U+10000 to U+10FFFF range. It is represented by the surrogate pair U+D83D U+DE2D, which as a byte[] is [-40, 61, -34, 45].
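To check these bytes, one option is to let the JDK do the encoding and compare. The sketch below uses plain `main` output rather than the JUnit/Hamcrest setup from the question. The class name `Utf16BytesSketch` and its two helper methods are hypothetical names chosen for illustration. It also decodes the question's original bytes: under UTF-16BE, `{0, 1, -10, 45}` is read as two separate 16-bit units, U+0001 (a control character) and U+F62D (a private-use code point), which is presumably why the failed assertion appeared to show an empty string.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16BytesSketch {

    // Hypothetical helper: encodes one code point as big-endian UTF-16 bytes.
    static byte[] encodeUtf16Be(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_16BE);
    }

    // Hypothetical helper: decodes big-endian UTF-16 bytes into a String.
    static String decodeUtf16Be(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // Encoding U+1F62D yields the surrogate pair D83D DE2D as four bytes.
        System.out.println(Arrays.toString(encodeUtf16Be(0x1F62D))); // [-40, 61, -34, 45]

        // Decoding those bytes gives one code point stored in two chars.
        String ok = decodeUtf16Be(new byte[] {-40, 61, -34, 45});
        System.out.println(ok.codePointAt(0) == 0x1F62D);            // true
        System.out.println(ok.length());                             // 2

        // The bytes from the question decode to two unrelated BMP characters.
        String wrong = decodeUtf16Be(new byte[] {0, 1, -10, 45});
        System.out.println(wrong.equals("\u0001\uF62D"));            // true
    }
}
```

The four bytes built in the question, 00 01 F6 2D, are the code point written out as a plain 32-bit big-endian integer, which is what UTF-32BE looks like, not UTF-16BE.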

Regarding "java - Wrong bytes from UTF-16 encoding", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55353274/
