Java String.getBytes(charset) and new String(bytes, charset) with two different charsets

Reposted. Author: 塔克拉玛干. Updated: 2023-11-02 19:06:02

As far as I know, in String.getBytes(charset), the charset argument means that the method returns the bytes of the string encoded in the given charset.

In new String(bytes, charset), the second argument, charset, means that the constructor decodes the bytes using the given charset and returns the decoded result.

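Given the above, my understanding is that the charset arguments of the two methods must be the same for new String(bytes, charset) to return the correct string. (I suspect this is where I am missing something.)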

I have a wrongly decoded string, and I tested the following code with it:

import java.io.UnsupportedEncodingException;

// Wrongly decoded form of "테스트" (Korean for "test")
String originalStr = "Å×½ºÆ®";
String[] charSet = {"utf-8", "euc-kr", "ksc5601", "iso-8859-1", "x-windows-949"};

// Try every (encoding charset, decoding charset) pair
for (int i = 0; i < charSet.length; i++) {
    for (int j = 0; j < charSet.length; j++) {
        try {
            System.out.println("[" + charSet[i] + "," + charSet[j] + "] = "
                    + new String(originalStr.getBytes(charSet[i]), charSet[j]));
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    }
}

The output is:

[utf-8,utf-8] = Å×½ºÆ®
[utf-8,euc-kr] = ��쩍쨘�짰
[utf-8,ksc5601] = ��쩍쨘�짰
[utf-8,iso-8859-1] = Å×½ºÆ®
[utf-8,x-windows-949] = 횇횞쩍쨘횈짰
[euc-kr,utf-8] = ?����������
[euc-kr,euc-kr] = ?×½ºÆ®
[euc-kr,ksc5601] = ?×½ºÆ®
[euc-kr,iso-8859-1] = ?¡¿¨ö¨¬¨¡¢ç
[euc-kr,x-windows-949] = ?×½ºÆ®
[ksc5601,utf-8] = ?����������
[ksc5601,euc-kr] = ?×½ºÆ®
[ksc5601,ksc5601] = ?×½ºÆ®
[ksc5601,iso-8859-1] = ?¡¿¨ö¨¬¨¡¢ç
[ksc5601,x-windows-949] = ?×½ºÆ®
[iso-8859-1,utf-8] = �׽�Ʈ
[iso-8859-1,euc-kr] = 테스트
[iso-8859-1,ksc5601] = 테스트
[iso-8859-1,iso-8859-1] = Å×½ºÆ®
[iso-8859-1,x-windows-949] = 테스트
[x-windows-949,utf-8] = ?����������
[x-windows-949,euc-kr] = ?×½ºÆ®
[x-windows-949,ksc5601] = ?×½ºÆ®
[x-windows-949,iso-8859-1] = ?¡¿¨ö¨¬¨¡¢ç
[x-windows-949,x-windows-949] = ?×½ºÆ®

As you can see, I found a way to get the original string:

[iso-8859-1,euc-kr] = 테스트  
[iso-8859-1,ksc5601] = 테스트
[iso-8859-1,x-windows-949] = 테스트

How is this possible? How can the string be encoded and decoded correctly with different character sets?

Best answer

Given the above, my understanding is that the charset arguments of the two methods must be the same for new String(bytes, charset) to return the correct string.

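That is what you should aim for in order to write correct code. But it does not imply that every wrong operation will always produce a wrong result. A simple example is a string consisting of ASCII letters only. Many encodings produce the same byte sequence for such a string, so a test that uses only such strings is not sufficient to spot encoding-related errors.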
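The ASCII case can be checked directly. The following is a minimal sketch, not part of the original answer; the class and method names are made up for illustration, and it assumes the JDK in use ships the EUC-KR and x-windows-949 charsets from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiBytesDemo {
    // Hypothetical helper: true if every named charset encodes the text
    // to exactly the same bytes as UTF-8 does.
    static boolean sameBytesEverywhere(String text, String... charsetNames) {
        byte[] reference = text.getBytes(StandardCharsets.UTF_8);
        for (String name : charsetNames) {
            byte[] encoded = text.getBytes(Charset.forName(name));
            if (!Arrays.equals(reference, encoded)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] names = {"ISO-8859-1", "EUC-KR", "x-windows-949"};
        // ASCII-only text: all of these encodings agree.
        System.out.println(sameBytesEverywhere("test", names)); // true
        // "테스트" written with Unicode escapes: the encodings disagree.
        System.out.println(sameBytesEverywhere("\uD14C\uC2A4\uD2B8", names)); // false
    }
}
```

A test suite that only feeds in strings like "test" would therefore pass even when the encoding and decoding charsets are mixed up.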

As you can see, I found a way to get the original string:

[iso-8859-1,euc-kr] = 테스트  
[iso-8859-1,ksc5601] = 테스트
[iso-8859-1,x-windows-949] = 테스트

How is this possible? How can the string be encoded and decoded correctly with different character sets?

Well, when I execute

System.out.println(Charset.forName("euc-kr") == Charset.forName("ksc5601"));

on my machine, it prints true. Or, if I execute

System.out.println(Charset.forName("euc-kr").aliases());

it prints

[ksc5601-1987, csEUCKR, ksc5601_1987, ksc5601, 5601, euc_kr, ksc_5601, ks_c_5601-1987, euckr]

So for euc-kr vs. ksc5601, the answer is simple: these are different names for the same character encoding.
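As for the iso-8859-1 half of those pairs, a plausible reading of the question's output is that the garbled string came from EUC-KR bytes that were decoded as ISO-8859-1. ISO-8859-1 maps every byte value to the code point with the same number, so encoding the garbled string with it gives back the original bytes unchanged. The sketch below illustrates this; the class and helper names are hypothetical, and the string literals use Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRepairDemo {
    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Hypothetical helper: undo a wrong ISO-8859-1 decode by recovering
    // the raw bytes and decoding them with the charset they were written in.
    static String repair(String garbled, Charset actual) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        // The garbled string from the question, written with Unicode escapes.
        String garbled = "\u00C5\u00D7\u00BD\u00BA\u00C6\u00AE";
        System.out.println(hex(garbled.getBytes(StandardCharsets.ISO_8859_1))); // C5 D7 BD BA C6 AE
        // Decoding those six bytes as EUC-KR yields the three Hangul syllables.
        System.out.println(repair(garbled, Charset.forName("EUC-KR")));
    }
}
```

This only works because ISO-8859-1 is lossless for arbitrary bytes; the same trick generally fails for a wrong decode through UTF-8, where invalid sequences are replaced and the original bytes are lost.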

For x-windows-949, I had to resort to Wikipedia:

Unified Hangul Code (UHC), or Extended Wansung, also known under Microsoft Windows as Code Page 949 (Windows-949, MS949 or ambiguously CP949), is the Microsoft Windows code page for the Korean language. It is an extension of Wansung Code (KS C 5601:1987, encoded as EUC-KR) to include all 11172 Hangul syllables present in Johab (KS C 5601:1992 annex 3).

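So it is an extension of ksc5601 that leads to the same result as long as you do not use any character affected by the extension (think of the ASCII example above).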
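To see where the two encodings diverge, one can compare a syllable that both cover with one that only the extension covers. This is a sketch under an assumption: the syllable U+B620 (똠) is, to my knowledge, one of the Hangul syllables missing from KS C 5601:1987 and present only in the Windows-949 extension. The class and method names are illustrative.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class Cp949ExtensionDemo {
    // True if EUC-KR and x-windows-949 encode the text to identical bytes.
    static boolean sameBytes(String text) {
        byte[] eucKr = text.getBytes(Charset.forName("EUC-KR"));
        byte[] cp949 = text.getBytes(Charset.forName("x-windows-949"));
        return Arrays.equals(eucKr, cp949);
    }

    // True if the named charset has a mapping for every character of the text.
    static boolean canEncode(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String common = "\uD14C\uC2A4\uD2B8"; // the question's test word
        String extended = "\uB620";           // assumed to exist only in the extension

        System.out.println(sameBytes(common)); // true
        // Under the assumption above, EUC-KR substitutes a replacement byte
        // here, so the two byte sequences are expected to differ.
        System.out.println(sameBytes(extended));
        System.out.println(canEncode("EUC-KR", extended));
        System.out.println(canEncode("x-windows-949", extended));
    }
}
```

The question's test word falls entirely inside the common subset, which is why euc-kr, ksc5601 and x-windows-949 all gave the same output there.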

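In general, this does not invalidate your premise. A correct result is only guaranteed when both sides use the same encoding. It just means that testing the code is much harder, because it requires sufficient test input data to spot errors. For example, a common error in the Western world is to confuse iso-latin-1 (ISO 8859-1) with Windows code page 1252, which may go unnoticed with simple text.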
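The ISO 8859-1 vs. code page 1252 confusion can be reproduced in a few lines. The following is an illustrative sketch with hypothetical names, assuming the JDK provides the windows-1252 charset: accented Latin letters survive the mismatch, while the euro sign, which code page 1252 places at byte 0x80, does not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1VsCp1252Demo {
    // Encodes with one charset and decodes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        Charset latin1 = StandardCharsets.ISO_8859_1;

        String simple = "caf\u00E9";  // e with acute accent: byte 0xE9 in both encodings
        String tricky = "\u20AC 5";   // euro sign: byte 0x80 in windows-1252 only

        // The mismatch goes unnoticed for the simple text.
        System.out.println(simple.equals(roundTrip(simple, cp1252, latin1))); // true
        // ISO-8859-1 decodes byte 0x80 as the control character U+0080.
        System.out.println(tricky.equals(roundTrip(tricky, cp1252, latin1))); // false
    }
}
```

Test data for encoding-sensitive code should therefore include characters from the ranges where the candidate encodings are known to differ.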

Regarding Java String.getBytes(charset) and new String(bytes, charset) with two different charsets, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55176094/
