
java - Why is char actually a NumericType in Java, rather than a SymbolicType or a String?

Reposted. Author: 行者123. Updated: 2023-11-29 05:01:09

In the Java grammar there is a NumericType, which consists of IntegralType and FloatingPointType. IntegralType is one of byte, short, int, long, or char.

At the same time, I can assign a single character to a char variable.

char c1 = 10;
char c2 = 'c';

So here is my question: why is char a numeric type, and how does the JVM convert 'c' to a number?

Best answer

Why char in numeric type...

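Representing characters as numbers that index into a table is the standard way computers handle text. It is called character encoding, and it has a long history that goes back at least to the telegraph era. For a long time personal computers used ASCII (a 7-bit encoding = 127 characters plus nul), and then "extended ASCII" (various 8-bit encodings in which the "upper" 128 characters had several competing interpretations). Because of their limited character sets, these are now outdated and suitable only for niche uses. Before personal computers, EBCDIC and its predecessor BCD were popular. Modern systems use Unicode (usually by storing one or more of its transformations, such as UTF-8 or UTF-16) or one of various standardized "code pages", such as Windows-1252 or ISO-8859-1.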
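To make the idea of "the same character, different stored numbers" concrete, here is a minimal sketch that encodes 'é' under three of the encodings mentioned above. The class name EncodingDemo and the helper hex are made up for this illustration; the charset constants come from java.nio.charset.StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: render a byte array as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask to treat the byte as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent), written as an escape so the source file encoding does not matter.
        String s = "\u00E9";
        System.out.println("ISO-8859-1: " + hex(s.getBytes(StandardCharsets.ISO_8859_1))); // E9
        System.out.println("UTF-8:      " + hex(s.getBytes(StandardCharsets.UTF_8)));      // C3 A9
        System.out.println("UTF-16BE:   " + hex(s.getBytes(StandardCharsets.UTF_16BE)));   // 00 E9
    }
}
```

The character is the same in all three cases; only the table and the byte layout used to store its number differ.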

...and how JVM convert 'c' to a number?

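Java's numeric char values are mapped to and from characters via Unicode (this is how the JVM knows that 'c' has the value 0x0063 and 'é' has the value 0x00E9). Specifically, a char value maps to a Unicode code point (strictly speaking, to a UTF-16 code unit, which is the code point itself for any character in the Basic Multilingual Plane), and strings are sequences of code points.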
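A minimal sketch of that mapping, reusing the two assignments from the question. The class name CharAsNumber and the helper codeOf are hypothetical and exist only for this illustration.

```java
public class CharAsNumber {
    // Hypothetical helper: widening a char to an int exposes its numeric value.
    static int codeOf(char c) {
        return c; // implicit widening conversion, no cast needed
    }

    public static void main(String[] args) {
        char c1 = 10;   // numeric literal: the line-feed control character
        char c2 = 'c';  // character literal: stored as the number 0x0063

        char next = (char) (c2 + 1); // arithmetic promotes to int, so narrowing back needs a cast

        System.out.println(codeOf(c2));                      // 99
        System.out.println(Integer.toHexString(codeOf(c2))); // 63
        System.out.println(next);                            // d
        System.out.println(c1 == '\n');                      // true
        System.out.println(codeOf('\u00E9') == 0x00E9);      // true
    }
}
```

Both forms of literal end up as the same kind of value, an unsigned 16-bit integer, which is why char sits among the integral types.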

There is a lot of material about the char data type, including why its values are 16 bits wide, in the JavaDoc of the Character class:

Unicode Character Representations

The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode Standard.)

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
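(Illustration, not part of the quoted JavaDoc.) The surrogate-pair representation described above can be sketched as follows. The class name SurrogateDemo and the method encode are hypothetical; encode assumes its argument is a supplementary code point (above U+FFFF) and is compared against the real library method Character.toChars.

```java
import java.util.Arrays;

public class SurrogateDemo {
    // Hypothetical helper: UTF-16 encode a supplementary code point by hand.
    // Assumes 0x10000 <= codePoint <= 0x10FFFF; no validation is done.
    static char[] encode(int codePoint) {
        int offset = codePoint - 0x10000;                 // leaves a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));     // top 10 bits -> high surrogate
        char low = (char) (0xDC00 + (offset & 0x3FF));    // bottom 10 bits -> low surrogate
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int codePoint = 0x2F81A; // the CJK ideograph used as an example in the JavaDoc

        char[] manual = encode(codePoint);
        char[] library = Character.toChars(codePoint);

        System.out.println(Integer.toHexString(manual[0]));   // d87e
        System.out.println(Integer.toHexString(manual[1]));   // dc1a
        System.out.println(Arrays.equals(manual, library));   // true
        System.out.println(Character.toCodePoint(manual[0], manual[1]) == codePoint); // true
    }
}
```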

A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero. Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:

  • The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.

  • The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
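(Illustration, not part of the quoted JavaDoc.) The difference between the two overload families can be seen by counting letters in a string both ways. The class name IsLetterDemo and both counting methods are hypothetical names chosen for this sketch.

```java
public class IsLetterDemo {
    // Inspects each 16-bit char on its own, so surrogate halves are never seen as letters.
    static int lettersPerChar(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) { // char overload
                count++;
            }
        }
        return count;
    }

    // Inspects whole code points, stepping over both halves of a surrogate pair at once.
    static int lettersPerCodePoint(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) { // int overload
                count++;
            }
            i += Character.charCount(cp); // 1 for BMP characters, 2 for supplementary ones
        }
        return count;
    }

    public static void main(String[] args) {
        // 'c' followed by the supplementary ideograph U+2F81A
        String s = "c" + new String(Character.toChars(0x2F81A));
        System.out.println(lettersPerChar(s));      // 1
        System.out.println(lettersPerCodePoint(s)); // 2
    }
}
```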

In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding. For more information on Unicode terminology, refer to the Unicode Glossary.
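(Illustration, not part of the quoted JavaDoc.) The code unit versus code point terminology shows up directly in the String API: length() counts code units, while codePointCount counts code points. The class name CodeUnitsVsCodePoints and the helper describe are hypothetical.

```java
public class CodeUnitsVsCodePoints {
    // Hypothetical helper: list the code points of a string in U+n notation.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int cp : s.codePoints().toArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'c' followed by the supplementary ideograph U+2F81A
        String s = "c" + new String(Character.toChars(0x2F81A));

        System.out.println(s.length());                      // 3 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 2 (Unicode code points)
        System.out.println(describe(s));                     // U+0063 U+2F81A
    }
}
```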

Regarding "java - Why is char actually a NumericType in Java, rather than a SymbolicType or a String?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32050815/
