
html - How can I determine the character set of HTML content from the HTTP headers?

Reposted. Author: 行者123. Updated: 2023-12-02 20:46:13

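I know that the charset= parameter of the HTTP Content-Type header can be used to determine the character set of HTML content. But if that parameter is missing from the Content-Type header, how can I determine the character set of the HTML content?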

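I also know that a tag such as

<meta charset="utf-8">

can be used in HTML to specify the character set. But we can only get that tag after parsing the HTML, and parsing the HTML requires knowing the character set first.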

Best answer

In the absence of an explicit charset attribute in the Content-Type header, different media types sent over different transports have different default character sets.

For example, to show just a few of the definitions:

RFC 2046, Section 4.1.2, of the MIME specification says:

Unlike some other parameter values, the values of the charset parameter are NOT case sensitive. The default character set, which must be assumed in the absence of a charset parameter, is US-ASCII.

RFC 2616, Section 3.7.1, of the HTTP protocol specification says:

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.

This was later reversed by RFC 7231, Appendix B:

The default charset of ISO-8859-1 for text media types has been removed; the default is now whatever the media type definition says. Likewise, special treatment of ISO-8859-1 has been removed from the Accept-Charset header field. (Section 3.1.1.3 and Section 5.3.3).

RFC 3023, Sections 3.1, 3.3, 3.6, and 8.5, of the XML media types specification say:

Conformant with [RFC2046], if a text/xml entity is received with the charset parameter omitted, MIME processors and XML processors MUST use the default charset value of "us-ascii"[ASCII]. In cases where the XML MIME entity is transmitted via HTTP, the default charset value is still "us-ascii". (Note: There is an inconsistency between this specification and HTTP/1.1, which uses ISO-8859-1[ISO8859] as the default for a historical reason. Since XML is a new format, a new default should be chosen for better I18N. US-ASCII was chosen, since it is the intersection of UTF-8 and ISO-8859-1 and since it is already used by MIME.)

The charset parameter of text/xml-external-parsed-entity is handled the same as that of text/xml as described in Section 3.1.

The following list applies to text/xml, text/xml-external-parsed-entity, and XML-based media types under the top-level type "text" that define the charset parameter according to this specification:

...

  • If the charset parameter is not specified, the default is "us-ascii". The default of "iso-8859-1" in HTTP is explicitly overridden.

This example shows text/xml with the charset parameter omitted. In this case, MIME and XML processors MUST assume the charset is "us-ascii", the default charset value for text media types specified in [RFC2046]. The default of "us-ascii" holds even if the text/xml entity is transported using HTTP.

Omitting the charset parameter is NOT RECOMMENDED for text/xml. For example, even if the contents of the XML MIME entity are UTF-16 or UTF-8, or the XML MIME entity has an explicit encoding declaration, XML and MIME processors MUST assume the charset is "us-ascii".

RFC 7159, Sections 8.1 and 11, of the JSON specification say:

JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32).

Implementations MUST NOT add a byte order mark to the beginning of a JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.

Note: No "charset" parameter is defined for this registration. Adding one really has no effect on compliant recipients.

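So, in general, if you want to know the character set used by a given resource, and that character set is not expressed through external means such as the charset attribute of the Content-Type header, you have to determine what type of data you are dealing with and then work out its character set according to the rules laid out in the specification for that data type.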
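As a rough illustration of that lookup, here is a minimal Python sketch. The function name `resolve_charset` and the `DEFAULT_CHARSETS` table are hypothetical, and the table only encodes the defaults quoted from the RFCs above; it is not a complete registry. A result of `None` means the content itself has to be inspected.

```python
from email.message import Message

# Hypothetical table: defaults taken from the RFC passages quoted above.
# It is illustrative only, not an exhaustive registry of media types.
DEFAULT_CHARSETS = {
    "text/xml": "us-ascii",        # RFC 3023
    "application/json": "utf-8",   # RFC 7159
}


def resolve_charset(content_type_header):
    """Return the charset named in the header, else a per-type default.

    Returns None when neither is available, meaning the payload
    itself has to be sniffed (for example HTML or XML content).
    """
    msg = Message()
    msg["Content-Type"] = content_type_header
    explicit = msg.get_content_charset()  # lowercased, or None if absent
    if explicit:
        return explicit
    return DEFAULT_CHARSETS.get(msg.get_content_type())


if __name__ == "__main__":
    for header in ("text/html; charset=UTF-8", "text/xml", "text/html"):
        print(header, "->", resolve_charset(header))
```

The standard library's `email.message.Message` is used here only because it already knows how to parse MIME-style header parameters; any header parser would do.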

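In your case, you are dealing with HTML over HTTP, so the RFC 2616 rules apply to you. The HTML 5 spec, Section 8.2.2.2, defines a very detailed algorithm for determining the character set of HTML when no charset attribute is specified in the Content-Type header. That algorithm first checks for the presence of a UTF BOM; if there is none, it assumes the HTML is 8-bit and parses it looking for any <meta> tags that contain a charset or language declaration.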
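A heavily simplified Python sketch of that idea follows. It is not the algorithm from the HTML spec, which handles many more cases; the function name `sniff_html_charset`, the regular expression, and the 1024-byte prescan window are assumptions made for illustration. The point it shows is that the BOM and the `<meta>` declaration are made of ASCII-compatible bytes, so they can be found before the real character set is known.

```python
import codecs
import re

# BOMs checked before any markup is inspected.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Matches <meta charset="..."> as well as the charset=... part of
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
# This is a simplification; the real prescan is far more careful.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_html_charset(data, prescan_bytes=1024):
    """Guess the charset of raw HTML bytes, or return None if unknown."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # Treat the document as 8-bit and scan only its first bytes.
    match = _META_CHARSET.search(data[:prescan_bytes])
    if match:
        return match.group(1).decode("ascii").lower()
    return None


if __name__ == "__main__":
    print(sniff_html_charset(b'<html><head><meta charset="utf-8"></head>'))
```

When this returns `None`, a caller would still need a fallback of its own choosing, such as a locale-dependent default.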

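The XML 1.0 specification, Appendix F, also defines an algorithm that makes it easy to determine the character set used by the XML prolog, so that you can read its encoding attribute (if present) to determine the character set of the rest of the XML.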
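The same approach can be sketched for XML in Python. This is a reduced version in the spirit of Appendix F, not a full implementation: it ignores EBCDIC and UTF-32 without a BOM, and the function name `sniff_xml_encoding` and the 200-byte window are hypothetical. It applies only when no external encoding information, such as a charset parameter, is available.

```python
import codecs
import re

# Encoding declaration inside the XML declaration, e.g.
# <?xml version="1.0" encoding="ISO-8859-1"?>
_XML_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data):
    """Guess the encoding of raw XML bytes from a BOM or the prolog."""
    # UTF-32 is checked first: its little-endian BOM starts with the
    # same two bytes as the UTF-16 little-endian BOM.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM: "<?" interleaved with NUL bytes indicates UTF-16.
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    # ASCII-compatible prolog: read the declared encoding, if any.
    match = _XML_DECL.match(data[:200])
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no encoding declaration: XML 1.0 requires UTF-8.
    return "utf-8"


if __name__ == "__main__":
    print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
```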

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/44344533/
