html - How can I determine the character set of HTML content from the HTTP headers?

Reposted by: 行者123 · Updated: 2023-12-02 20:46:13

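I know that the charset= parameter of the HTTP Content-Type header can be used to determine the character set of HTML content. But if that parameter is missing from the Content-Type header, how can I find out the character set of the HTML content?

I also know that a tag such as

<meta charset="utf-8">

can be used inside the HTML to specify the character set. But we can only read that tag after parsing the HTML, and parsing the HTML requires knowing the character set first.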

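Best answer

In the absence of an explicit charset attribute in the Content-Type header, different media types sent over different transports have different default charsets.

For example, to show just a few definitions:

RFC 2046, Section 4.1.2, the MIME specification, says: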

Unlike some other parameter values, the values of the charset parameter are NOT case sensitive. The default character set, which must be assumed in the absence of a charset parameter, is US-ASCII.

RFC 2616, Section 3.7.1, the HTTP protocol specification, says:

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.

This was later reversed by RFC 7231, Appendix B:

The default charset of ISO-8859-1 for text media types has been removed; the default is now whatever the media type definition says. Likewise, special treatment of ISO-8859-1 has been removed from the Accept-Charset header field. (Section 3.1.1.3 and Section 5.3.3).

RFC 3023, Sections 3.1, 3.3, 3.6, and 8.5, the XML media types specification, says:

Conformant with [RFC2046], if a text/xml entity is received with the charset parameter omitted, MIME processors and XML processors MUST use the default charset value of "us-ascii"[ASCII]. In cases where the XML MIME entity is transmitted via HTTP, the default charset value is still "us-ascii". (Note: There is an inconsistency between this specification and HTTP/1.1, which uses ISO-8859-1[ISO8859] as the default for a historical reason. Since XML is a new format, a new default should be chosen for better I18N. US-ASCII was chosen, since it is the intersection of UTF-8 and ISO-8859-1 and since it is already used by MIME.)

The charset parameter of text/xml-external-parsed-entity is handled the same as that of text/xml as described in Section 3.1.

The following list applies to text/xml, text/xml-external-parsed-entity, and XML-based media types under the top-level type "text" that define the charset parameter according to this specification:

...

If the charset parameter is not specified, the default is "us-ascii". The default of "iso-8859-1" in HTTP is explicitly overridden.

This example shows text/xml with the charset parameter omitted. In this case, MIME and XML processors MUST assume the charset is "us-ascii", the default charset value for text media types specified in [RFC2046]. The default of "us-ascii" holds even if the text/xml entity is transported using HTTP.

Omitting the charset parameter is NOT RECOMMENDED for text/xml. For example, even if the contents of the XML MIME entity are UTF-16 or UTF-8, or the XML MIME entity has an explicit encoding declaration, XML and MIME processors MUST assume the charset is "us-ascii".

RFC 7159, Sections 8.1 and 11, the JSON specification, says:

JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32).

Implementations MUST NOT add a byte order mark to the beginning of a JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.

Note: No "charset" parameter is defined for this registration. Adding one really has no effect on compliant recipients.
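As a rough illustration of the JSON rules quoted above, the following Python sketch decodes a JSON payload as UTF-8 and tolerates a leading byte order mark instead of rejecting it. The function name `decode_json_bytes` is hypothetical, and the sketch assumes the payload is UTF-8; it does not try to handle UTF-16 or UTF-32.

```python
import codecs
import json


def decode_json_bytes(raw: bytes):
    """Decode a JSON payload, assuming UTF-8 (the default named in RFC 7159)."""
    # RFC 7159 says parsers MAY ignore a BOM rather than treat it as an error,
    # so a leading UTF-8 BOM is dropped here before decoding.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return json.loads(raw.decode("utf-8"))
```

Because no charset parameter is defined for the JSON media type, this sketch never looks at the Content-Type header at all.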

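So, in general, if you want to know the charset used by a given resource, and that charset is not expressed through external means such as the charset attribute of the Content-Type header, you have to determine what type of data you are dealing with and then work out its charset according to what the specification for that data type says.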
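A minimal Python sketch of that lookup might read the explicit charset parameter first and fall back to a per-media-type default. The function name `charset_for` and the `DEFAULT_CHARSETS` table are assumptions made for illustration; the table only covers the defaults quoted in this answer and is not exhaustive.

```python
from email.message import Message

# Illustrative defaults taken from the specifications quoted in this answer.
DEFAULT_CHARSETS = {
    "text/xml": "us-ascii",                         # RFC 3023
    "text/xml-external-parsed-entity": "us-ascii",  # RFC 3023
    "application/json": "utf-8",                    # RFC 7159
}


def charset_for(content_type: str):
    """Return the explicit charset, else a spec default, else None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset parameter lower-cased,
    # or None when the parameter is absent.
    explicit = msg.get_content_charset()
    if explicit:
        return explicit
    # No explicit parameter: fall back to the media type's own default.
    # None means the content itself has to be inspected (as with HTML).
    return DEFAULT_CHARSETS.get(msg.get_content_type())
```

For instance, `charset_for("text/html; charset=UTF-8")` yields `"utf-8"`, while `charset_for("text/html")` yields `None`, signalling that the HTML bytes themselves need to be examined.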

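In your case, you are dealing with HTML over HTTP, so the RFC 2616 rules apply to you. The HTML 5 spec, Section 8.2.2.2, defines a very detailed algorithm for determining the charset of HTML when no charset attribute is specified in the Content-Type header. The algorithm first checks for the presence of a UTF BOM; if none is present, it assumes the HTML is 8-bit and parses it looking for any <meta> tags that contain charset or language declarations.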
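The two steps described above (BOM check, then a scan for a `<meta>` declaration) can be approximated in Python as follows. This is a heavily simplified sketch, not the full algorithm from the HTML specification: the function name `sniff_html_charset`, the regular expression, and the 1024-byte `prescan_limit` are assumptions made for illustration.

```python
import codecs
import re

# Byte order marks checked before anything else.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches <meta charset="..."> as well as the http-equiv form, where
# "charset=" appears inside the content attribute.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def sniff_html_charset(raw: bytes, prescan_limit: int = 1024):
    """Guess the charset of an HTML byte string, or return None."""
    # Step 1: a BOM, if present, decides the encoding outright.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # Step 2: treat the bytes as 8-bit, ASCII-compatible text and look for
    # a <meta> charset declaration near the start of the document.
    match = _META_CHARSET.search(raw[:prescan_limit])
    if match:
        return match.group(1).decode("ascii").lower()
    return None
```

Scanning raw bytes works because the markup of a `<meta>` declaration is plain ASCII in every ASCII-compatible encoding, which is how the chicken-and-egg problem from the question is avoided.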

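The XML 1.0 specification, Appendix F, also defines an algorithm for easily determining the charset used by the XML prolog, so that you can then read its encoding attribute (if present) to determine the charset of the rest of the XML.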
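A comparable sketch for XML, again in Python and again simplified: it checks for a BOM, then reads the encoding declaration from the prolog, and otherwise falls back to UTF-8. The function name `sniff_xml_encoding` is hypothetical, and the sketch assumes an ASCII-compatible encoding when no BOM is present, so the BOM-less UTF-16 and EBCDIC cases covered by Appendix F are not handled.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the document, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
_ENCODING_DECL = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def sniff_xml_encoding(raw: bytes) -> str:
    """Guess the encoding of an XML byte string."""
    # A BOM identifies the encoding family directly.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM: assume the prolog is readable as ASCII-compatible bytes
    # and look for an explicit encoding declaration.
    match = _ENCODING_DECL.match(raw)
    if match:
        return match.group(1).decode("ascii").lower()
    # XML 1.0 treats an entity with neither a BOM nor an encoding
    # declaration as UTF-8 when no external information is available.
    return "utf-8"
```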

For "html - How can I determine the character set of HTML content from the HTTP headers?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44344533/
