gpt4 book ai didi

perl - 使用 Encode::encode 和 "utf8"

转载 作者:行者123 更新时间:2023-12-03 01:35:23 29 4
gpt4 key购买 nike

您可能知道,在 Perl 中,“utf8”意味着 Perl 对 UTF-8 的宽松理解,它允许使用技术上不是 UTF-8 中有效代码点的字符。相比之下,“UTF-8”(或“utf-8”)是 Perl 对 UTF-8 更严格的理解,它不允许无效的代码点。

我有一些与此区别相关的使用问题:

  1. Encode::encode 默认情况下会用替换字符替换无效字符。即使您传递更宽松的“utf8”作为编码,这是真的吗?

  2. 当您读取和写入使用“UTF-8”打开的文件时会发生什么?字符替换是否发生在坏字符上,还是发生了其他情况?

  3. 使用 open 与“>:utf8”等图层和“>:encoding(utf8)”等图层有什么区别?这两种方法都可以与“utf8”和“UTF-8”一起使用吗?

最佳答案

<表类=“s-表”><标题>读取时,
除序列长度之外的无效编码读取时、
Unicode 之外、
Unicode 非字符或
Unicode 代理写入时、
Unicode 之外、
Unicode 非字符或
Unicode 代理 <正文> :encoding(UTF-8) 警告和替换警告和替换警告和替换 :encoding(utf8) 警告和替换接受警告和输出 :utf8 损坏的标量接受警告和输出

(这是 Perl 5.26 中的状态。)

请注意:encoding(UTF-8)实际上使用 utf8 进行解码,然后检查结果字符是否在可接受的范围内。这减少了因错误输入而产生的错误消息的数量,所以这是很好的。

(编码名称不区分大小写。)

<小时/>

用于生成上表的测试:

正在阅读

  • :encoding(UTF-8)

      $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
    perl -MB -nle'
    use open ":std", ":encoding(UTF-8)";
    my $sv = B::svref_2object(\$_);
    printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
    '
    utf8 "\xFFFF" does not map to Unicode.
    utf8 "\xD800" does not map to Unicode.
    utf8 "\x200000" does not map to Unicode.
    utf8 "\x80" does not map to Unicode.
    E9 (internal: C3.A9, UTF8=1)
    5C.78.7B.46.46.46.46.7D = \x{FFFF} (internal: 5C.78.7B.46.46.46.46.7D, UTF8=1)
    5C.78.7B.44.38.30.30.7D = \x{D800} (internal: 5C.78.7B.44.38.30.30.7D, UTF8=1)
    5C.78.7B.32.30.30.30.30.30.7D = \x{200000} (internal: 5C.78.7B.32.30.30.30.30.30.7D, UTF8=1)
    5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
  • :encoding(utf8)

      $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
    perl -MB -nle'
    use open ":std", ":encoding(utf8)";
    my $sv = B::svref_2object(\$_);
    printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
    '
    utf8 "\x80" does not map to Unicode.
    E9 (internal: C3.A9, UTF8=1)
    FFFF (internal: EF.BF.BF, UTF8=1)
    D800 (internal: ED.A0.80, UTF8=1)
    200000 (internal: F8.88.80.80.80, UTF8=1)
    5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
  • :utf8

      $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
    perl -MB -nle'
    use open ":std", ":utf8";
    my $sv = B::svref_2object(\$_);
    printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
    '
    E9 (internal: C3.A9, UTF8=1)
    FFFF (internal: EF.BF.BF, UTF8=1)
    D800 (internal: ED.A0.80, UTF8=1)
    200000 (internal: F8.88.80.80.80, UTF8=1)
    Malformed UTF-8 character: \x80 (unexpected continuation byte 0x80, with no preceding start byte) in printf at -e line 4, <> line 5.
    0 (internal: 80, UTF8=1)

写入时

  • :encoding(UTF-8)

      $ perl -e'
    use open ":std", ":encoding(UTF-8)";
    print "\x{E9}\n";
    print "\x{FFFF}\n";
    print "\x{D800}\n";
    print "\x{20_0000}\n";
    ' >a
    Unicode non-character U+FFFF is not recommended for open interchange in print at -e line 4.
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 5.
    Code point 0x200000 is not Unicode, may not be portable in print at -e line 6.
    "\x{ffff}" does not map to utf8.
    "\x{d800}" does not map to utf8.
    "\x{200000}" does not map to utf8.

    $ od -t c a
    0000000 303 251 \n \ x { F F F F } \n \ x { D
    0000020 8 0 0 } \n \ x { 2 0 0 0 0 0 } \n
    0000040

    $ cat a
    é
    \x{FFFF}
    \x{D800}
    \x{200000}
  • :encoding(utf8)

      $ perl -e'
    use open ":std", ":encoding(utf8)";
    print "\x{E9}\n";
    print "\x{FFFF}\n";
    print "\x{D800}\n";
    print "\x{20_0000}\n";
    ' >a
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 4.
    Code point 0x200000 is not Unicode, may not be portable in print at -e line 5.

    $ od -t c a
    0000000 303 251 \n 355 240 200 \n 370 210 200 200 200 \n
    0000015

    $ cat a
    é


  • :utf8

    :encoding(utf8) 相同的结果.

使用 Perl 5.26 进行测试。

<小时/>

Encode::encode by default will replace invalid characters with a substitution character. Is that true even if you are passing the looser "utf8" as the encoding?

Perl 字符串是 32 位或 64 位字符的字符串,具体取决于构建。 utf8可以编码任何72位整数。因此,它能够对所有需要编码的字符进行编码。

关于perl - 使用 Encode::encode 和 "utf8",我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/49038533/

29 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com