perl - 我怎么能捕捉到 "Unicode non-character"警告？-6ren

perl - 我怎么能捕捉到 "Unicode non-character"警告？

转载作者：行者123 更新时间：2023-12-04 10:02:09

我怎么能捕捉到“Unicode 非字符 0xffff 对于交换是非法的”警告？

#!/usr/bin/env perl
use warnings;
use 5.012;
use Try::Tiny;

use warnings FATAL => qw(all);

my $character;

try {
    $character = "\x{ffff}";
} catch {
    die "---------- caught error ----------\n";
};

say "something";

输出:

# Unicode non-character 0xffff is illegal for interchange at ./perl1.pl line 11.

最佳答案

A Perl 5.10.0⋯5.13.8 错误

我将假设您实际上并不想“捕捉”这个警告，而是要生存或忽略它。如果你真的想捕获它，那么可能有更简单的方法来做到这一点。

但首先要知道的是，不存在非法代码点，只有不可互换的代码点。

你只需要使用 no warnings "utf8"对于需要使用完整 Unicode 范围(或更多)的范围。 无需使用 eval为了这。 所需要的只是一个范围内的警告抑制。即使在较新的 perls 上也没有必要。

所以而不是这个:

$char = chr(0xFFFE);

写(在旧的 perls 上):

$char = do { no warnings "utf8"; chr(0xFFFE) };

这也是涉及此类字符的模式匹配的情况:

 $did_match = do { no warnings "utf8" ; $char =~ $char);

将导致警告或致命，这取决于你的 perl 有多旧，或者根本没有，取决于你的 perl 有多新。

您只能在以这种方式重要的版本上禁用与 utf8 相关的警告:

no if $^V < 5.13.9, qw<warnings utf8>;

“在下一个版本中修复”

真正有趣的是他们(阅读:Perl5 Porters，特别是 Karl Williamson)修复了需要 no warnings "utf8" 的错误。保护只是为了处理任何代码点。这只是您可能需要小心的输出。 watch :

% perl5.10.0 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode character 0xfffe is illegal at -e line 1.

% perl5.11.3 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.12.0 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.12.3 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.13.0 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.13.8 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.13.9 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Ok

% perl5.13.10 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Ok

最安全的做法是放 no warnings "utf8"在您需要的地方。但是不需要 eval !

从 5.13.10 开始，因此在 5.14 中，utf8 警告分为三个子类别: surrogate对于 UTF-16， nonchar如下所述，以及 non_unicode对于 super ，也定义如下。

All-Perl 交换是安全的

不过，您可能不想抑制输出中的“非法交换”警告，因为这是真的。好吧，除非你使用 Perl 的 "utf8"编码，这与其 "UTF‑8" 不同编码，很奇怪。 "utf8"编码比正式标准更宽松，因为它允许我们做比其他方式更有趣的事情。

然而 , 当且仅当您拥有 100% 纯 perl 数据路径时，您仍然可以使用任何您想要的代码点，包括高达 ᴍᴀxɪɴᴛ 的非 unicode 代码点。这在 32 位机器上是 0x7FFF_FFFF，而在 64 位机器上则是无法形容的巨大:0xFFFF_FFFF_FFFF_FFFF!这不仅仅是一个 super ；这是一个 super !

% perl -Mwarnings -CS -E 'my $a = chr(0xFFFF_FFFF); say $a ' | 
  perl -Mwarnings -CS -nlE 'say "got ord ", ord'
Code point 0xFFFFFFFF is not Unicode, may not be portable at -e line 1.
got ord 4294967295

% perl -Mwarnings -CS -E 'no warnings "utf8"; my $a = chr(0xFFFF_FFFF); say $a' |
 perl -Mwarnings -CS -nlE 'say "got ord ", ord'
got ord 4294967295

% perl -Mwarnings -CS -E 'no warnings "utf8"; my $a = chr(0xFFFF_FFFF_FFFF_FFFF); say $a' |
  perl -Mwarnings -CS -nlE 'say "got ord ", ord'
Hexadecimal number > 0xffffffff non-portable at -e line 1.
got ord 18446744073709551615

% perl -Mwarnings -CS -E 'no warnings qw[ utf8 portable ]; my $a = chr(0xFFFF_FFFF_FFFF_FFFF);  say $a ' |
  perl -Mwarnings -CS -nlE 'say "got ord ", ord'
got ord 18446744073709551615

请注意，在 32 位机器上，最后一个会产生以下结果:

Integer overflow in hexadecimal number at -e line 1.
got ord 4294967295

各种非法交换的非字符

有几种——实际上相当多——不同类别的代码点是不合法的。

任何代码点，如 (ord(ᴄᴏᴅᴇᴘᴏɪɴᴛ) & 0xFFFE) == 0xFFFE是真的。这涵盖了所有可能平面中的最后两个代码点。由于它跨越 17 个平面，因此 Unicode 定义了 34 个这样的代码点。这些不是字符，尽管它们是 Unicode 代码点。让我们称这些为 Penults。他们属于 nonchar 5.13.10 或更高版本的警告类。

从 U+FDD0 开始的 32 个代码点。这些保证是非字符，当然它们仍然是 Unicode 代码点。与之前的倒数第二组一样，这些也属于 nonchar 5.13.10 或更高版本的警告类。

1024 个高代理和 1024 个低代理，它们被雕刻成斜线，使 UTF-16 可以用于所有尝试 UCS-2 而不是 UTF-8 或 UTF-32 的愚蠢系统。这削弱了有效 Unicode 代码点的范围，将它们限制为仅前 21 位值。 代理仍然是代码点 .它们只是不能用于交换，因为它们不能总是由聪明的 UTF-16 正确表示。在 5.13.10 或更高版本下，这些由 surrogate 控制警告子类。

除此之外，我们现在高于 Unicode 范围。我会称这些为 super 。在 32 位机器上，除了 Unicode 为您提供的标准 21 位之外，您仍然拥有(10 或)11 位。 Perl 可以很好地使用这些。这给出了您可以在 Perl 程序中使用的总共 2**32 个代码点(好吧，或者至少 2**31，由于有符号溢出)。你得到了 100 万个 Unicode 代码点，但随后你会得到几十亿个 Super 代码点，超出了你可以在 Perl 中使用的那些代码点。如果您运行的是 5.13.10 或更高版本，您可以通过 non_unicode 控制对这些的访问。警告子类。

Perl 仍然遵循有关 Penults 的规则，即使在 Super 范围内也是如此。在 32 位机器上有 480 个这样的 Superpenults，在 64 位机器上有更多。

如果你真的想以不可移植的方式播放它，那么如果你有原生的 64 位整数，那么你还有 32 位或 33 位高于 super 给你的。您现在有 18 quintillion 446 千万亿 744 万亿 730 亿 7.09 亿 551 千和 616 个字符。您有 整个艾字节 不同的代码点!这远远超出了我将其称为 Hypermegas 的范围。好的，所以这些不是很便携，因为它们需要一个真正的 64 位平台。他们有点陌生，所以也许我们应该写那个Ὑπέρμεγας来吓跑人们。 :) 请注意，针对倒数第二者的规则仍然适用于 hypermegas。

测试程序

我写了一个小程序来证明这些代码点很酷。

testing Penults             passed all 34 codepoints
testing Super_penults       passed all 480 codepoints
testing Noncharacters       passed all 32 codepoints
testing Low_surrogates      passed all 1024 codepoints
testing High_surrogates     passed all 1024 codepoints
testing Supers              passed all 8 codepoints
testing Ὑπέρμεγας            passed all 10 codepoints

注意 :上面的最后一行显示了 SO 的 hell 高亮代码中的另一个愚蠢的错误。注意上面最后一个 WɪᴋɪWᴏʀᴅ， \p{Greek}一，被排除在着色方案之外？这意味着他们只寻找大写的 ASCII 身份标识。过时了!如果你不打算使用像 \p{Uppercase} 这样的东西，为什么还要接受 ᴜɴɪᴄᴏᴅᴇ正确吗？正如你在我的程序中看到的，我有一个 @ὑπέρμεγας数组，我们 ᴍᴏᴅᴇʀɴ ᴘʀᴏɢʀᴀᴍᴍɪɴɢ ʟᴀɴɢᴜᴀɢᴇ 处理这个非常好。 ☺

我显然没有跑过所有的 super 或 super 。而在 32 位机器上，你只会得到 4 个经过测试的 hypers。我也没有测试任何超倒数。

这是测试程序，它可以在 5.10 及更高版本的所有版本上干净地运行。

#!/usr/bin/env perl
#
# hypertest - show how to safely use code points not legal for interchange in Perl
# 
# Tom Christiansen
# tchrist@perl.com
# Sat Feb 26 16:38:44 MST 2011

use utf8;
use 5.10.0;
use strict;
use if $] > 5.010, "autodie";
use warnings FATAL => "all";

use Carp;

binmode(STDOUT, ":utf8");
END { close STDOUT }

$\ = "\n";

sub ghex(_);

my @penults = map { 
    (0x01_0000 * $_) + 0xfffE, 
    (0x01_0000 * $_) + 0xfffF, 
} 0x00 .. 0x10;

my @super_penults = map { 
    (0x01_0000 * $_) + 0xfffE, 
    (0x01_0000 * $_) + 0xfffF, 
} 0x10 .. 0xFF;

my @low_surrogates  = map { 0xDC00 + $_ } 0x000 .. 0x3FF;
my @high_surrogates = map { 0xD800 + $_ } 0x000 .. 0x3FF;

my @noncharacters = map { 0xFDD0 + $_ } 0x00 .. 0x1F;

my @supers = ( 
    0x0011_0000,  0x0100_0000,  0x1000_0000,  0x1F00_0000,  
    0x1FFF_FFFF,  0x3FFF_FFFF,  0x7FFF_FFFF,  0x7FFF_FFFF,  
);

# these should always work anywhere 
my @ὑπέρμεγας = ( 
    0x8000_0000,   0xF000_0000,   
    0x3FFF_FFFF,   0xFFFF_FFFF,  
);

####
# now we go fishing for 64-bit ὑπέρμεγας
####

eval q{
    use warnings FATAL => "overflow";
    no  warnings "portable";
    push @ὑπέρμεγας => ( 
        0x01_0000_0000, 
        0x01_FFFF_FF00,
    );
};
eval q{
    use warnings FATAL => "overflow";
    no  warnings "portable";
    push @ὑπέρμεγας => (
        0x0001_0000_0000_0000,
        0x001F_0000_0000_0000,
        0x7FFF_FFFF_FFFF_FFFF,
        0xFFFF_FFFF_FFFF_FFFF,
    );
};

# more than 64??
eval q{
    use warnings FATAL => "overflow";
    no  warnings "portable";
    push @ὑπέρμεγας => (
        0x01_0001_0000_0000_0000,
        0x01_7FFF_FFFF_FFFF_FFFF,
        0x01_FFFF_FFFF_FFFF_FFFF,
    );
    1;
};


my @testpairs = (
    penults         => \@penults,
    super_penults   => \@super_penults,
    noncharacters   => \@noncharacters ,
    low_surrogates  => \@low_surrogates,
    high_surrogates => \@high_surrogates,
    supers          => \@supers,
    ὑπέρμεγας       => \@ὑπέρμεγας,   
);

while (my($name, $aref) = splice(@testpairs, 0, 2)) {
    printf "testing %-20s", ucfirst $name;

    my(@passed, @failed);

    for my $codepoint (@$aref) {

        use warnings FATAL => "all";

        my $char = do {
            # next line not needed under 5.13.9 or better: HURRAY!
            no warnings "utf8";
            chr(0xFFFF) && chr($codepoint);
        };

        my $regex_ok = do {
            # next line not needed under 5.13.9 or better: HURRAY!
            no warnings "utf8";
            $char =~ $char;
            1;
        };

        my $status = defined($char) && $regex_ok;

        push @{ $status ? \@passed : \@failed }, $codepoint;
    }

    my $total  = @$aref;
    my $passed = @passed;
    my $failed = @failed;

    given($total) {
        when ($passed)  { print "passed all $total codepoints" }
        when ($failed)  { print "failed all $total codepoints" }
        default         {
            print "of $total codepoints, failed $failed and passed $passed";
            my $flist = join(", ", map { ghex } @failed);
            my $plist = join(", ", map { ghex } @passed);
            print "\tpassed: $plist";
            print "\tfailed: $flist";
        }
    }

}

sub ghex(_) {
    my $num = shift();
    my $hex = sprintf("%X", $num);
    return $hex if length($hex) < 5;
    my $flip = reverse $hex;
    $flip =~ s<
        ( \p{ahex} \p{ahex} \p{ahex} \p{ahex} )
        (?= \p{ahex} )
        (?! \p{ahex}* \. )
    ><${1}_>gx;
    return "0x" . reverse($flip);
}

关于perl - 我怎么能捕捉到 "Unicode non-character"警告？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/5127725/

文章推荐： drupal - Drush 命令将新的模块依赖项添加到功能

文章推荐： jsf - 使用 JSF selectOneListbox 存储 Liferay 7 portlet 首选项

文章推荐： email - 如何在电子邮件中使用 Hudson "Parameterized Build"值

文章推荐： jboss - Wildfly 和 jackson : LinkageError

java - 如何让我的程序将每个字符打印为 "Character #1: (character) , Character #2: (character), etc"？
int i; System.out.print("Please enter a string: "); String string_1 = input.nextLine(); System.out
c# - 用于检查密码是否为 "8 characters including 1 uppercase letter, 1 special character, alphanumeric characters"的正则表达式
我想要一个正则表达式来检查 a password must be eight characters including one uppercase letter, one special charac
Python 命令行 "characters"返回 'characters'
在此先感谢您的帮助。在命令行输入“example”时，Python 返回“example”。我在网上找不到任何东西来解释这一点。所有引用资料都在 print 命令的上下文中谈到字符串，我得到了所有关
character-encoding - 如何解码乱码编码: Special Character Encoding
我有 CSV 格式的数据，这些数据在字符编码方面被严重打乱，可能在不同的软件应用程序(LibreOffice Calc、Microsoft、Excel、Google Refine、自定义 PHP/My
character-encoding - Sublime Text : Not representable characters
我正在为 Latex 使用 Sublime Text，所以我需要使用特定的编码。但是，在某些情况下，当我粘贴从不同程序(大多数情况下为单词/浏览器)复制的文本时，我收到以下消息: "Not all c
Flutter FormatException : Unexpected character (at character 1)
在 flutter 中，我使用了一个php文件，该文件从数据库查询返回json响应，但是当我尝试解码json时，出现此错误: E/flutter ( 8294): [ERROR:flutter/lib
firebase - Flutter格式异常: Unexpected Character (at character 1)
我在 Flutter 项目中遇到异常。错误如下所示: Exception has occurred. FormatException (FormatException: Unexpected char
java - : character >= '0' and character <= 0 之间的差异
这个问题已经有答案了: Why doesn't my compare work between char and int in Java? (4 个回答) 已关闭 3 年前。我试图在我的代码中找出
firebase - Flutter格式异常: Unexpected Character (at character 1)
我在 Flutter 项目中遇到异常。错误如下所示: Exception has occurred. FormatException (FormatException: Unexpected char
Python脚本: Character 1 equal to character n
我是 python 新手，需要一些帮助。我应该编写一个脚本，从键盘读取单词，直到输入单词 999。对于除 999 之外的每个单词，报告该单词是否有效。如果单词的第一个字符等于最后一个字符，则该单词有效
C# 语法 : What character is the "new line" character?
我正在实现自己的词法分析器，并且刚刚了解了 C# 如何处理字 rune 字:https://msdn.microsoft.com/en-us/library/aa691087(v=vs.71).asp
JavaScript : Replace all character by empty after a specific character
我有这个字符串: var test = "toto@test.com"; 我想用空值替换“@”字符后的所有字符。我想得到这个字符串: var test = "toto" 最佳答案试试这个: test
python - 运算符不存在 : character varying + character varying
我将数据库从 sqlite 更改为 postgresql 以用于我网站的生产，但出现此错误。当我在本地使用 sqlite 时，它没有出现这个错误。使用 Django。 ProgrammingErr
C语言 : there is a trailing character after the last character of my output
我正在为我的实验表制作凯撒密码，并使其能够加密 3 代入(凯撒密码)，这是练习的重点。但是有一件事困扰着我。首先，如果我输入 3 以外的字符，则有一个尾随字符。例如，输入“恶意软件”，然后输入 2 作
c - 使用八进制字符给出警告 : multi-character character constant
遵循 this question 中的逻辑，以下代码应该有效: #include int main(){ printf("%c", '\0101'); return 0; } 然而，
c# - "Too many characters in character literal error"
我在处理一段代码时遇到错误: Too many characters in character literal error 使用 C# 和 switch 语句遍历字符串缓冲区并读取标记，但在这一行中出
xslt - xsl :character-map to replace special characters
给定一个元素，其值为: Distrib = SU & Prem <> 0 我要转<或 >进入 <或 >因为下游应用程序需要
r - R 中的 as.character() 和 as( ,"character") 有什么区别
从表面上看，他们似乎都在做同样的事情。但似乎是后者as(,"character")更强大。作为示例，请考虑以下内容: library(rvest) temp % html_node("div p")
python - Python语法错误: unexpected character after line continuation character
我刚开始使用python，所以很可能只是在做一些愚蠢的事情。我正在从表中读取数据，需要将它们放入txt文件的列中。我无法说服我的代码创建新行。这是我的代码- file = open("test_m.
arrays - "character*10::a"和 "character::a(10)"之间的区别
在尝试刷新我的项目的 Fortran 90 知识时，我在使用内部文件时遇到了一些奇怪的情况。考虑示例代码: ! ---- internal_file_confusion.f90 ---- progra

行者123

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

perl - 我怎么能捕捉到 "Unicode non-character"警告？