
perl - Converting hex to UTF8 does not work as expected in Perl

Reposted. Author: 行者123. Updated: 2023-12-02 09:05:21

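I am trying to understand UTF8 in Perl.

I have the following string: Alizéh. If I look up the hex for this string, I get 416c697ac3a968 from https://onlineutf8tools.com/convert-utf8-to-hexadecimal (this matches the original source of the string).

So I thought that packing the hex and encoding it as utf8 should produce the unicode string. But it produces something very different.

Can anyone explain what I am doing wrong?

Here is a simple test program to show my working.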

#!/usr/bin/perl

use strict;
use warnings;

use Text::Unaccent;
use Encode;

use utf8;
binmode STDOUT, ':encoding(UTF-8)';

print "First test that the utf8 string Alizéh prints as expected\n\n";

print "=========================================== Hex to utf8 test start\n";

my $hexRepresentationOfTheString = '416c697ac3a968';
my $packedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "The hex of the string is $hexRepresentationOfTheString\n";
print "The string after packing prints as $packedHexIntoPlainString\n";
utf8::encode($packedHexIntoPlainString);
print "Utf8 encoding the string produces $packedHexIntoPlainString\n";

print "=========================================== Hex to utf8 test finish\n\n";

print "=========================================== utf8 from code test start\n";
my $utf8FromCode = "Alizéh";
print "Variable prints as $utf8FromCode\n";

my ($hex) = unpack("H*", $utf8FromCode);

print "Hex of this string is now $hex\n";

print "Decoding the utf8 string\n";
utf8::decode($utf8FromCode);

$hex = unpack ("H*", $utf8FromCode);
print "Hex string is now $hex\n";

print "=========================================== utf8 from code test finish\n\n";

This prints:

First test that the utf8 string Alizéh prints as expected

=========================================== Hex to utf8 test start
The hex of the string is 416c697ac3a968
The string after packing prints as Alizéh
Utf8 encoding the string produces Alizéh
=========================================== Hex to utf8 test finish

=========================================== utf8 from code test start
Variable prints as Alizéh
Hex of this string is now 416c697ae968
Decoding the utf8 string
Hex string is now 416c697ae968
=========================================== utf8 from code test finish

Any tips on how to take the hex value of a UTF8 string and turn it into a valid UTF8 scalar in Perl?

There is some further oddness, which I show in this extended version:

#!/usr/bin/perl

use strict;
use warnings;

use Text::Unaccent;
use Encode;

use utf8;
binmode STDOUT, ':encoding(UTF-8)';

print "First test that the utf8 string Alizéh prints as expected\n\n";

print "=========================================== Hex to utf8 test start\n";

my $hexRepresentationOfTheString = '416c697ac3a968';
my $packedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "The hex of the string is $hexRepresentationOfTheString\n";
print "The string after packing prints as $packedHexIntoPlainString\n";
utf8::encode($packedHexIntoPlainString);
print "Utf8 encoding the string produces $packedHexIntoPlainString\n";

print "=========================================== Hex to utf8 test finish\n\n";

print "=========================================== utf8 from code test start\n";
my $utf8FromCode = "Alizéh";
print "Variable prints as $utf8FromCode\n";

my ($hex) = unpack("H*", $utf8FromCode);

print "Hex of this string is now $hex\n";

print "Decoding the utf8 string\n";
utf8::decode($utf8FromCode);

$hex = unpack ("H*", $utf8FromCode);
print "Hex string is now $hex\n";

print "=========================================== utf8 from code test finish\n\n";

print "=========================================== Unaccent test start\n";

my $plaintest = unac_string('utf8', "Alizéh");

print "Alizéh passed to the unaccent gives $plaintest\n";


my $cleanpackedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "Packed version of the hex string prints as $cleanpackedHexIntoPlainString\n";

my $packedtest = unac_string('utf8', $cleanpackedHexIntoPlainString);

print "Unaccenting the packed version gives $packedtest\n";

utf8::encode($cleanpackedHexIntoPlainString);
print "encoding the packed version it now prints as $cleanpackedHexIntoPlainString\n";

$packedtest = unac_string('utf8', $cleanpackedHexIntoPlainString);

print "Now unaccenting the packed version gives $packedtest\n";

print "=========================================== Unaccent test finish\n\n";

This prints:

First test that the utf8 string Alizéh prints as expected

=========================================== Hex to utf8 test start
The hex of the string is 416c697ac3a968
The string after packing prints as Alizéh
Utf8 encoding the string produces Alizéh
=========================================== Hex to utf8 test finish

=========================================== utf8 from code test start
Variable prints as Alizéh
Hex of this string is now 416c697ae968
Decoding the utf8 string
Hex string is now 416c697ae968
=========================================== utf8 from code test finish

=========================================== Unaccent test start
Alizéh passed to the unaccent gives Alizeh
Packed version of the hex string prints as Alizéh
Unaccenting the packed version gives Alizeh
encoding the packed version it now prints as Alizéh
Now unaccenting the packed version gives AlizA©h
=========================================== Unaccent test finish

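In this test, the unaccent library seems to accept the packed version of the hex string. I am not sure why; can anyone help me understand why this is the case?

Best answer

Unicode strings are first-class values in Perl; you don't need to jump through these hoops. You just need to recognize and keep track of when you have bytes and when you have characters. Perl won't differentiate for you, and every byte string is also a valid character string. In effect you are double-encoding your strings: the result is still a valid string, but it is the UTF-8 encoding of (the characters corresponding to) your UTF-8 encoded bytes.

use utf8; decodes your source code from UTF-8, so with it declared, the literal strings that follow are already Unicode strings and can be passed to any API that correctly accepts characters. To get the same thing from a string of UTF-8 bytes (which is what you produce by packing the hex representation of the bytes), use decode from Encode (or my nicer wrapper).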
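The double encoding described above can be made visible by comparing hex dumps before and after utf8::encode. This is only a small sketch using the hex string from the question; the variable names are chosen here for illustration.

```perl
use strict;
use warnings;

# The seven bytes of the UTF-8 encoding of "Alizéh", taken from the question.
my $bytes = pack 'H*', '416c697ac3a968';

# utf8::encode treats every byte as one character (here U+00C3 and U+00A9)
# and encodes each of them to UTF-8 again.
my $double = $bytes;
utf8::encode($double);

# A single utf8::decode only undoes that extra layer and yields the bytes again.
my $undone = $double;
utf8::decode($undone);

my $bytes_hex  = unpack 'H*', $bytes;
my $double_hex = unpack 'H*', $double;
my $undone_hex = unpack 'H*', $undone;

print "original bytes: $bytes_hex\n";   # 416c697ac3a968
print "double encoded: $double_hex\n";  # 416c697ac383c2a968
print "decoded once:   $undone_hex\n";  # 416c697ac3a968
```

The c3 a9 pair becoming c3 83 c2 a9 is the second layer of encoding, which is why the output in the question looks like mojibake rather than é.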

use strict;
use warnings;
use utf8;
use Encode 'decode';

my $str = 'Alizéh'; # already decoded
my $hex = '416c697ac3a968';
my $bytes = pack 'H*', $hex;
my $chars = decode 'UTF-8', $bytes;
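
As a rough check that the decoded string really is the same as the literal, the lengths and the code point of the accented letter can be compared. In this sketch the literal is written with an \x{E9} escape instead of use utf8 so that it does not depend on how the file is saved; that substitution is an assumption made for the example, not part of the answer above.

```perl
use strict;
use warnings;
use Encode 'decode';

my $str   = "Aliz\x{E9}h";                  # six characters, U+00E9 is é
my $bytes = pack 'H*', '416c697ac3a968';    # seven UTF-8 bytes
my $byte_len = length $bytes;

my $chars    = decode 'UTF-8', $bytes;      # bytes -> characters
my $char_len = length $chars;

my $same       = $chars eq $str ? 'yes' : 'no';
my $code_point = sprintf 'U+%04X', ord substr($chars, 4, 1);

print "byte length:      $byte_len\n";    # 7
print "character length: $char_len\n";    # 6
print "equal to literal: $same\n";        # yes
print "fifth character:  $code_point\n";  # U+00E9
```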

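Unicode strings need to be encoded to UTF-8 for output to anything that expects bytes, such as STDOUT. An :encoding(UTF-8) layer can be applied to such a handle to do this automatically, and likewise to decode automatically from an input handle. Exactly what should be applied depends entirely on where your characters come from and where they are going. See this answer for too much information on the available options.

use Encode 'encode';
# Either encode explicitly and print the resulting bytes...
print encode 'UTF-8', "$chars\n";
# ...or apply an encoding layer once and print characters directly.
binmode *STDOUT, ':encoding(UTF-8)'; # warning: global effect
print "$chars\n";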
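
The same rule explains the second test in the question: unpack 'H*' on a character string dumps code points, so the UTF-8 hex only appears once the string has been encoded. The sketch below also shows the input side by reading through an :encoding(UTF-8) layer; the in-memory filehandle is used here only as a stand-in for a real file or socket.

```perl
use strict;
use warnings;
use Encode 'encode';

my $chars = "Aliz\x{E9}h";    # character string, é is U+00E9

# Hex of the characters themselves versus hex of their UTF-8 encoding.
my $char_hex = unpack 'H*', $chars;
my $utf8_hex = unpack 'H*', encode('UTF-8', $chars);

print "characters: $char_hex\n";    # 416c697ae968
print "UTF-8:      $utf8_hex\n";    # 416c697ac3a968

# Input side: the layer decodes the bytes while they are being read.
my $bytes = pack 'H*', $utf8_hex;
open my $in, '<:encoding(UTF-8)', \$bytes
    or die "Cannot open in-memory handle: $!";
my $line = <$in>;
close $in;

my $round_trip = $line eq $chars ? 'yes' : 'no';
print "read back as characters: $round_trip\n";    # yes
```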

About "perl - Converting hex to UTF8 does not work as expected in Perl": we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59276286/
