perl - Multilingual text sorting in Perl, on Windows, using locales

Reposted. Author: 行者123. Updated: 2023-12-04 22:22:00

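I'm building a piece of software for sorting book indexes in different languages. It uses Perl and keys off the locale. I'm developing it on Unix, but it needs to be portable to Windows. Should this work in principle, or, by relying on the locale, am I barking up the wrong tree? Bottom line: Windows is really where I need this to work, but I'm more comfortable developing in my Unix environment.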

Best answer

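Assuming that your starting point is Unicode, because you have been very careful to decode all incoming data no matter what its native encoding might be, it is easy to use the Unicode::Collate module as a starting point.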

If you want locale tailoring, then you probably want to start with Unicode::Collate::Locale instead.

Decoding into Unicode

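If you run in an all-UTF8 environment, this is easy. But if you are subject to the vicissitudes of random so-called "locales" (or, even worse, the ugly things that Microsoft calls "code pages"), then you might want to get the CPAN Encode::Locale module to help you. For example: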

    use Encode;
    use Encode::Locale;

    # use "locale" as an arg to encode/decode
    @ARGV = map { decode(locale => $_) } @ARGV;

    # or as a stream for binmode or open
    binmode $some_fh, ":encoding(locale)";

    binmode STDIN, ":encoding(console_in)" if -t STDIN;
    binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
    binmode STDERR, ":encoding(console_out)" if -t STDERR;

(If it were me, I would just use ":utf8" for the output.)
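
If Encode::Locale is not installed, the same decode-at-the-edge idea can be sketched with core Encode and an explicitly named encoding. The cp1252 code page and the sample bytes below are assumptions chosen for illustration; they are not from the question:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: these bytes arrived from a Windows code page 1252 source.
my $bytes = "caf\xE9";                  # 4 bytes; 0xE9 is "é" in cp1252
my $text  = decode("cp1252", $bytes);   # now a Perl character string

printf "%d characters, last one is U+%04X\n",
    length($text), ord(substr($text, -1));

# Encode explicitly on the way out, here as UTF-8.
my $utf8 = encode("UTF-8", $text);
printf "%d bytes as UTF-8\n", length($utf8);
```

This should report 4 characters ending in U+00E9, and 5 bytes once encoded as UTF-8. Everything between the decode and the encode works on characters, which is what the collation modules expect.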

Standard collation, plus locales and tailoring

The point is that once you have everything decoded into Perl's internal format, you can use Unicode::Collate and Unicode::Collate::Locale on it. These can be really easy:

    use v5.14;
    use utf8;
    use Unicode::Collate;
    my @exes = qw( x⁷ x⁰ x⁸ x³ x⁶ x⁵ x⁴ x² x⁹ x¹ );
    @exes = Unicode::Collate->new->sort(@exes);
    say "@exes";

    # prints: x⁰ x¹ x² x³ x⁴ x⁵ x⁶ x⁷ x⁸ x⁹

Or they can be pretty fancy. Here is one that tries to deal with book titles: it strips leading articles and zero-pads numbers.

    my $collator = Unicode::Collate->new(
        upper_before_lower => 1,
        preprocess => sub {
            local $_ = shift;
            s/^ (?: The | An? ) \h+ //x;            # strip articles
            s/ ( \d+ ) / sprintf "%020d", $1 /xeg;  # zero-pad numbers
            return $_;
        },
    );

Now just use that object's sort method to sort.
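
As a rough end-to-end sketch of that idea, the block below builds a collator with a preprocess callback and sorts a few titles. The titles are made up for the example, and the callback is written with an explicit `sub` and a lexical variable, which is one way (not necessarily the book's way) to spell it:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new(
    upper_before_lower => 1,
    preprocess => sub {
        my $str = shift;
        $str =~ s/^ (?: The | An? ) \h+ //x;            # strip a leading article
        $str =~ s/ ( \d+ ) / sprintf "%020d", $1 /xeg;  # zero-pad numbers
        return $str;
    },
);

# Hypothetical titles, invented for this example.
my @titles = ("The Hobbit", "Catch 22", "A Tale of Two Cities", "Catch 9");
my @sorted = $collator->sort(@titles);
print "$_\n" for @sorted;
```

The preprocessing only affects the sort keys, so the original titles come back unmodified: "Catch 9" lands before "Catch 22", and "The Hobbit" and "A Tale of Two Cities" file under H and T.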

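Sometimes you need to turn the sort inside out. For example:

    my $collator = Unicode::Collate->new();

    # Precompute a binary sort key for each record's NAME field.
    for my $rec (@recs) {
        $rec->{NAME_key} = $collator->getSortKey( $rec->{NAME} );
    }

    # Oldest first; ties broken by collated name.
    my @srecs = sort {
        $b->{AGE} <=> $a->{AGE}
                  ||
        $a->{NAME_key} cmp $b->{NAME_key}
    } @recs;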

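The reason you have to do that is that you are sorting records with several fields. The binary sort key lets you use the cmp operator on data that has already been through your chosen or customized collator object.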
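
To make the difference visible, here is a small self-contained sketch that sorts the same records once with collation sort keys and once with a plain cmp on the raw names. The records are hypothetical sample data:

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ":encoding(UTF-8)";

# Hypothetical records, invented for this example.
my @recs = (
    { NAME => "Zoë",   AGE => 30 },
    { NAME => "eve",   AGE => 30 },
    { NAME => "Adam",  AGE => 25 },
    { NAME => "Émile", AGE => 30 },
);

my $collator = Unicode::Collate->new;
$_->{NAME_key} = $collator->getSortKey($_->{NAME}) for @recs;

# Age descending, then name by collation sort key.
my @srecs = sort {
    $b->{AGE} <=> $a->{AGE} || $a->{NAME_key} cmp $b->{NAME_key}
} @recs;

# Same comparison, but on the raw strings (code point order).
my @naive = sort {
    $b->{AGE} <=> $a->{AGE} || $a->{NAME} cmp $b->{NAME}
} @recs;

print "collated: ", join(", ", map { $_->{NAME} } @srecs), "\n";
print "naive:    ", join(", ", map { $_->{NAME} } @naive), "\n";
```

With the sort keys the 30-year-olds come out as Émile, eve, Zoë; with the raw comparison they come out as Zoë, eve, Émile, because uppercase ASCII, lowercase ASCII, and accented letters fall into separate code point ranges.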

The full constructor for the collator object has all of this as its formal syntax:

    $Collator = Unicode::Collate->new(
        UCA_Version => $UCA_Version,
        alternate => $alternate, # alias for 'variable'
        backwards => $levelNumber, # or \@levelNumbers
        entry => $element,
        hangul_terminator => $term_primary_weight,
        highestFFFF => $bool,
        identical => $bool,
        ignoreName => qr/$ignoreName/,
        ignoreChar => qr/$ignoreChar/,
        ignore_level2 => $bool,
        katakana_before_hiragana => $bool,
        level => $collationLevel,
        minimalFFFE => $bool,
        normalization => $normalization_form,
        overrideCJK => \&overrideCJK,
        overrideHangul => \&overrideHangul,
        preprocess => \&preprocess,
        rearrange => \@charList,
        rewrite => \&rewrite,
        suppress => \@charList,
        table => $filename,
        undefName => qr/$undefName/,
        undefChar => qr/$undefChar/,
        upper_before_lower => $bool,
        variable => $variable,
    );

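But you usually don't have to worry about almost any of those. In fact, if you want country-specific locale tailoring using the CLDR data, you should just use Unicode::Collate::Locale, which adds exactly one more parameter to the constructor: locale => $country_code.

    use Unicode::Collate::Locale;
    $coll = Unicode::Collate::Locale->
        new(locale => "fr");
    @french_text = $coll->sort(@french_text);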

See how easy that is?
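
To see a tailoring actually change the result, one option is to sort the same words with the default collator and with a locale whose alphabet differs from the default order. The sketch below uses "sv" (Swedish), where ö is a separate letter after z; the word list is made up for the example:

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;
use Unicode::Collate::Locale;

binmode STDOUT, ":encoding(UTF-8)";

# Hypothetical word list, invented for this example.
my @words = ("zebra", "öl", "apa");

my @default = Unicode::Collate->new->sort(@words);
my @swedish = Unicode::Collate::Locale->new(locale => "sv")->sort(@words);

print "default: @default\n";
print "sv:      @swedish\n";
```

The default rules treat ö as an o with an accent, so "öl" sorts between "apa" and "zebra"; the Swedish tailoring moves it to the end. The calling code is identical in both cases, which is the point of keeping the locale inside the collator object rather than in the process environment.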

But you can do other really cool things, too.

    use v5.14;
    use utf8;
    use Unicode::Collate::Locale;

    binmode STDOUT, ":utf8";   # the message contains non-ASCII characters

    my $Collator = Unicode::Collate::Locale->new(
        locale        => "de__phonebook",
        level         => 1,
        normalization => undef,
    );

    my $full = "Ich müß Perl studieren.";
    my $sub  = "MUESS";
    if (my ($pos, $len) = $Collator->index($full, $sub)) {
        my $match = substr($full, $pos, $len);
        say "Found match of literal ‹$sub› in ‹$full› as ‹$match›";
    }

When run, that says:

    Found match of literal ‹MUESS› in ‹Ich müß Perl studieren.› as ‹müß›

Here are the locale modules available as of v0.96 of Unicode::Collate::Locale, taken from its manpage:

    locale name       description
    --------------------------------------------------------------
    af                Afrikaans
    ar                Arabic
    as                Assamese
    az                Azerbaijani (Azeri)
    be                Belarusian
    bg                Bulgarian
    bn                Bengali
    bs                Bosnian
    bs_Cyrl           Bosnian in Cyrillic (tailored as Serbian)
    ca                Catalan
    cs                Czech
    cy                Welsh
    da                Danish
    de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
    ee                Ewe
    eo                Esperanto
    es                Spanish
    es__traditional   Spanish ('ch' and 'll' as a grapheme)
    et                Estonian
    fa                Persian
    fi                Finnish (v and w are primary equal)
    fi__phonebook     Finnish (v and w as separate characters)
    fil               Filipino
    fo                Faroese
    fr                French
    gu                Gujarati
    ha                Hausa
    haw               Hawaiian
    hi                Hindi
    hr                Croatian
    hu                Hungarian
    hy                Armenian
    ig                Igbo
    is                Icelandic
    ja                Japanese [1]
    kk                Kazakh
    kl                Kalaallisut
    kn                Kannada
    ko                Korean [2]
    kok               Konkani
    ln                Lingala
    lt                Lithuanian
    lv                Latvian
    mk                Macedonian
    ml                Malayalam
    mr                Marathi
    mt                Maltese
    nb                Norwegian Bokmal
    nn                Norwegian Nynorsk
    nso               Northern Sotho
    om                Oromo
    or                Oriya
    pa                Punjabi
    pl                Polish
    ro                Romanian
    ru                Russian
    sa                Sanskrit
    se                Northern Sami
    si                Sinhala
    si__dictionary    Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
    sk                Slovak
    sl                Slovenian
    sq                Albanian
    sr                Serbian
    sr_Latn           Serbian in Latin (tailored as Croatian)
    sv                Swedish (v and w are primary equal)
    sv__reformed      Swedish (v and w as separate characters)
    ta                Tamil
    te                Telugu
    th                Thai
    tn                Tswana
    to                Tonga
    tr                Turkish
    uk                Ukrainian
    ur                Urdu
    vi                Vietnamese
    wae               Walser
    wo                Wolof
    yo                Yoruba
    zh                Chinese
    zh__big5han       Chinese (ideographs: big5 order)
    zh__gb2312han     Chinese (ideographs: GB-2312 order)
    zh__pinyin        Chinese (ideographs: pinyin order) [3]
    zh__stroke        Chinese (ideographs: stroke order) [3]
    zh__zhuyin        Chinese (ideographs: zhuyin order) [3]

Locales according to the default UCA rules include chr (Cherokee), de (German), en (English), ga (Irish), id (Indonesian),
it (Italian), ka (Georgian), ms (Malay), nl (Dutch), pt (Portuguese), st (Southern Sotho), sw (Swahili), xh (Xhosa), zu
(Zulu).

Note

[1] ja: Ideographs are sorted in JIS X 0208 order. Fullwidth and halfwidth forms are identical to their regular form. The
difference between hiragana and katakana is at the 4th level, the comparison also requires "(variable => 'Non-ignorable')",
and then "katakana_before_hiragana" has no effect.

[2] ko: Plenty of ideographs are sorted by their reading. Such an ideograph is primary (level 1) equal to, and secondary
(level 2) greater than, the corresponding hangul syllable.

[3] zh__pinyin, zh__stroke and zh__zhuyin: implemented alt='short', where a smaller number of ideographs are tailored.

Note: 'pinyin' is in latin, 'zhuyin' is in bopomofo.

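So the bottom line is that the main trick is to get your local data decoded into a uniform Unicode representation, and then to use deterministic sorting, possibly tailored, that doesn't rely on the random settings of the user's console window for correct behavior.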

Note: All these examples, apart from the manpage citation, are lovingly lifted from the 4th edition of Programming Perl, by kind permission of its author. :)

For more on "perl - Multilingual text sorting in Perl, on Windows, using locales", see the corresponding question on Stack Overflow: https://stackoverflow.com/questions/15013515/
