
Perl, HTML data and UTF-8 encoded characters

Reposted. Author: 行者123. Updated: 2023-12-01 15:07:48

Perl beginner here.

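I wrote a Perl script that parses data from an HTML site. The script encodes the data as UTF-8, and one of the fields contains Romanian characters, so encoding it produces wrong characters, for example: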

ţ = þ (incorrect); ş = º (incorrect); ă = ã (correct);

Example of a line parsed from the HTML:

Distribuţia: Robert Downey Jr. (Sherlock Holmes) Jude Law (Dr. John Watson) Rachel McAdams (Irene Adler) Mark Strong (Lord Blackwood) Kelly Reilly (Mary Morstan) Eddie Marsan (Inspectorul Lestrade) James Fox (Sir Thomas)

I want to split it like this:

my ($credits, $line);
foreach $credits (split /(?=\w+:)\s*/, $line) {
...

But because "þ" is treated as a non-word character, the output breaks in the wrong place (after "Distribuþ"):

Distribuþ
Robert Downey Jr. (Sherlock Holmes)
Jude Law (Dr. John Watson)
Rachel McAdams (Irene Adler)
Mark Strong (Lord Blackwood)
Kelly Reilly (Mary Morstan)
Eddie Marsan (Inspectorul Lestrade)
James Fox (Sir Thomas)

Desired output (correct):

Distribuţia
Robert Downey Jr. (Sherlock Holmes)
Jude Law (Dr. John Watson)
Rachel McAdams (Irene Adler)
Mark Strong (Lord Blackwood)
Kelly Reilly (Mary Morstan)
Eddie Marsan (Inspectorul Lestrade)
James Fox (Sir Thomas)

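If I use the "\p{Alpha}" property instead of "\w", the problem is only partly solved: the line breaks are correct, but the label shows "Distribuþia" instead of "Distribuţia", and the same could happen with other characters. The output looks like this (incorrect):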

Distribuþia
Robert Downey Jr. (Sherlock Holmes)
Jude Law (Dr. John Watson)
Rachel McAdams (Irene Adler)
Mark Strong (Lord Blackwood)
Kelly Reilly (Mary Morstan)
Eddie Marsan (Inspectorul Lestrade)
James Fox (Sir Thomas)
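
The difference between the two character classes can be reproduced in isolation. The following is a minimal sketch, assuming the string still holds the raw, undecoded byte 0xFE (the byte that displays as "þ" in Latin-1) and that the script uses neither "use locale" nor "use v5.12" or later. The variable names are illustrative only.

```perl
use strict;
use warnings;
# Deliberately no "use locale" and no "use v5.12" (or later): the latter
# enables the unicode_strings feature, which changes how \w treats this byte.

my $byte = "\xFE";    # a single undecoded byte; shown as "þ" in Latin-1

# Undecoded byte string, default rules: \w only knows ASCII word characters.
my $w_default = $byte =~ /\w/ ? 1 : 0;

# A \p{...} property makes the pattern use Unicode rules, so U+00FE is a letter.
my $p_alpha = $byte =~ /\p{Alpha}/ ? 1 : 0;

# The /u modifier (Perl 5.14+) requests Unicode rules explicitly.
my $w_unicode = $byte =~ /\w/u ? 1 : 0;

printf "\\w: %d, \\p{Alpha}: %d, \\w with /u: %d\n",
    $w_default, $p_alpha, $w_unicode;
```

Under these assumptions the plain "\w" test fails while the other two succeed. That matches the behaviour described above: "\p{Alpha}" fixes the line breaks, but the byte is still being read as the wrong character.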

Best answer

Text::Unidecode

>perl -MText::Unidecode -E"say unidecode qq{rom\x{00E2}n\x{0103}}"
romana
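
Text::Unidecode transliterates to ASCII, so the diacritics are lost. If they need to be kept, a different approach (not the one from the answer above) is to decode the input bytes into characters before matching and to encode to UTF-8 only on output. The sketch below assumes the page is served as ISO-8859-2. That is a guess based on the symptoms: in that encoding the bytes for ţ, ş and ă are 0xFE, 0xBA and 0xE3, which display as þ, º and ã when read as Latin-1. The sample string is shortened, and the pattern that separates the credits is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: raw bytes as fetched from an ISO-8859-2 page.
# Byte 0xFE is "ţ" in ISO-8859-2 but displays as "þ" if treated as Latin-1.
my $raw = "Distribu\xFEia: Robert Downey Jr. (Sherlock Holmes) "
        . "Jude Law (Dr. John Watson)";

# Decode bytes into Perl characters before doing any regex work.
my $line = decode('ISO-8859-2', $raw);

# On the decoded string, \w matches the letter ţ (U+0163).
my ($label, $cast) = $line =~ /^(\w+):\s*(.*)$/;

# Hypothetical pattern: one entry per "Name (Role)" pair.
my @credits = $cast =~ /([^()]+\([^()]*\))\s*/g;

# Encode back to UTF-8 only when writing the output.
print encode('UTF-8', "$label\n");
print encode('UTF-8', "$_\n") for @credits;
```

With this sample input the script prints "Distribuţia" on the first line, followed by one credit per line. If the site actually uses another encoding, the name passed to decode has to change accordingly.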

Regarding Perl, HTML data and UTF-8 encoded characters, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/7395905/
