
perl - Different Perls handle vertical tabs differently

Reposted. Author: 行者123. Updated: 2023-12-04 12:25:07

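I have two Perl programs that use the same library to process documents. They are installed on two different servers, one running Perl 5.12 and the other running Perl 5.18.

I now feed the same files to both as input so that I can compare the outputs and make sure they match. So far I have had hundreds of identical results. They usually process UTF-8 files, and I have taken care to handle that encoding correctly.

Today they both received a binary file, and for the first time I saw a difference. I determined that one program (the one running Perl 5.18) strips vertical tabs from the file contents before writing the output, while the other does not.

I could write this off as binary files not being supported, but the difference still bothers me. I looked at the library that does the processing, and it contains this line (every line of the file is processed this way):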

$line =~ s/\s//g;

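Is it possible that one of the Perls considers the vertical tab to be whitespace and the other does not? How would I check? Is there anything else you think I should investigate?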

Best answer

As of 5.18, vertical tabs are considered whitespace.

No one could recall why \s didn't match \cK, the vertical tab. Now it does. Given the extreme rarity of that character, very little breakage is expected. That said, here's what it means:

\s in a regex now matches a vertical tab in all circumstances.

Literal vertical tabs in a regex literal are ignored when the /x modifier is used.

Leading vertical tabs, alone or mixed with other whitespace, are now ignored when interpreting a string as a number. For example:

$dec = " \cK \t 123";
$hex = " \cK \t 0xF";
say 0 + $dec; # was 0 with warning, now 123
say int $dec; # was 0, now 123
say oct $hex; # was 0, now 15

This brings Perl in line with Unicode, which considers U+000B LINE TABULATION (aka VERTICAL TABULATION, aka VT) to be a White_Space character.
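
To answer the "how would I check?" part, a small script run on each server should show the difference directly. This is only a sketch: the helper name `vt_is_whitespace` is made up for illustration, and the script simply reports whether `\s` matches U+000B on whichever perl executes it.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if \s matches a vertical tab (U+000B)
# on the perl that is running this script, 0 otherwise.
sub vt_is_whitespace {
    return "\x0B" =~ /\s/ ? 1 : 0;
}

# $^V is the running interpreter's version; %vd prints it as e.g. 5.18.0.
printf "perl %vd: \\s %s the vertical tab\n",
    $^V, vt_is_whitespace() ? "matches" : "does not match";
```

Based on the change described above, the 5.12 server would be expected to report "does not match" and the 5.18 server "matches".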


You can restore the old behaviour by replacing \s with [^\S\x0B].
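
Applied to the substitution from the question, that replacement might look like the sketch below. The wrapper `strip_legacy_whitespace` and the sample string are assumptions for illustration, not part of the library in question. The class `[^\S\x0B]` reads as "whitespace, but not the vertical tab", so it should behave the same on both perls.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the library's substitution: removes every
# character that \s matches except the vertical tab (U+000B), which is
# what \s did before 5.18.
sub strip_legacy_whitespace {
    my ($line) = @_;
    $line =~ s/[^\S\x0B]//g;
    return $line;
}

my $out = strip_legacy_whitespace("a \t b\x0Bc\n");

# Show the surviving characters as hex code points.
print join(' ', map { sprintf '%02X', ord } split //, $out), "\n";
# prints: 61 62 0B 63
```

The space, tab and newline are removed while the vertical tab (0B) survives, regardless of which perl runs the code.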

Also worth considering is \h, which matches only horizontal whitespace characters.

U+0009 CHARACTER TABULATION Matched by \s & \h
U+000A LINE FEED Matched by \s & \v
U+000B LINE TABULATION Matched by \s & \v
U+000C FORM FEED Matched by \s & \v
U+000D CARRIAGE RETURN Matched by \s & \v
U+0020 SPACE Matched by \s & \h
U+0085 NEXT LINE Matched by \s & \v
U+00A0 NO-BREAK SPACE Matched by \s & \h
U+1680 OGHAM SPACE MARK Matched by \s & \h
U+2000 EN QUAD Matched by \s & \h
U+2001 EM QUAD Matched by \s & \h
U+2002 EN SPACE Matched by \s & \h
U+2003 EM SPACE Matched by \s & \h
U+2004 THREE-PER-EM SPACE Matched by \s & \h
U+2005 FOUR-PER-EM SPACE Matched by \s & \h
U+2006 SIX-PER-EM SPACE Matched by \s & \h
U+2007 FIGURE SPACE Matched by \s & \h
U+2008 PUNCTUATION SPACE Matched by \s & \h
U+2009 THIN SPACE Matched by \s & \h
U+200A HAIR SPACE Matched by \s & \h
U+2028 LINE SEPARATOR Matched by \s & \v
U+2029 PARAGRAPH SEPARATOR Matched by \s & \v
U+202F NARROW NO-BREAK SPACE Matched by \s & \h
U+205F MEDIUM MATHEMATICAL SPACE Matched by \s & \h
U+3000 IDEOGRAPHIC SPACE Matched by \s & \h
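
A few rows of the table can be spot-checked with a short script. This is a sketch only: `classify` is a hypothetical helper, and the code points chosen are a small sample (U+0085 and U+00A0 are left out deliberately, since how `\s` treats them can depend on whether Unicode rules are in effect for the string).

```perl
use strict;
use warnings;

# Hypothetical helper: lists which of \s, \h and \v match one character.
sub classify {
    my ($char) = @_;
    my @classes;
    push @classes, '\s' if $char =~ /\s/;
    push @classes, '\h' if $char =~ /\h/;
    push @classes, '\v' if $char =~ /\v/;
    return join ' & ', @classes;
}

# Sample a few code points from the table above.
for my $cp (0x09, 0x0A, 0x0B, 0x20, 0x2028, 0x3000) {
    printf "U+%04X matched by %s\n", $cp, classify(chr $cp);
}
```

On 5.18 or later the U+000B line should read `\s & \v`; on an older perl it would show only `\v`, which is the difference the question ran into.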

For "perl - Different Perls handle vertical tabs differently", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49164191/
