
c++ - Is std::regex always locale-aware?

Reposted. Author: 搜寻专家. Updated: 2023-10-31 02:07:54

According to the std::basic_regex reference, one of the flags for the constructor of a std::regex is collate, which specifies that:

Character ranges of the form "[a-b]" will be locale sensitive.

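This indicates, to me, that std::regex is not, by default, (entirely) locale-aware. I can't find anything stating that it is explicitly locale-aware, but we do have std::regex_traits, which suggests that there is some locale awareness going on.

To what extent is std::regex locale-aware? Is it possible to read a UTF-8 string, store it in a plain std::string, and just use regex classes such as [:w:] and [:punct:]? Specifically, [:w:] might be a problem. [:punct:] is not important.

This is for a C++ library that must run on MacOS (which has a UTF-8 locale) and on Windows (which, as far as I know, does not).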

Best answer

one of the flags for the constructor of a std::regex is collate, which specifies that:

Character ranges of the form "[a-b]" will be locale sensitive.

For a comprehensive explanation, see Regexp Ranges and Locales: A Long Sad Story:

However, the standard changed the interpretation of range expressions. In the "C" and "POSIX" locales, a range expression like ‘[a-dx-z]’ is still equivalent to ‘[abcdxyz]’, as in ASCII. But outside those locales, the ordering was defined to be based on collation order.

What does that mean? In many locales, ‘A’ and ‘a’ are both less than ‘B’. In other words, these locales sort characters in dictionary order, and ‘[a-dx-z]’ is typically not equivalent to ‘[abcdxyz]’; instead, it might be equivalent to ‘[ABCXYabcdxyz]’, for example.

This point needs to be emphasized: much literature teaches that you should use ‘[a-z]’ to match a lowercase character. But on systems with non-ASCII locales, this also matches all of the uppercase characters except ‘A’ or ‘Z’! This was a continuous cause of confusion, even well into the twenty-first century.
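As a minimal sketch of how the collate flag is requested, the two helpers below compile the same range with and without it. The helper names are hypothetical and not part of the original answer. Both regexes pick up whatever global locale is active when they are constructed, so what the collated version matches depends on that locale and on the standard library implementation.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: the range [a-d] with the default ECMAScript flags,
// where the range is compared by character code.
bool in_range_plain(const std::string& s) {
    return std::regex_match(s, std::regex("[a-d]"));
}

// Hypothetical helper: the same range with std::regex::collate added, which
// asks for the range to be evaluated using the locale's collation order.
bool in_range_collated(const std::string& s) {
    return std::regex_match(
        s, std::regex("[a-d]", std::regex::ECMAScript | std::regex::collate));
}
```

In the default "C" locale the two helpers should agree on ASCII input; differences, if any, only show up once a locale with dictionary-style collation is installed as the global locale.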


This indicates, to me, that std::regex is not, by default, (entirely) locale-aware.

Not exactly.

The Modified ECMAScript regular expression grammar says:

Character classes

...

The exact meaning of each of these character class escapes in C++ is defined in terms of the locale-dependent named character classes, and not by explicitly listing the acceptable characters as in ECMAScript.

In other words, it uses the current global locale for character classes such as [:alpha:].
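A short sketch of what that locale dependence looks like in code. The helper name is hypothetical. It relies on std::basic_regex::imbue to choose the locale explicitly instead of taking the global one, and it writes the class inside a bracket expression, [[:alpha:]], since named classes are only recognized there.

```cpp
#include <locale>
#include <regex>
#include <string>

// Hypothetical helper (not from the original answer): reports whether `text`
// consists only of characters that `loc` classifies as alphabetic.
bool is_all_alpha(const std::string& text, const std::locale& loc) {
    std::regex re;
    re.imbue(loc);              // set the locale before compiling the pattern
    re.assign("[[:alpha:]]+");  // [:alpha:] is resolved through `loc`
    return std::regex_match(text, re);
}
```

Called with std::locale::classic(), this accepts plain ASCII letters and rejects digits. Passing a different locale changes which single char values count as alphabetic, but a multi-byte UTF-8 sequence is still seen as several separate char values.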


Is it possible to read a UTF-8 string and store it in a plain std::string and just use regex classes such as [:w:] and [:punct:]? Specifically, [:w:] might be a problem. [:punct:] is not important.

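std::string does not know what encoding its contents are in; they may be UTF-8 or any other encoding.

You need to decode the std::string into a std::wstring and then use std::wregex. One way to do the decoding is with the facilities provided by std::codecvt_utf8.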
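A minimal sketch of that decoding step, assuming the input really is valid UTF-8. The helper names are hypothetical. Note that std::wstring_convert and std::codecvt_utf8 are deprecated since C++17, so a compiler may warn about them, and on Windows wchar_t is 16 bits wide, so characters outside the Basic Multilingual Plane are not covered by this approach.

```cpp
#include <codecvt>
#include <locale>
#include <regex>
#include <string>

// Hypothetical helper: decodes a UTF-8 byte string into wide characters.
// from_bytes throws std::range_error if the input is not valid UTF-8.
std::wstring utf8_to_wide(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.from_bytes(utf8);
}

// Hypothetical helper: decodes `utf8` and matches it against a wide pattern,
// so one pattern element corresponds to one decoded character, not one byte.
bool wide_matches(const std::string& utf8, const std::wstring& pattern) {
    const std::wstring wide = utf8_to_wide(utf8);
    return std::regex_match(wide, std::wregex(pattern));
}
```

For example, the UTF-8 string "h\xC3\xA9llo" decodes to five wide characters, so the wide pattern L"h.llo" matches it, whereas the same pattern applied to the raw std::string does not, because the accented letter occupies two bytes there. Whether classes such as \w or [[:alpha:]] accept the decoded non-ASCII characters still depends on the locale in use.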

Regarding "c++ - Is std::regex always locale-aware?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48222974/
