unicode - How should this mixed string be split on Unicode word boundaries?

Reposted by: 行者123 · Updated: 2023-12-03 11:23:47

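Consider the string "abc를". According to Unicode's demo implementation of word segmentation, this string should be split into two words, "abc" and "를". However, three different Rust implementations of word boundary detection (regex, unic-segment, unicode-segmentation) all disagree and group the string into a single word. Which behavior is correct?
As a follow-up, if the grouping behavior is correct, what would be a good way to scan this string for the search term "abc" in a way that still mostly respects word boundaries? (The purpose is to check the validity of string translations.) I want to match something like "abc를" but not something like abcdef.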

Best answer

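I'm not so certain that the word segmentation demo should be taken as the ground truth, even if it is on an official site. For example, it considers the precomposed "abc를" ("abc\uB97C") to be two separate words, but considers the decomposed "abc를" ("abc\u1105\u1173\u11af") to be one, even though the former decomposes to the latter.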
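To make the "former decomposes to the latter" claim concrete, here is a minimal sketch of the Hangul syllable decomposition arithmetic from the Unicode Standard, using only the standard library. The function name `decompose_hangul` is hypothetical and chosen for illustration; it is not part of any of the crates mentioned in the question.

```rust
/// Decompose a precomposed Hangul syllable (U+AC00..=U+D7A3) into its
/// conjoining jamo. Returns `None` for any other character.
/// `decompose_hangul` is an illustrative name, not a library API.
fn decompose_hangul(syllable: char) -> Option<Vec<char>> {
    const S_BASE: u32 = 0xAC00; // first precomposed syllable
    const L_BASE: u32 = 0x1100; // first leading consonant
    const V_BASE: u32 = 0x1161; // first vowel
    const T_BASE: u32 = 0x11A7; // one before the first trailing consonant
    const L_COUNT: u32 = 19;
    const V_COUNT: u32 = 21;
    const T_COUNT: u32 = 28;
    const N_COUNT: u32 = V_COUNT * T_COUNT; // 588
    const S_COUNT: u32 = L_COUNT * N_COUNT; // 11172

    let s_index = (syllable as u32).checked_sub(S_BASE)?;
    if s_index >= S_COUNT {
        return None;
    }

    let l = L_BASE + s_index / N_COUNT;
    let v = V_BASE + (s_index % N_COUNT) / T_COUNT;
    let t = T_BASE + s_index % T_COUNT;

    let mut jamo = vec![char::from_u32(l)?, char::from_u32(v)?];
    // A trailing index of zero means the syllable has no final consonant.
    if t != T_BASE {
        jamo.push(char::from_u32(t)?);
    }
    Some(jamo)
}

fn main() {
    let jamo = decompose_hangul('\u{B97C}').expect("U+B97C is a Hangul syllable");
    let escaped: Vec<String> = jamo
        .iter()
        .map(|c| format!("U+{:04X}", *c as u32))
        .collect();
    println!("{}", escaped.join(" ")); // prints "U+1105 U+1173 U+11AF"
}
```

The two spellings are therefore canonically equivalent, which is why it is surprising for a segmenter to treat them differently.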
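The idea of a word boundary isn't set in stone. Unicode has a Word Boundary specification that outlines where word breaks should and should not occur. However, it also has an extensive notes section that elaborates on other cases (emphasis mine):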

It is not possible to provide a uniform set of rules that resolves all issues across languages or that handles all ambiguous situations within a given language. The goal for the specification presented in this annex is to provide a workable default; tailored implementations can be more sophisticated.

For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.

...

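My understanding is that the crates you listed follow the specification without any further contextual analysis. Why the demo disagrees I cannot say, but it may be an attempt to implement one of these edge cases.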

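To address your specific problem, I'd suggest using Regex with \b to match a word boundary. Unfortunately, \b follows the same Unicode rules and will not consider "를" to be a new word. However, this regex implementation offers an escape hatch to fall back to ASCII behavior. Simply use (?-u:\b) to match a non-Unicode boundary: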

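use regex::Regex;

fn main() {
    // (?-u:\b) is an ASCII-only word boundary: "를" does not count as a word
    // character, so a boundary is found between "abc" and "를".
    let pattern = Regex::new(r"(?-u:\b)abc(?-u:\b)").unwrap();
    // "abcdef" is skipped; the match is the "abc" in "abc를" (bytes 12..15).
    println!("{:?}", pattern.find("some abcdef abc를 sentence"));
}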

You can run it for yourself on the playground to test your cases and see if this works for you.
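If pulling in the regex crate is not an option, the same ASCII-boundary idea can be sketched with the standard library alone. This is only an approximation of (?-u:\b)abc(?-u:\b), and it assumes the search term consists entirely of ASCII word characters (letters, digits, underscore), as "abc" does. The names `is_ascii_word_byte` and `find_ascii_bounded` are hypothetical helpers, not part of any library.

```rust
/// True for the bytes an ASCII-only `\b` treats as word characters.
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Byte offset of the first occurrence of `term` in `text` whose neighbours
/// on both sides are not ASCII word characters.
/// Assumes `term` is non-empty and made only of ASCII word characters.
fn find_ascii_bounded(text: &str, term: &str) -> Option<usize> {
    if term.is_empty() {
        return None;
    }
    let bytes = text.as_bytes();
    text.match_indices(term)
        .map(|(start, _)| start)
        .find(|&start| {
            let end = start + term.len();
            // Every byte of a multi-byte UTF-8 character is >= 0x80, so
            // non-ASCII neighbours such as "를" count as non-word here.
            let before_ok = start == 0 || !is_ascii_word_byte(bytes[start - 1]);
            let after_ok = end == bytes.len() || !is_ascii_word_byte(bytes[end]);
            before_ok && after_ok
        })
}

fn main() {
    let text = "some abcdef abc를 sentence";
    println!("{:?}", find_ascii_bounded(text, "abc")); // prints "Some(12)"
}
```

As with the regex version, "abcdef" is rejected because "d" is an ASCII word character, while "abc를" is accepted because "를" is not.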

This post is based on a similar question found on Stack Overflow about splitting this mixed string on Unicode word boundaries: https://stackoverflow.com/questions/66081519/