C# Split a string with mixed language into different language chunks


Reposted by: 行者123 · Updated: 2023-11-30 13:44:04


I am trying to solve a problem where I have a string containing mixed languages as input.

E.g. "Hyundai Motor Company 현대자동차 现代 Some other English words"

I want to split the string into different language chunks.

E.g. ["Hyundai Motor Company", "현대자동차", "现代", "Some other English words"]

or (spaces/punctuation and order do not matter)

["HyundaiMotorCompany", "현대자동차", "现代", "SomeotherEnglishwords"]

Is there an easy way to solve this? Or is there any assembly/NuGet package I can use?

Thanks

Edit: I think my "language chunks" is ambiguous. What I mean by "language chunks" is language character sets.

For example "Hyundai Motor Company" is in English character set, "현대자동차" in Korean set, "现代" in Chinese set, "Some other English words" in English set.

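Additional clarifications of my requirements:

1: The input can contain spaces or any other punctuation, but I can always use a regular expression to ignore them.

2: I will preprocess the input to ignore diacritics, so "å" becomes "a" in my input. All English-like characters will therefore become English characters.

What I really want is a way to parse the input into different language character sets, ignoring spaces and punctuation.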

E.g. From "HyundaiMotorCompany현대자동차现代SomeotherEnglishwords"

To ["HyundaiMotorCompany", "현대자동차", "现代", "SomeotherEnglishwords"]

Best answer

Language chunks can be defined using UNICODE blocks. The current list of UNICODE blocks is available at ftp://www.unicode.org/Public/UNIDATA/Blocks.txt. Here is an excerpt from the list:

0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
0250..02AF; IPA Extensions
02B0..02FF; Spacing Modifier Letters
0300..036F; Combining Diacritical Marks
0370..03FF; Greek and Coptic
0400..04FF; Cyrillic
0500..052F; Cyrillic Supplement

The idea is to classify the characters using the UNICODE block. Consecutive characters belonging to the same UNICODE block define a language chunk.

First problem with this definition is that what you might consider a single script (or language) spans several blocks like Cyrillic and Cyrillic Supplement. To handle this you can merge blocks containing the same name so all Latin blocks are merged into a single Latin script etc.

However, this creates several new problems:

Should the blocks Greek and Coptic, Coptic and Greek Extended be merged into a single script or should you try to make a distinction between Greek and Coptic script?
You should probably merge all the CJK blocks. However, because these blocks contain both Chinese as well as Kanji (Japanese) and Hanja (Korean) characters you will not be able to distinguish between these scripts when CJK characters are used.
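
For a quick experiment with block-based classification before building a full block table, .NET regular expressions can match named UNICODE blocks directly with \p{IsBlockName}. The sketch below is not part of the approach that follows. It is a swapped-in shortcut that assumes the input has already been stripped of spaces and punctuation (as described in the question) and that only three blocks matter. The class name BlockRegexSplitter is made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BlockRegexSplitter
{
    // Each alternative matches a run of characters from one named UNICODE block.
    // Assumption: only these three blocks are of interest; anything else is skipped.
    static readonly Regex Runs = new Regex(
        @"\p{IsBasicLatin}+|\p{IsHangulSyllables}+|\p{IsCJKUnifiedIdeographs}+");

    public static List<string> Split(string text) =>
        Runs.Matches(text).Cast<Match>().Select(m => m.Value).ToList();
}
```

Because Basic Latin also contains the space, digits and ASCII punctuation, this only gives clean results on preprocessed input. As noted above, a run of CJK characters still cannot be attributed to Chinese, Japanese or Korean.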

Assuming that you have a plan for how to use UNICODE blocks to classify characters into scripts you then have to decide how to handle spacing and punctuation. The space character and several forms of punctuation belong to the Basic Latin block. However, other blocks may also contain non-letter characters.

A strategy for dealing with this is to "ignore" the UNICODE block of non-letter characters but include them in chunks. In your example you have two non-Latin chunks that happen not to contain space or punctuation, but many scripts use space the way the Latin script does, e.g. Cyrillic. Even though a space is classified as Latin, you still want a sequence of Cyrillic words separated by spaces to be considered a single chunk in the Cyrillic script, instead of a Cyrillic word followed by a Latin space and then another Cyrillic word etc.

Finally, you need to decide how to handle numbers. You can treat them as space and punctuation or classify them as the block they belong to, e.g. Latin digits are Latin while Devanagari digits are Devanagari etc.

Here is some code putting all this together. First a class to represent a script (based on UNICODE blocks like "Greek and Coptic": 0x0370 - 0x03FF):

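// Using directives needed by all the snippets in this answer.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public class Script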
{
    public Script(int from, int to, string name)
    {
        From = from;
        To = to;
        Name = name;
    }

    public int From { get; }
    public int To { get; }
    public string Name { get; }

    public bool Contains(char c) => From <= (int) c && (int) c <= To;
}

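Next is a class for downloading and parsing the UNICODE block file. This code downloads the text in the constructor, which might not be ideal. Instead you can use a local copy of the file or something similar.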

public class Scripts
{
    readonly List<Script> scripts;

    public Scripts()
    {
        using (var webClient = new WebClient())
        {
            const string url = "ftp://www.unicode.org/Public/UNIDATA/Blocks.txt";
            var blocks = webClient.DownloadString(url);
            var regex = new Regex(@"^(?<from>[0-9A-F]{4})\.\.(?<to>[0-9A-F]{4}); (?<name>.+)$");
            scripts = blocks
                .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(line => regex.Match(line))
                .Where(match => match.Success)
                .Select(match => new Script(
                    Convert.ToInt32(match.Groups["from"].Value, 16),
                    Convert.ToInt32(match.Groups["to"].Value, 16),
                    NormalizeName(match.Groups["name"].Value)))
                .ToList();
        }
    }

    public string GetScript(char c)
    {
        if (!char.IsLetterOrDigit(c))
            // Use the empty string to signal space and punctuation.
            return string.Empty;
        // Linear search - can be improved by using binary search.
        foreach (var script in scripts)
            if (script.Contains(c))
                return script.Name;
        return string.Empty;
    }

    // Add more special names if required.
    readonly string[] specialNames = new[] { "Latin", "Cyrillic", "Arabic", "CJK" };

    string NormalizeName(string name) => specialNames.FirstOrDefault(sn => name.Contains(sn)) ?? name;
}
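
The two improvements hinted at above (reading a local copy instead of downloading, and replacing the linear search with a binary search) could look roughly like the following. This is only a sketch: the class name BlockTable and the method GetBlock are invented here, the constructor takes the file content as a string (for example from File.ReadAllText on a local copy of Blocks.txt), and name normalization and the letter/digit check from the Scripts class are left out to keep it short.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public sealed class BlockTable
{
    // Same line format as above; blocks beyond 0xFFFF do not match and are skipped.
    static readonly Regex Line =
        new Regex(@"^(?<from>[0-9A-F]{4})\.\.(?<to>[0-9A-F]{4}); (?<name>.+)$");

    readonly int[] starts;   // first code point of each block, ascending
    readonly int[] ends;     // last code point of each block (inclusive)
    readonly string[] names; // block names, same order as starts

    // blocksText: the content of a Blocks.txt-style file.
    public BlockTable(string blocksText)
    {
        var rows = blocksText
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => Line.Match(line))
            .Where(match => match.Success)
            .Select(match => (
                From: int.Parse(match.Groups["from"].Value, NumberStyles.HexNumber),
                To: int.Parse(match.Groups["to"].Value, NumberStyles.HexNumber),
                Name: match.Groups["name"].Value))
            .OrderBy(row => row.From)
            .ToList();
        starts = rows.Select(row => row.From).ToArray();
        ends = rows.Select(row => row.To).ToArray();
        names = rows.Select(row => row.Name).ToArray();
    }

    // Returns the block name, or the empty string if c is in no known block.
    public string GetBlock(char c)
    {
        var index = Array.BinarySearch(starts, (int)c);
        if (index < 0)
            index = ~index - 1; // last block that starts before c, or -1
        if (index < 0 || c > ends[index])
            return string.Empty;
        return names[index];
    }
}
```

Array.BinarySearch returns the bitwise complement of the insertion point when the value is not found, which is why ~index - 1 gives the candidate block. The final range check handles gaps between blocks.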

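Note that blocks above UNICODE code point 0xFFFF are ignored. If you have to work with these characters, you will have to expand considerably on the code I have provided, which assumes that a UNICODE character is represented by a 16-bit value.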
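
As a starting point for such an extension, the string would have to be walked by code point rather than by char, since characters above 0xFFFF are stored as surrogate pairs. The following is a minimal sketch of that first step only. The class name CodePoints is made up, and the rest of the code (the block regex, Script.Contains, the splitting loop) would still need to be adapted to work on int code points.

```csharp
using System.Collections.Generic;

public static class CodePoints
{
    // Yields every UNICODE code point in text, combining surrogate pairs
    // into a single value above 0xFFFF.
    public static IEnumerable<int> Enumerate(string text)
    {
        for (var i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                yield return char.ConvertToUtf32(text, i);
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                // BMP character (or an unpaired surrogate, passed through as-is).
                yield return text[i];
            }
        }
    }
}
```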

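The next task is to split a string into runs of UNICODE blocks. It returns words, each consisting of a string of consecutive characters that belong to the same script (the second element of the tuple). The scripts variable is an instance of the Scripts class defined above.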

public IEnumerable<(string text, string script)> SplitIntoWords(string text)
{
    if (text.Length == 0)
        yield break;
    var script = scripts.GetScript(text[0]);
    var start = 0;
    for (var i = 1; i < text.Length; i += 1)
    {
        var nextScript = scripts.GetScript(text[i]);
        if (nextScript != script)
        {
            yield return (text.Substring(start, i - start), script);
            start = i;
            script = nextScript;
        }
    }
    yield return (text.Substring(start, text.Length - start), script);
}

Executing SplitIntoWords on your text will return something like this:

Text      | Script
----------+----------------
Hyundai   | Latin
[space]   | [empty string]
Motor     | Latin
[space]   | [empty string]
Company   | Latin
[space]   | [empty string]
현대자동차 | Hangul Syllables
[space]   | [empty string]
现代      | CJK
...

Next step is to join consecutive words belonging to the same script ignoring space and punctuation:

public IEnumerable<string> JoinWords(IEnumerable<(string text, string script)> words)
{
    using (var enumerator = words.GetEnumerator())
    {
        if (!enumerator.MoveNext())
            yield break;
        var (text, script) = enumerator.Current;
        var stringBuilder = new StringBuilder(text);
        while (enumerator.MoveNext())
        {
            var (nextText, nextScript) = enumerator.Current;
            if (script == string.Empty)
            {
                stringBuilder.Append(nextText);
                script = nextScript;
            }
            else if (nextScript != string.Empty && nextScript != script)
            {
                yield return stringBuilder.ToString();
                stringBuilder = new StringBuilder(nextText);
                script = nextScript;
            }
            else
                stringBuilder.Append(nextText);
        }
        yield return stringBuilder.ToString();
    }
}

This code attaches any space and punctuation to the preceding word of the same script.

Putting it all together:

var chunks = JoinWords(SplitIntoWords(text));

This will result in these chunks:

Hyundai Motor Company
현대자동차
现代
Some other English words

All chunks except the last one have a trailing space.
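
If the trailing spaces are unwanted, or if spaces and punctuation should be dropped entirely as in the second form shown in the question, a small post-processing step is probably enough. The helper names below are invented for this sketch.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ChunkCleanup
{
    // Removes leading and trailing whitespace from every chunk.
    public static IEnumerable<string> TrimAll(IEnumerable<string> chunks) =>
        chunks.Select(chunk => chunk.Trim());

    // Keeps only letters and digits, so "Hyundai Motor Company "
    // becomes "HyundaiMotorCompany".
    public static IEnumerable<string> LettersAndDigitsOnly(IEnumerable<string> chunks) =>
        chunks.Select(chunk =>
            new string(chunk.Where(c => char.IsLetterOrDigit(c)).ToArray()));
}
```

For example, ChunkCleanup.TrimAll(JoinWords(SplitIntoWords(text))) would give the chunks without the trailing spaces.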

This post is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/45619497/
