
C#: Splitting a mixed-language string into language chunks

Reposted. Author: 行者123. Last updated: 2023-11-30 13:44:04

I am trying to solve a problem where the input is a string containing mixed languages.

E.g. "Hyundai Motor Company 현대자동차 现代 Some other English words"

I want to split the string into different language chunks.

E.g. ["Hyundai Motor Company", "현대자동차", "现代", "Some other English words"]

or (spaces, punctuation and order do not matter)

["HyundaiMotorCompany", "현대자동차", "现代", "SomeotherEnglishwords"]

Is there an easy way to solve this? Or is there an assembly or NuGet package I can use?

Thanks

Edit: I think my "language chunk" was ambiguous. What I mean by a "language chunk" is a language character set.

For example, "Hyundai Motor Company" is in the English character set, "현대자동차" in the Korean set, "现代" in the Chinese set, and "Some other English words" in the English set.

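Additional clarifications of my requirements:

1: The input can contain spaces or any other punctuation, but I can always use a regex to ignore them.

2: I will preprocess the input to ignore diacritics, so "å" becomes "a" in my input. All English-like characters will therefore become plain English characters.

What I really want is a way to parse the input into different language character sets, ignoring spaces and punctuation.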

E.g. From "HyundaiMotorCompany현대자동차现代SomeotherEnglishwords"

To ["HyundaiMotorCompany", "현대자동차", "现代", "SomeotherEnglishwords"]
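The preprocessing described in the question (dropping diacritics, then dropping spaces and punctuation) could be sketched as follows. This is only an illustration under assumptions: the class and method names are hypothetical, and the approach relies on Unicode normalization, which the question does not specify. Note that decomposing to form D also splits Hangul syllables into jamo, so the sketch recomposes to form C afterwards.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical helper, not part of the original question or answer.
public static class InputPreprocessor
{
    // Decompose so that "å" becomes "a" plus a combining mark, drop the
    // combining marks, then recompose (this restores Hangul syllables).
    public static string RemoveDiacritics(string input)
    {
        var decomposed = input.Normalize(NormalizationForm.FormD);
        var kept = decomposed
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray();
        return new string(kept).Normalize(NormalizationForm.FormC);
    }

    // Keep only letters and digits, discarding spaces and punctuation.
    public static string RemoveNonLetters(string input) =>
        new string(input.Where(c => char.IsLetterOrDigit(c)).ToArray());
}
```

For example, `InputPreprocessor.RemoveNonLetters(InputPreprocessor.RemoveDiacritics(text))` would be applied before any splitting.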

Best answer

Language chunks can be defined using UNICODE blocks. The current list of UNICODE blocks is available at ftp://www.unicode.org/Public/UNIDATA/Blocks.txt. Here is an excerpt from the list:

0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
0250..02AF; IPA Extensions
02B0..02FF; Spacing Modifier Letters
0300..036F; Combining Diacritical Marks
0370..03FF; Greek and Coptic
0400..04FF; Cyrillic
0500..052F; Cyrillic Supplement

The idea is to classify the characters using the UNICODE block. Consecutive characters belonging to the same UNICODE block define a language chunk.

First problem with this definition is that what you might consider a single script (or language) spans several blocks like Cyrillic and Cyrillic Supplement. To handle this you can merge blocks containing the same name so all Latin blocks are merged into a single Latin script etc.

However, this creates several new problems:

  1. Should the blocks Greek and Coptic, Coptic, and Greek Extended be merged into a single script, or should you try to make a distinction between Greek and Coptic script?
  2. You should probably merge all the CJK blocks. However, because these blocks contain both Chinese as well as Kanji (Japanese) and Hanja (Korean) characters you will not be able to distinguish between these scripts when CJK characters are used.

Assuming that you have a plan for how to use UNICODE blocks to classify characters into scripts you then have to decide how to handle spacing and punctuation. The space character and several forms of punctuation belong to the Basic Latin block. However, other blocks may also contain non-letter characters.

A strategy for dealing with this is to "ignore" the UNICODE block of non-letter characters but include them in chunks. In your example you have two non-Latin chunks that happen not to contain space or punctuation, but many scripts, e.g. Cyrillic, use space the same way the Latin script does. Even though a space is classified as Latin, you still want a sequence of Cyrillic words separated by spaces to be considered a single chunk in the Cyrillic script, instead of a Cyrillic word followed by a Latin space and then another Cyrillic word etc.

Finally, you need to decide how to handle numbers. You can treat them as space and punctuation or classify them as the block they belong to, e.g. Latin digits are Latin while Devanagari digits are Devanagari etc.

Here is some code putting all this together. First a class to represent a script (based on UNICODE blocks like "Greek and Coptic": 0x0370 - 0x03FF):

    public class Script
    {
        public Script(int from, int to, string name)
        {
            From = from;
            To = to;
            Name = name;
        }

        public int From { get; }
        public int To { get; }
        public string Name { get; }

        public bool Contains(char c) => From <= (int) c && (int) c <= To;
    }

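Next is a class for downloading and parsing the UNICODE blocks file. This code downloads the text in the constructor, which may not be ideal. Instead you could use a local copy of the file or something similar.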

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Net;
    using System.Text.RegularExpressions;

    public class Scripts
    {
        readonly List<Script> scripts;

        public Scripts()
        {
            using (var webClient = new WebClient())
            {
                const string url = "ftp://www.unicode.org/Public/UNIDATA/Blocks.txt";
                var blocks = webClient.DownloadString(url);
                // Only matches 4-digit ranges, so blocks above 0xFFFF are skipped.
                var regex = new Regex(@"^(?<from>[0-9A-F]{4})\.\.(?<to>[0-9A-F]{4}); (?<name>.+)$");
                scripts = blocks
                    .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
                    .Select(line => regex.Match(line))
                    .Where(match => match.Success)
                    .Select(match => new Script(
                        Convert.ToInt32(match.Groups["from"].Value, 16),
                        Convert.ToInt32(match.Groups["to"].Value, 16),
                        NormalizeName(match.Groups["name"].Value)))
                    .ToList();
            }
        }

        public string GetScript(char c)
        {
            if (!char.IsLetterOrDigit(c))
                // Use the empty string to signal space and punctuation.
                return string.Empty;
            // Linear search - can be improved by using binary search.
            foreach (var script in scripts)
                if (script.Contains(c))
                    return script.Name;
            return string.Empty;
        }

        // Add more special names if required.
        readonly string[] specialNames = new[] { "Latin", "Cyrillic", "Arabic", "CJK" };

        // Collapse every block whose name contains a special name into that name.
        string NormalizeName(string name) => specialNames.FirstOrDefault(sn => name.Contains(sn)) ?? name;
    }
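The comment in GetScript notes that the linear search can be replaced by a binary search. A minimal sketch of that idea is shown here, assuming the ranges are sorted by their start value and do not overlap (which is how Blocks.txt lists them). The BlockLookup class and the tuple shape are hypothetical and stand in for the Script list above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original answer.
public static class BlockLookup
{
    // Assumes ranges are sorted by From and do not overlap.
    public static string Find(IReadOnlyList<(int From, int To, string Name)> ranges, int codePoint)
    {
        int low = 0, high = ranges.Count - 1;
        while (low <= high)
        {
            var mid = low + (high - low) / 2;
            var range = ranges[mid];
            if (codePoint < range.From)
                high = mid - 1;      // Look in the lower half.
            else if (codePoint > range.To)
                low = mid + 1;       // Look in the upper half.
            else
                return range.Name;   // From <= codePoint <= To.
        }
        return string.Empty;         // Not inside any known range.
    }
}
```

Note that merging block names (as NormalizeName does) does not affect the lookup, because the ranges themselves stay separate and sorted.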

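Note that blocks above UNICODE code point 0xFFFF are ignored. If you have to work with these characters you will have to extend the code I have provided considerably, because it assumes that a UNICODE character is represented by a 16-bit value.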
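As a starting point for such an extension, one possible approach is to iterate over code points instead of 16-bit char values. The sketch below is an assumption about how that could look, not the answer's own code; the CodePoints class is hypothetical. The block regex would also need to accept ranges with up to six hex digits, and Script.Contains would need to take an int.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original answer.
public static class CodePoints
{
    // Yields each code point in text, combining a surrogate pair into one value.
    // An unpaired surrogate is yielded as its raw 16-bit value.
    public static IEnumerable<int> Enumerate(string text)
    {
        for (var i = 0; i < text.Length; i += 1)
        {
            if (char.IsSurrogatePair(text, i))
            {
                yield return char.ConvertToUtf32(text, i);
                i += 1; // Skip the low surrogate that was just consumed.
            }
            else
                yield return text[i];
        }
    }
}
```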

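The next task is to split a string into UNICODE blocks. It returns words consisting of a run of consecutive characters that belong to the same script (the second element of the tuple). The scripts variable is an instance of the Scripts class defined above.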

    public IEnumerable<(string text, string script)> SplitIntoWords(string text)
    {
        if (text.Length == 0)
            yield break;
        var script = scripts.GetScript(text[0]);
        var start = 0;
        // Visit every character, including the last one, so that a final
        // character in a different script starts its own word.
        for (var i = 1; i < text.Length; i += 1)
        {
            var nextScript = scripts.GetScript(text[i]);
            if (nextScript != script)
            {
                yield return (text.Substring(start, i - start), script);
                start = i;
                script = nextScript;
            }
        }
        yield return (text.Substring(start, text.Length - start), script);
    }

Executing SplitIntoWords on your text will return something like this:

Text       | Script
-----------+-----------------
Hyundai    | Latin
[space]    | [empty string]
Motor      | Latin
[space]    | [empty string]
Company    | Latin
[space]    | [empty string]
현대자동차 | Hangul Syllables
[space]    | [empty string]
现代       | CJK
...

Next step is to join consecutive words belonging to the same script ignoring space and punctuation:

    public IEnumerable<string> JoinWords(IEnumerable<(string text, string script)> words)
    {
        using (var enumerator = words.GetEnumerator())
        {
            if (!enumerator.MoveNext())
                yield break;
            var (text, script) = enumerator.Current;
            var stringBuilder = new StringBuilder(text);
            while (enumerator.MoveNext())
            {
                var (nextText, nextScript) = enumerator.Current;
                if (script == string.Empty)
                {
                    stringBuilder.Append(nextText);
                    script = nextScript;
                }
                else if (nextScript != string.Empty && nextScript != script)
                {
                    yield return stringBuilder.ToString();
                    stringBuilder = new StringBuilder(nextText);
                    script = nextScript;
                }
                else
                    stringBuilder.Append(nextText);
            }
            yield return stringBuilder.ToString();
        }
    }

This code attaches any space and punctuation to the preceding word, treating it as part of the same script.

Putting it all together:

var chunks = JoinWords(SplitIntoWords(text));

This will result in these chunks:

  • Hyundai Motor Company
  • 현대자동차
  • 现代
  • Some other English words

All chunks except the last one have a trailing space.
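If the trailing spaces are unwanted, one simple option is to trim each chunk afterwards. This is a small sketch with a hypothetical ChunkCleanup helper, assuming only leading and trailing whitespace should be removed while spaces inside a chunk are kept.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original answer.
public static class ChunkCleanup
{
    // Trims whitespace from both ends of each chunk and drops chunks that become empty.
    public static IEnumerable<string> TrimChunks(IEnumerable<string> chunks) =>
        chunks.Select(chunk => chunk.Trim()).Where(chunk => chunk.Length > 0);
}
```

It could be applied as `ChunkCleanup.TrimChunks(JoinWords(SplitIntoWords(text)))`.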

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/45619497/
