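Strings are usually enumerated by character. However, particularly when working with Unicode and non-English languages, I sometimes need to enumerate a string by grapheme. That is, combining marks and diacritics should stay with the base character they modify. What is the best way to do this in .NET?
Use case: counting the distinct phonetic sounds in a series of IPA words.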
Best answer
Simplified scenario
TextElementEnumerator is very useful and efficient:
    private static List<SoundCount> CountSounds(IEnumerable<string> words)
    {
        Dictionary<string, SoundCount> soundCounts = new Dictionary<string, SoundCount>();
        foreach (var word in words)
        {
            TextElementEnumerator graphemeEnumerator = StringInfo.GetTextElementEnumerator(word);
            while (graphemeEnumerator.MoveNext())
            {
                string grapheme = graphemeEnumerator.GetTextElement();
                SoundCount count;
                if (!soundCounts.TryGetValue(grapheme, out count))
                {
                    count = new SoundCount() { Sound = grapheme };
                    soundCounts.Add(grapheme, count);
                }
                count.Count++;
            }
        }
        return new List<SoundCount>(soundCounts.Values);
    }
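To make the difference between characters and text elements concrete, here is a minimal sketch. The class name `GraphemeDemo` and the helper `SplitGraphemes` are hypothetical names introduced for illustration; only `StringInfo.GetTextElementEnumerator` and `TextElementEnumerator` come from the answer above.

```csharp
using System.Collections.Generic;
using System.Globalization;

public static class GraphemeDemo
{
    // Splits a string into text elements: each base character together
    // with the combining marks that follow it.
    public static List<string> SplitGraphemes(string text)
    {
        var result = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            result.Add(enumerator.GetTextElement());
        }
        return result;
    }
}
```

For the decomposed string `"e\u0301a\u0303"` (e + combining acute, a + combining tilde), `Length` reports 4 chars, while `GraphemeDemo.SplitGraphemes` returns 2 elements.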
The same enumeration can be written with a regular expression that matches one non-mark character followed by any number of combining marks:
    private static List<SoundCount> CountSoundsRegex(IEnumerable<string> words)
    {
        var soundCounts = new Dictionary<string, SoundCount>();
        var graphemeExpression = new Regex(@"\P{M}\p{M}*");
        foreach (var word in words)
        {
            Match graphemeMatch = graphemeExpression.Match(word);
            while (graphemeMatch.Success)
            {
                string grapheme = graphemeMatch.Value;
                SoundCount count;
                if (!soundCounts.TryGetValue(grapheme, out count))
                {
                    count = new SoundCount() { Sound = grapheme };
                    soundCounts.Add(grapheme, count);
                }
                count.Count++;
                graphemeMatch = graphemeMatch.NextMatch();
            }
        }
        return new List<SoundCount>(soundCounts.Values);
    }
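Both methods rely on a `SoundCount` type that the answer does not show. A minimal sketch of what it might look like, assuming only the two members the code above touches (`Sound` and `Count`); the actual definition may differ:

```csharp
// Assumed shape of SoundCount, inferred from how the methods above use it.
// It has to be a class (a reference type): the code calls count.Count++ on an
// instance that is already stored in the dictionary, and that update would be
// lost if SoundCount were a struct, because the dictionary would hold a copy.
public class SoundCount
{
    public string Sound { get; set; } = string.Empty;
    public int Count { get; set; }
}
```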
For IPA, tie bars join two characters into one sound and modifier letters attach to the preceding sound, so the pattern can be extended as follows (the # comments require RegexOptions.IgnorePatternWhitespace):
    [^\p{M}\p{Lm}]        # Match a character that is NOT a character intended to be combined with another character or a special character that is used like a letter
    (?:                   # Start a group for the combining characters:
      (?:                 # Start a group for tied characters:
        [\u035C\u0361]    # Match an under- or over- tie bar...
        \P{M}\p{M}*       # ...followed by another grapheme (in the simplified sense)
      )                   # (End the tied characters group)
      |\p{M}              # OR a character intended to be combined with another character
      |\p{Lm}             # OR a special character that is used like a letter
    )*                    # Match the combining characters group zero or more times.
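As a rough sketch of how such a pattern could be applied, the helper below compiles it with `RegexOptions.IgnorePatternWhitespace` and returns each match. The class name `IpaGraphemes` and the method `Split` are hypothetical, and the sketch assumes the first character class is written in negated form, `[^\p{M}\p{Lm}]`, so that it excludes both combining marks and modifier letters.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IpaGraphemes
{
    // Assumption: the leading class is negated so that a grapheme cannot
    // start with a combining mark (M) or a modifier letter (Lm).
    private static readonly Regex Pattern = new Regex(@"
        [^\p{M}\p{Lm}]        # base character
        (?:
            [\u035C\u0361]    # under- or over- tie bar...
            \P{M}\p{M}*       # ...plus the simple grapheme it ties to
          | \p{M}             # or a combining mark
          | \p{Lm}            # or a modifier letter
        )*",
        RegexOptions.IgnorePatternWhitespace);

    // Returns the IPA graphemes of a word in order of appearance.
    public static List<string> Split(string word)
    {
        var result = new List<string>();
        for (Match m = Pattern.Match(word); m.Success; m = m.NextMatch())
        {
            result.Add(m.Value);
        }
        return result;
    }
}
```

With the input `"t\u0361\u0283ak\u02B0"` (t, tie bar, ʃ, a, k, ʰ), `IpaGraphemes.Split` returns three items: the tied pair `t͡ʃ`, then `a`, then `kʰ`.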
Regarding ".net - Enumerate a string by grapheme instead of character", the original question is on Stack Overflow: https://stackoverflow.com/questions/2056866/