
c# - Function/regex to match parts of a string within a larger string and highlight those parts

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:07:24

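I am trying to build a function that takes a search string, matches parts of it within a larger string, and highlights them. See the example below:

Original string: Since there is limited overhead space on the plane, I assure you, there will be no fee for checking the bags, I can go ahead and fill out all the checked baggage forms for you.

Text to search and highlight: no fee, I fill out the forms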

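Desired result: Since there is limited overhead space on the plane, I assure you, there will be <b>no fee</b> for checking the bags, <b>I</b> can go ahead and <b>fill out</b> all <b>the</b> checked baggage <b>forms</b> for you.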

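I can search for the complete string, or search one word at a time using substrings, but neither produces the desired result. The trick may be to search recursively, starting with the full search string and gradually breaking it into smaller parts until a part matches. A couple of assumptions:

  • The search must be as greedy as possible, i.e. match larger parts of the search string first, before trying to match smaller parts or single words.
  • Once a match is found, the search always moves forward, i.e. if the first 2 words are found at position x, then words 3 and 4 will always be after x, never before it.

I hope this makes sense. Can anyone point me in the right direction? I searched the site but did not find anything similar to what I am looking for.

Thanks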

Best answer

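Let me know if this helps. It does not use Regex to find the strings, just IndexOf.

It first collects the words to highlight as Tuples, each holding the start index and end index of a word.

It then highlights the text by wrapping each word in a prefix and a suffix (here: HTML tags).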

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        var input = "Since there is limited overhead space on the plane, I assure you, there will be no fee for checking the bags, I can go ahead and fill out all the checked baggage forms for you";
        var searchExpression = "no fee, I fill out the forms";

        var highlightedInput = HighlightString(input, searchExpression, "<b>", "</b>");

        Console.WriteLine(highlightedInput);
        Console.ReadLine();
    }

    // Splits the search expression into words and locates each one in the input.
    public static IEnumerable<Tuple<int, int>> GetHighlights(string input, string searchExpression)
    {
        var splitIntoWordsRegex = new Regex(@"\W+");
        // Split can yield empty entries (e.g. for leading or trailing punctuation); drop them.
        var words = splitIntoWordsRegex.Split(searchExpression).Where(w => w.Length > 0);
        return GetHighlights(input, words);
    }

    // Returns one (start, end) pair per word; end is exclusive.
    public static IEnumerable<Tuple<int, int>> GetHighlights(string input, IEnumerable<string> searchExpression)
    {
        var highlights = new List<Tuple<int, int>>();

        var lastMatchedIndex = 0;
        foreach (var word in searchExpression)
        {
            // Only search forward from the end of the previous match.
            var indexOfWord = input.IndexOf(word, lastMatchedIndex, StringComparison.CurrentCulture);
            if (indexOfWord < 0)
                continue; // not found: skip the word rather than recording an index of -1

            var lastIndexOfWord = indexOfWord + word.Length;

            highlights.Add(new Tuple<int, int>(indexOfWord, lastIndexOfWord));

            lastMatchedIndex = lastIndexOfWord;
        }

        return highlights;
    }

    public static string HighlightString(string input, string searchExpression, string highlightPrefix, string highlightSuffix)
    {
        var highlights = GetHighlights(input, searchExpression).ToList();

        var output = input;
        for (int i = 0, j = highlights.Count; i < j; i++)
        {
            // Earlier insertions shift later positions by the number of characters added so far.
            int diffInputOutput = output.Length - input.Length;
            output = output.Insert(highlights[i].Item1 + diffInputOutput, highlightPrefix);

            diffInputOutput = output.Length - input.Length;
            output = output.Insert(highlights[i].Item2 + diffInputOutput, highlightSuffix);
        }

        return output;
    }
}
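
The word-by-word IndexOf approach does not attempt the question's first assumption (match the longest run of search words before falling back to shorter runs). A minimal sketch of that greedy, forward-only idea follows. GreedyHighlighter, FindRanges and Highlight are hypothetical names, not part of the answer, and the sketch assumes whole-word, case-insensitive matching and silently skips words that never match.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating greedy, forward-only phrase matching.
public static class GreedyHighlighter
{
    // Returns (start, end) ranges in input order; end is exclusive.
    public static List<Tuple<int, int>> FindRanges(string input, string search)
    {
        var words = Regex.Matches(search, @"\w+").Cast<Match>().Select(m => m.Value).ToList();
        var ranges = new List<Tuple<int, int>>();
        int cursor = 0; // a match may only start at or after this offset
        int next = 0;   // index of the first search word not yet handled

        while (next < words.Count)
        {
            bool matched = false;
            // Try the longest remaining run of words first, then shorter runs.
            for (int len = words.Count - next; len >= 1 && !matched; len--)
            {
                // Words inside a run may be separated by any non-word characters.
                var run = words.Skip(next).Take(len).Select(w => Regex.Escape(w));
                string pattern = @"\b" + string.Join(@"\W+", run) + @"\b";
                Match m = new Regex(pattern, RegexOptions.IgnoreCase).Match(input, cursor);
                if (m.Success)
                {
                    ranges.Add(Tuple.Create(m.Index, m.Index + m.Length));
                    cursor = m.Index + m.Length;
                    next += len;
                    matched = true;
                }
            }
            if (!matched)
                next++; // assumption: a word that never matches is skipped
        }
        return ranges;
    }

    // Wraps every range in prefix/suffix while copying the input once.
    public static string Highlight(string input, string search, string prefix, string suffix)
    {
        var sb = new StringBuilder();
        int pos = 0;
        foreach (var r in FindRanges(input, search))
        {
            sb.Append(input, pos, r.Item1 - pos).Append(prefix)
              .Append(input, r.Item1, r.Item2 - r.Item1).Append(suffix);
            pos = r.Item2;
        }
        return sb.Append(input, pos, input.Length - pos).ToString();
    }
}
```

Calling GreedyHighlighter.Highlight(input, "no fee, I fill out the forms", "<b>", "</b>") on the sample input wraps "no fee" and "fill out" as single phrases and "I", "the" and "forms" individually. Each step takes the first occurrence after the previous match, so it does not try to keep the highlights close together.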

================== Edit ======================

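To narrow the range between the first and last highlight indexes, you can use the code below. It is not the prettiest, but it does the job.

It gets all the available indexes of each word (thanks to "Finding ALL positions of a substring in a large string in C#"), adds them to highlights, and then manipulates this collection to keep only the matches closest to what you need.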

    public static IEnumerable<Tuple<int, int>> GetHighlights(string input, IEnumerable<string> searchExpression)
    {
        // Each entry is (word, start index, end index); end is exclusive.
        var highlights = new List<Tuple<string, int, int>>();

        // Finds all the indexes for
        // all the words found.
        foreach (var word in searchExpression)
        {
            var allIndexesOfWord = AllIndexesOf(input, word, StringComparison.InvariantCultureIgnoreCase);
            highlights.AddRange(allIndexesOfWord.Select(index => new Tuple<string, int, int>(word, index, index + word.Length)));
        }

        // Reduce the scope of the highlights in order to
        // keep the indexes as together as possible.
        var firstWord = searchExpression.First();
        var firstWordIndex = highlights.IndexOf(highlights.Last(x => String.Equals(x.Item1, firstWord)));

        var lastWord = searchExpression.Last();
        var lastWordIndex = highlights.IndexOf(highlights.Last(x => String.Equals(x.Item1, lastWord)));

        var sanitizedHighlights = highlights.SkipWhile((x, i) => i < firstWordIndex);
        sanitizedHighlights = sanitizedHighlights.TakeWhile((x, i) => i <= lastWordIndex);

        // Walk the words from last to first, keeping for each word its last
        // occurrence that ends before the occurrence chosen for the following word.
        // Note: Last() throws if a word has no remaining occurrence.
        highlights = new List<Tuple<string, int, int>>();
        foreach (var word in searchExpression.Reverse())
        {
            var lastOccurrence = sanitizedHighlights.Last((x) => String.Equals(x.Item1, word));
            sanitizedHighlights = sanitizedHighlights.TakeWhile(x => x.Item3 < lastOccurrence.Item2);
            highlights.Add(lastOccurrence);
        }

        // Restore ascending order so the highlights can be inserted left to right.
        highlights.Reverse();

        return highlights.Select(x => new Tuple<int, int>(x.Item2, x.Item3));
    }

    // Returns the start index of every non-overlapping occurrence of value in str.
    public static List<int> AllIndexesOf(string str, string value, StringComparison comparison)
    {
        if (String.IsNullOrEmpty(value))
            throw new ArgumentException("the string to find may not be empty", "value");

        List<int> indexes = new List<int>();
        for (int index = 0; ; index += value.Length)
        {
            index = str.IndexOf(value, index, comparison);
            if (index == -1)
                return indexes;
            indexes.Add(index);
        }
    }

Using this code with the text:

"No, about the fee, since there is limited overhead space on the plane, I assure you, there will be no fee for checking the bags, I can go ahead and fill out all the checked baggage forms for you."

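I got the following result:

No, about the fee, since there is limited overhead space on the plane, I assure you, there will be <b>no</b> <b>fee</b> for checking the bags, <b>I</b> can go ahead and <b>fill</b> <b>out</b> all <b>the</b> checked baggage <b>forms</b> for you.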

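================== Edit 2 ======================

This version uses a Regex approach, building on what was learned from the earlier attempts.
Note that if any word of the search expression is not found, no highlights are returned at all.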

    public static IEnumerable<Tuple<int, int>> GetHighlights(string expression, string search)
    {
        // Each entry is (token, start index, end index); end is exclusive.
        var highlights = new List<Tuple<string, int, int>>();

        // Tokens are either runs of word characters or runs of non-whitespace (punctuation).
        var wordsToHighlight = new Regex(@"(\w+|[^\s]+)").
            Matches(search).
            Cast<Match>().
            Select(x => x.Value);

        foreach (var wordToHighlight in wordsToHighlight)
        {
            // Escape the token so punctuation such as "." or "?" is matched literally.
            var escaped = Regex.Escape(wordToHighlight);

            Regex findMatchRegex = null;
            if (new Regex(@"\W").IsMatch(wordToHighlight))
                findMatchRegex = new Regex(String.Format(@"({0})", escaped), RegexOptions.IgnoreCase); // is punctuation
            else
                findMatchRegex = new Regex(String.Format(@"((?<!\w){0}(?!\w))", escaped), RegexOptions.IgnoreCase); // is word

            var matches = findMatchRegex.Matches(expression).Cast<Match>().Select(match => new Tuple<string, int, int>(wordToHighlight, match.Index, match.Index + wordToHighlight.Length));

            if (matches.Any())
                highlights.AddRange(matches);
            else
                return new List<Tuple<int, int>>(); // a token is missing: no highlights at all
        }

        // Reduce the scope of the highlights in order to
        // keep the indexes as together as possible.
        var firstWord = wordsToHighlight.First();
        var firstWordIndex = highlights.IndexOf(highlights.Last(x => String.Equals(x.Item1, firstWord)));

        var lastWord = wordsToHighlight.Last();
        var lastWordIndex = highlights.IndexOf(highlights.Last(x => String.Equals(x.Item1, lastWord)));

        var sanitizedHighlights = highlights.SkipWhile((x, i) => i < firstWordIndex);
        sanitizedHighlights = sanitizedHighlights.TakeWhile((x, i) => i <= lastWordIndex);

        // Walk the tokens from last to first, keeping for each token its last
        // occurrence that ends before the occurrence chosen for the following token.
        highlights = new List<Tuple<string, int, int>>();
        foreach (var word in wordsToHighlight.Reverse())
        {
            var lastOccurrence = sanitizedHighlights.Last((x) => String.Equals(x.Item1, word));
            sanitizedHighlights = sanitizedHighlights.TakeWhile(x => x.Item3 < lastOccurrence.Item2);
            highlights.Add(lastOccurrence);
        }

        // Restore ascending order so the highlights can be inserted left to right.
        highlights.Reverse();

        return highlights.Select(x => new Tuple<int, int>(x.Item2, x.Item3));
    }

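Note also that this approach now handles punctuation. It produces the following results.

Input:
No, about the fee, since there is limited overhead space on the plane, I assure you, there will be no fee for checking the bags, I can go ahead and fill out all the checked baggage forms for you.

Search:
no fee, I fill out the forms

Output:
No, about the fee, since there is limited overhead space on the plane, I assure you, there will be <b>no</b> <b>fee</b> for checking the bags<b>,</b> <b>I</b> can go ahead and <b>fill</b> <b>out</b> all <b>the</b> checked baggage <b>forms</b> for you.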

Input:
When First Class Glass receives your call, we will assign a repair person to visit you to assist.

Search:
we assign a repair person

Output:
When First Class Glass receives your call, <b>we</b> will <b>assign</b> <b>a</b> <b>repair</b> <b>person</b> to visit you to assist.
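
Because GetHighlights returns one range per word, consecutive words such as "no fee" are wrapped separately. One possible post-processing step, sketched here under the assumption that the ranges are sorted and non-overlapping, merges ranges that are separated only by whitespace. HighlightMerger and MergeAdjacent are hypothetical names, not part of the answer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: merges highlight ranges separated only by whitespace.
public static class HighlightMerger
{
    // ranges: (start, end) pairs with exclusive end, assumed sorted and non-overlapping.
    public static List<Tuple<int, int>> MergeAdjacent(string input, IEnumerable<Tuple<int, int>> ranges)
    {
        var merged = new List<Tuple<int, int>>();
        foreach (var r in ranges)
        {
            if (merged.Count > 0)
            {
                var last = merged[merged.Count - 1];
                // Text between the end of the previous range and the start of this one.
                string gap = input.Substring(last.Item2, r.Item1 - last.Item2);
                if (string.IsNullOrWhiteSpace(gap))
                {
                    // Extend the previous range instead of starting a new one.
                    merged[merged.Count - 1] = Tuple.Create(last.Item1, r.Item2);
                    continue;
                }
            }
            merged.Add(r);
        }
        return merged;
    }
}
```

Inside HighlightString, the call could become HighlightMerger.MergeAdjacent(input, GetHighlights(input, searchExpression)), so that a single prefix and suffix wrap each run of neighbouring words.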

This post is based on a similar question found on Stack Overflow, "c# - Function/regex to match parts of a string within a larger string and highlight those parts": https://stackoverflow.com/questions/42725934/
