c# - Function/regex to match parts of a string within a larger string and highlight those parts


Reposted by: 太空宇宙 · Updated: 2023-11-03 15:07:24

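I'm trying to build a function that takes a search string, matches parts of it within a larger string, and highlights them. See the example below:

Original string: Since there is limited overhead space on the plane, I assure you, there will be no fee for checking the bags, I can go ahead and fill out all the checked baggage forms for you.

Text to search for and highlight: no fee, I fill out the forms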

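Desired result: Since there is limited overhead space on the plane, I assure you, there will be no fee for checking the bags, I can go ahead and fill out all the checked baggage forms for you. (with the parts that match the search text highlighted)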

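I can search for the complete string, or search one word at a time using substrings, but neither produces the desired result. The trick may be to search recursively somehow, starting with the full string and gradually breaking it into smaller parts until the parts match. A few assumptions:

The search must be as greedy as possible, i.e. match larger parts of the string first before trying to match smaller parts or single words.
Once any match is found, the search always moves forward, i.e. if the first 2 words are found at position x, then words 3 and 4 will always be after x, never before it.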
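
Read literally, the two assumptions above could be sketched as follows. This is only an illustration, not code from the answer: the class and method names (GreedyMatcher, FindGreedy) are hypothetical, the tokenizer pattern is an assumption, and matching is a case-sensitive substring search, so "the" would also match inside "there".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class GreedyMatcher
{
    // Hypothetical helper: returns (start, end) index pairs, end exclusive.
    public static List<Tuple<int, int>> FindGreedy(string input, string search)
    {
        // Assumption: tokens are runs of word characters or runs of punctuation.
        var tokens = Regex.Matches(search, @"\w+|[^\w\s]+")
                          .Cast<Match>()
                          .Select(m => m.Value)
                          .ToArray();

        var result = new List<Tuple<int, int>>();
        int cursor = 0; // the search only ever moves forward (assumption 2)
        int t = 0;      // index of the next unmatched token

        while (t < tokens.Length)
        {
            bool matched = false;

            // Try the longest remaining run of tokens first (assumption 1).
            for (int len = tokens.Length - t; len >= 1 && !matched; len--)
            {
                string phrase = string.Join(" ", tokens, t, len);
                int at = input.IndexOf(phrase, cursor, StringComparison.Ordinal);
                if (at < 0)
                    continue;

                result.Add(Tuple.Create(at, at + phrase.Length));
                cursor = at + phrase.Length;
                t += len;
                matched = true;
            }

            // No run starting at this token occurs after the cursor: skip it.
            if (!matched)
                t++;
        }

        return result;
    }
}
```

For the example strings above, this sketch yields the parts "no fee", ", I", "fill out", "the" and "forms", in that order. Tokens are joined with a single space, so a phrase only matches where the input uses the same spacing.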

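Hope this makes sense. Can anyone point me in the right direction? I searched the site but couldn't find anything similar to what I'm looking for.

Thanks

Best answer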

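Let me know if this helps. It doesn't use Regex to find the strings, just IndexOf.

It first collects the words to highlight as Tuples holding each word's start index and end index.

It then highlights the text by wrapping each word in a prefix and a suffix (here: HTML tags).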

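using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static void Main(string[] args)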
{
    var input = "Since there is limited overhead space on the plane, I assure you, there will be no fee for checking the bags, I can go ahead and fill out all the checked baggage forms for you";
    var searchExpression = "no fee, I fill out the forms";

    var highlightedInput = HighlightString(input, searchExpression, "<b>", "</b>");

    Console.WriteLine(highlightedInput);
    Console.ReadLine();
}

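public static IEnumerable<Tuple<int, int>> GetHighlights(string input, string searchExpression)
{
    // Split on runs of non-word characters; drop the empty entries that
    // Split produces when the expression starts or ends with punctuation.
    var splitIntoWordsRegex = new Regex(@"\W+");
    var words = splitIntoWordsRegex.Split(searchExpression).Where(word => word.Length > 0);
    return GetHighlights(input, words);
}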

public static IEnumerable<Tuple<int, int>> GetHighlights(string input, IEnumerable<string> searchExpression)
{
    var highlights = new List<Tuple<int, int>>();

    var lastMatchedIndex = 0;
    foreach (var word in searchExpression)
    {
        var indexOfWord = input.IndexOf(word, lastMatchedIndex, StringComparison.CurrentCulture);

        // IndexOf returns -1 when the word does not occur after the last match;
        // skip it rather than record an invalid range.
        if (indexOfWord < 0)
            continue;

        var lastIndexOfWord = indexOfWord + word.Length;

        highlights.Add(new Tuple<int, int>(indexOfWord, lastIndexOfWord));

        lastMatchedIndex = lastIndexOfWord;
    }

    return highlights;
}

public static string HighlightString(string input, string searchExpression, string highlightPrefix, string highlightSuffix)
{
    var highlights = GetHighlights(input, searchExpression).ToList();

    var output = input;
    for (int i = 0, j = highlights.Count; i < j; i++)
    {
        // Every insertion shifts the rest of the text, so offset the
        // original indexes by the number of characters added so far.
        int diffInputOutput = output.Length - input.Length;
        output = output.Insert(highlights[i].Item1 + diffInputOutput, highlightPrefix);

        diffInputOutput = output.Length - input.Length;
        output = output.Insert(highlights[i].Item2 + diffInputOutput, highlightSuffix);
    }

    return output;
}
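
As an aside, the offset bookkeeping when inserting the prefix and suffix can be avoided by inserting from the last range to the first, so that earlier indexes stay valid. The following is a minimal sketch of that alternative, not part of the original answer; the names HighlightWriter and Wrap are hypothetical, and it assumes the ranges do not overlap.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class HighlightWriter
{
    // Hypothetical helper: wraps each (start, end) range of input in prefix/suffix.
    // Assumes non-overlapping ranges with an exclusive end index.
    public static string Wrap(string input, IEnumerable<Tuple<int, int>> ranges,
                              string prefix, string suffix)
    {
        var sb = new StringBuilder(input);

        // Work from the back of the string so that an insertion never
        // shifts a range that has not been processed yet.
        foreach (var range in ranges.OrderByDescending(r => r.Item1))
        {
            // Ignore invalid ranges (e.g. a start of -1 from a failed IndexOf).
            if (range.Item1 < 0 || range.Item2 > input.Length || range.Item1 >= range.Item2)
                continue;

            sb.Insert(range.Item2, suffix);
            sb.Insert(range.Item1, prefix);
        }

        return sb.ToString();
    }
}
```

For example, wrapping the ranges (2, 4) and (5, 8) of "a no fee b" in "<b>" and "</b>" gives "a <b>no</b> <b>fee</b> b".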

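================== EDIT ======================

To narrow the minimum/maximum index range of the highlights, you can use the code below. It is not the prettiest, but it does the job.

It gets all the available indexes of each word (thanks to Finding ALL positions of a substring in a large string in C#), adds them to highlights, and then trims this collection so that the matches kept are the ones closest together.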

public static IEnumerable<Tuple<int, int>> GetHighlights(string input, IEnumerable<string> searchExpression)
{
    var highlights = new List<Tuple<string, int, int>>();

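    // Find every index of every word in the search expression.
    foreach (var word in searchExpression)
    {
        var allIndexesOfWord = AllIndexesOf(input, word, StringComparison.InvariantCultureIgnoreCase);

        // The narrowing below calls Last(), which throws on an empty
        // sequence, so return no highlights if a word is missing entirely.
        if (allIndexesOfWord.Count == 0)
            return Enumerable.Empty<Tuple<int, int>>();

        highlights.AddRange(allIndexesOfWord.Select(index => new Tuple<string, int, int>(word, index, index + word.Length)));
    }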

    // Reduce the scope of the highlights in order to 
    // keep the indexes as together as possible.
    var firstWord = searchExpression.First();
    var firstWordIndex = highlights.IndexOf(highlights.Last(x => String.Equals(x.Item1, firstWord)));

    var lastWord = searchExpression.Last();
    var lastWordIndex = highlights.IndexOf(highlights.Last(x => String.Equals(x.Item1, lastWord)));

    var sanitizedHighlights = highlights.SkipWhile((x, i) => i < firstWordIndex);
    sanitizedHighlights = sanitizedHighlights.TakeWhile((x, i) => i <= lastWordIndex);

    highlights = new List<Tuple<string, int, int>>();
    foreach (var word in searchExpression.Reverse())
    {
        var lastOccurence = sanitizedHighlights.Last((x) => String.Equals(x.Item1, word));
        sanitizedHighlights = sanitizedHighlights.TakeWhile(x => x.Item3 < lastOccurence.Item2);
        highlights.Add(lastOccurence);
    }

    highlights.Reverse();

    return highlights.Select(x => new Tuple<int, int>(x.Item2, x.Item3));
}

public static List<int> AllIndexesOf(string str, string value, StringComparison comparison)
{
    if (String.IsNullOrEmpty(value))
        throw new ArgumentException("the string to find may not be empty", "value");

    List<int> indexes = new List<int>();
    for (int index = 0; ; index += value.Length)
    {
        index = str.IndexOf(value, index, comparison);
        if (index == -1)
            return indexes;
        indexes.Add(index);
    }
}

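With this code and the text:

"No, about the fee, since there is limited overhead space on the plane, I assure you, there will be no fee for checking the bags, I can go ahead and fill out all the checked baggage forms for you."

I get the following result:

No, about the fee, since there is limited overhead space on the plane, I assure you, there will be no fee for checking the bags, I can go ahead and fill out all the checked baggage forms for you.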
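
One thing to keep in mind with plain substring search is that a short word also matches inside longer words. The sketch below illustrates this with a self-contained loop in the same style as AllIndexesOf; the class name SubstringDemo is hypothetical and the ordinal, case-insensitive comparison is an assumption made for the example.

```csharp
using System;
using System.Collections.Generic;

public static class SubstringDemo
{
    // Collects every index at which value occurs in text, ignoring case.
    // Like the loop above, it does not report overlapping matches.
    public static List<int> AllIndexesOf(string text, string value)
    {
        if (string.IsNullOrEmpty(value))
            throw new ArgumentException("the string to find may not be empty", "value");

        var indexes = new List<int>();
        int index = 0;
        while (true)
        {
            index = text.IndexOf(value, index, StringComparison.OrdinalIgnoreCase);
            if (index == -1)
                return indexes;

            indexes.Add(index);
            index += value.Length; // continue after this match
        }
    }
}
```

Calling SubstringDemo.AllIndexesOf("there is the fee", "the") returns the indexes 0 and 9: the first hit is the "the" inside "there", which is usually not what a word-based highlighter wants.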

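====================================================

Edit 2: a Regex approach that builds on what was learned from the earlier attempts.
Note that unless every word of the expression is found, no highlights are returned.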

public static IEnumerable<Tuple<int,int>> GetHighlights(string expression, string search)
{
    var highlights = new List<Tuple<string, int, int>>();

    var wordsToHighlight = new Regex(@"(\w+|[^\s]+)").
        Matches(search).
        Cast<Match>().
        Select(x => x.Value);

    foreach(var wordToHighlight in wordsToHighlight)
    {
        Regex findMatchRegex = null;
        // Regex.Escape keeps characters such as "." or "(" in the search text
        // from being interpreted as regex syntax.
        var escapedWord = Regex.Escape(wordToHighlight);
        if (new Regex(@"\W").IsMatch(wordToHighlight))
            findMatchRegex = new Regex(String.Format(@"({0})", escapedWord), RegexOptions.IgnoreCase);  // is punctuation
        else
            findMatchRegex = new Regex(String.Format(@"((?<!\w){0}(?!\w))", escapedWord), RegexOptions.IgnoreCase); // is word

        var matches = findMatchRegex.Matches(expression).Cast<Match>().Select(match => new Tuple<string, int, int>(wordToHighlight, match.Index, match.Index + wordToHighlight.Length));

        if (matches.Any())
            highlights.AddRange(matches);
        else
            return new List<Tuple<int, int>>();
    }

    // Reduce the scope of the highlights in order to 
    // keep the indexes as together as possible.
    var firstWord = wordsToHighlight.First();
    var firstWordIndex = highlights.IndexOf(highlights.Last(x => String.Equals(x.Item1, firstWord)));

    var lastWord = wordsToHighlight.Last();
    var lastWordIndex = highlights.IndexOf(highlights.Last(x => String.Equals(x.Item1, lastWord)));

    var sanitizedHighlights = highlights.SkipWhile((x, i) => i < firstWordIndex);
    sanitizedHighlights = sanitizedHighlights.TakeWhile((x, i) => i <= lastWordIndex);

    highlights = new List<Tuple<string, int, int>>();
    foreach (var word in wordsToHighlight.Reverse())
    {
        var lastOccurence = sanitizedHighlights.Last((x) => String.Equals(x.Item1, word));
        sanitizedHighlights = sanitizedHighlights.TakeWhile(x => x.Item3 < lastOccurence.Item2);
        highlights.Add(lastOccurence);
    }

    highlights.Reverse();

    return highlights.Select(x => new Tuple<int, int>(x.Item2, x.Item3));
}

Note also that this approach now handles punctuation. It gives the following results.

Input:
No, about the fee, since there is limited overhead space on the plane, I assure you, there will be no fee for checking the bags, I can go ahead and fill out all the checked baggage forms for you.

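Search:
no fee, I fill out the forms

Output:
No, about the fee, since there is limited overhead space on the plane, I assure you, there will be no fee for checking the bags, I can go ahead and fill out all the checked baggage forms for you.

Input: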
When First Class Glass receives your call, we will assign a repair person to visit you to assist.

Search:
we assign a repair person

Output:
When First Class Glass receives your call, we will assign a repair person to visit you to assist.
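
The whole-word lookarounds from Edit 2 can be tried in isolation with a small sketch like the one below. It is illustrative only: the names WholeWordDemo and WholeWordIndexes are hypothetical, and the word is passed through Regex.Escape so that search text containing regex metacharacters is matched literally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WholeWordDemo
{
    // Returns the start index of every whole-word, case-insensitive match.
    public static List<int> WholeWordIndexes(string text, string word)
    {
        // (?<!\w) and (?!\w) require that no word character sits directly
        // before or after the match, so "the" no longer matches in "there".
        string pattern = @"(?<!\w)" + Regex.Escape(word) + @"(?!\w)";

        return Regex.Matches(text, pattern, RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(match => match.Index)
                    .ToList();
    }
}
```

With this helper, WholeWordIndexes("there is the fee", "the") returns only index 9, and WholeWordIndexes("a.b a+b", "a+b") returns index 4 because the "+" is treated as a literal character.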

A similar question about c# - function/regex to match parts of a string within a larger string and highlight those parts can be found on Stack Overflow: https://stackoverflow.com/questions/42725934/
