c# - Optimal compare algorithm for finding string matches in a list of strings in C#


Reposted by: 塔克拉玛干. Updated: 2023-11-03 03:42:34

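Suppose I have a list of 100,000 words. I want to find out whether a given string matches any word in that list, and I want to do it as fast as possible. I also want to know whether any other word formed from that string, starting at its first character, appears in the list.

For example:

Suppose you have the string "icedtgg"

"i" "ic" "ice" "iced" "icedt" "icedtg" "icedtgg"

I am trying to come up with an optimal comparison algorithm that tells me which of these words are in my list.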

So far, my list of 100,000 words is stored in a

Dictionary<char, List<string>> WordList;

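where the char is the first character of a word and the List<string> holds all words that start with that character.

So WordList['a'] has a list of all words that start with "a" ("ape", "apple", "art", etc.), 'b' has a list of all words that start with "b", and so on.

Since I know that all of my candidate words start with "i", I can first narrow the search from 100,000 words down to only the words that start with "i".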

List<string> CurrentWordList = WordList['i'];

Now I check

if( CurrentWordList[0].Length == 1 )

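If that is true, I know my first candidate, "i", is a match, because "i" would be the first word in the list. The lists are sorted alphabetically ahead of time so that sorting does not slow down the matching.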
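To make the bucketed layout described above concrete, here is a minimal sketch. It is not the asker's actual code: the names `BucketedWordList`, `Build`, and `FindPrefixMatches` are hypothetical, and it assumes ordinal (culture-insensitive) ordering so that `List<T>.BinarySearch` agrees with the sort order. Each prefix of the input is looked up by binary search in the bucket for its first character.

```csharp
using System;
using System.Collections.Generic;

public static class BucketedWordList
{
    // Groups words by first character and sorts each bucket ordinally.
    public static Dictionary<char, List<string>> Build(IEnumerable<string> words)
    {
        var buckets = new Dictionary<char, List<string>>();
        foreach (var word in words)
        {
            if (string.IsNullOrEmpty(word))
                continue; // no first character to bucket on
            if (!buckets.TryGetValue(word[0], out var bucket))
                buckets[word[0]] = bucket = new List<string>();
            bucket.Add(word);
        }
        foreach (var list in buckets.Values)
            list.Sort(StringComparer.Ordinal);
        return buckets;
    }

    // Returns every prefix of input (starting at its first character)
    // that is present in the word list, shortest first.
    public static List<string> FindPrefixMatches(
        Dictionary<char, List<string>> buckets, string input)
    {
        var matches = new List<string>();
        if (string.IsNullOrEmpty(input) || !buckets.TryGetValue(input[0], out var bucket))
            return matches;
        for (var len = 1; len <= input.Length; len++)
        {
            var prefix = input.Substring(0, len);
            // BinarySearch returns a non-negative index only on an exact hit.
            if (bucket.BinarySearch(prefix, StringComparer.Ordinal) >= 0)
                matches.Add(prefix);
        }
        return matches;
    }
}
```

Under these assumptions every prefix costs one binary search over its bucket plus a substring allocation, so lookup time still grows (logarithmically) with the size of the bucket.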

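Any ideas?

*No, this is not a homework assignment. I am a professional software architect trying to find an optimal matching algorithm for fun/hobby/game development.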

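Best answer

I decided to add this answer not because it is the optimal solution to your problem, but to illustrate two possible solutions that are relatively simple and somewhat in line with the approach you seem to be following.

The (non-optimized) sample below provides an extremely simple prefix trie implementation that uses one node per consumed character.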

using System.Collections.Generic;

public class SimplePrefixTrie
{
    private readonly Node _root = new Node(); // root represents empty string.

    private class Node
    {
        public Dictionary<char, Node> Children;
        public bool IsTerminal; // whether a full word ends here.

        public Node Find(string word, int index)
        {
            var child = default(Node);
            if (index < word.Length && Children != null)
                Children.TryGetValue(word[index], out child);
            return child;
        }

        public Node Add(string word, int toConsume)
        {
            var child = default(Node);
            if (toConsume == word.Length)
                this.IsTerminal = true;
            else if (Children == null || !Children.TryGetValue(word[toConsume], out child))
            {
                if (Children == null)
                    Children = new Dictionary<char, Node>();
                Children[word[toConsume]] = child = new Node();
            }
            return child;
        }
    }

    public void AddWord(string word)
    {
        var ndx = 0;
        var cur = _root;
        while (cur != null)
            cur = cur.Add(word, ndx++);
    }

    public IEnumerable<string> FindWordsMatchingPrefixesOf(string searchWord)
    {
        var ndx = 0;
        var cur = _root;
        while (cur != null)
        {
            if (cur.IsTerminal)
                yield return searchWord.Substring(0, ndx);
            cur = cur.Find(searchWord, ndx++);
        }
    }
}

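A simple implementation of a compressed prefix trie is also added below. It follows almost the same approach as the sample above, but stores shared prefix parts instead of single characters. When an existing stored prefix becomes shared and has to be divided in two, it splits the node.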

using System;
using System.Collections.Generic;

public class SimpleCompressedPrefixTrie
{
    private readonly Node _root = new Node();

    private class Node
    {
        private Dictionary<char, Node> _children;
        public string PrefixValue = string.Empty;
        public bool IsTerminal;

        public Node Add(string word, ref int startIndex)
        {
            var n = FindSharedPrefix(word, startIndex);
            startIndex += n;
            if (n == PrefixValue.Length) // full prefix match
            {
                if (startIndex == word.Length) // full match
                    IsTerminal = true;
                else
                    return AddToChild(word, ref startIndex);
            }
            else // partial match, need to split this node's prefix.
                SplittingAdd(word, n, ref startIndex);
            return null;
        }

        public Node Find(string word, ref int startIndex, out int matchLen)
        {
            var n = FindSharedPrefix(word, startIndex);
            startIndex += n;
            matchLen = -1;
            if (n == PrefixValue.Length)
            {
                if (IsTerminal)
                    matchLen = startIndex;
                var child = default(Node);
                if (_children != null && startIndex < word.Length && _children.TryGetValue(word[startIndex], out child))
                {
                    startIndex++; // consumed map key character.
                    return child;
                }
            }
            return null;
        }

        private Node AddToChild(string word, ref int startIndex)
        {
            var key = word[startIndex++]; // consume the mapping character
            var nextNode = default(Node);
            if (_children == null)
                _children = new Dictionary<char, Node>();
            else if (_children.TryGetValue(key, out nextNode))
                return nextNode;
            var remainder = word.Substring(startIndex);
            _children[key] = new Node() { PrefixValue = remainder, IsTerminal = true };
            return null; // consumed.
        }

        private void SplittingAdd(string word, int n, ref int startIndex)
        {
            var curChildren = _children;
            _children = new Dictionary<char, Node>();
            _children[PrefixValue[n]] = new Node()
            {
                PrefixValue = this.PrefixValue.Substring(n + 1),
                IsTerminal = this.IsTerminal,
                _children = curChildren
            };
            PrefixValue = PrefixValue.Substring(0, n);
            IsTerminal = startIndex == word.Length;
            if (!IsTerminal)
            {
                var prefix = word.Length > startIndex + 1 ? word.Substring(startIndex + 1) : string.Empty;
                _children[word[startIndex]] = new Node() { PrefixValue = prefix, IsTerminal = true };
                startIndex++;
            }
        }

        private int FindSharedPrefix(string word, int startIndex)
        {
            var n = Math.Min(PrefixValue.Length, word.Length - startIndex);
            var len = 0;
            while (len < n && PrefixValue[len] == word[len + startIndex])
                len++;
            return len;
        }
    }

    public void AddWord(string word)
    {
        var ndx = 0;
        var cur = _root;
        while (cur != null)
            cur = cur.Add(word, ref ndx);
    }

    public IEnumerable<string> FindWordsMatchingPrefixesOf(string searchWord)
    {
        var startNdx = 0;
        var cur = _root;
        while (cur != null)
        {
            var matchLen = 0;
            cur = cur.Find(searchWord, ref startNdx, out matchLen);
            if (matchLen > 0)
                yield return searchWord.Substring(0, matchLen);
        }
    }
}

Example usage:

var trie = new SimplePrefixTrie(); // or new SimpleCompressedPrefixTrie();
trie.AddWord("hello");
trie.AddWord("iced");
trie.AddWord("i");
trie.AddWord("ice");
trie.AddWord("icecone");
trie.AddWord("dtgg");
trie.AddWord("hicet");
foreach (var w in trie.FindWordsMatchingPrefixesOf("icedtgg"))
    Console.WriteLine(w);

Output:

i
ice
iced

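Update: Choosing the right data structure matters

I thought an update could add some value by illustrating how important it is to choose a data structure that fits the problem, and what kinds of trade-offs are involved. So I created a small benchmark application that tests the strategies from the answers given so far for this question against a naive reference implementation.

Naive: the simplest possible naive solution, used as the baseline.
JimMischel: based on the approach from the linked answer.
MattyMerrix: based on your own answer.
JimMattyDSL: combines the "JimMischel" and "MattyMerrix" approaches and uses a more optimal binary string search in the sorted list.
SimpleTrie and CompessedTrie: based on the two implementations described in this answer.

The complete benchmark code can be found in the linked gist. Running it with dictionaries of 10,000, 100,000, and 1,000,000 words (randomly generated character sequences) and searching for all prefix matches of 5,000 terms gives the following results:

Matching 5000 words to a dictionary of 10000 terms of max length 25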

       Method              Memory (MB)         Build Time (s)        Lookup Time (s)
        Naive          0.64-0.64, 0.64     0.001-0.002, 0.001     6.136-6.312, 6.210
   JimMischel          0.84-0.84, 0.84     0.013-0.018, 0.016     0.083-0.113, 0.102
  JimMattyDSL          0.80-0.81, 0.80     0.013-0.018, 0.016     0.008-0.011, 0.010
   SimpleTrie       24.55-24.56, 24.56     0.042-0.056, 0.051     0.002-0.002, 0.002
CompessedTrie          1.84-1.84, 1.84     0.003-0.003, 0.003     0.002-0.002, 0.002
  MattyMerrix          0.83-0.83, 0.83     0.017-0.017, 0.017     0.034-0.034, 0.034

Matching 5000 words to a dictionary of 100000 terms of max length 25

       Method              Memory (MB)         Build Time (s)        Lookup Time (s)
        Naive          6.01-6.01, 6.01     0.024-0.026, 0.025  65.651-65.758, 65.715
   JimMischel          6.32-6.32, 6.32     0.232-0.236, 0.233     1.208-1.254, 1.235
  JimMattyDSL          5.95-5.96, 5.96     0.264-0.269, 0.266     0.050-0.052, 0.051
   SimpleTrie    226.49-226.49, 226.49     0.932-0.962, 0.951     0.004-0.004, 0.004
CompessedTrie       16.10-16.10, 16.10     0.101-0.126, 0.111     0.003-0.003, 0.003
  MattyMerrix          6.15-6.15, 6.15     0.254-0.269, 0.259     0.414-0.418, 0.416

Matching 5000 words to a dictionary of 1000000 terms of max length 25

       Method              Memory (MB)         Build Time (s)        Lookup Time (s)
   JimMischel       57.69-57.69, 57.69     3.027-3.086, 3.052  16.341-16.415, 16.373
  JimMattyDSL       60.88-60.88, 60.88     3.396-3.484, 3.453     0.399-0.400, 0.399
   SimpleTrie 2124.57-2124.57, 2124.57  11.622-11.989, 11.860     0.006-0.006, 0.006
CompessedTrie    166.59-166.59, 166.59     2.813-2.832, 2.823     0.005-0.005, 0.005
  MattyMerrix       62.71-62.73, 62.72     3.230-3.270, 3.251     6.996-7.015, 7.008

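As you can see, the memory required by the (non-space-optimized) tries is significantly higher. Memory grows with the size of the dictionary, O(N), for all tested implementations.

As expected, lookup time for the tries is more or less constant: O(k), dependent only on the length of the search term. For the other implementations, lookup time increases with the size of the dictionary being searched.

Note that far more optimized implementations can be built for this problem that still approach O(k) search time while allowing more compact storage and a smaller memory footprint. If you can map to a reduced alphabet (e.g. 'A'-'Z' only), that is also something you can take advantage of.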
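As one illustration of the reduced-alphabet idea, here is a minimal sketch rather than a tuned implementation. The class name `LowercaseArrayTrie` is hypothetical, and it assumes every word uses only the lowercase letters 'a'-'z' (matching the sample words), so each node can index a fixed 26-slot array instead of hashing into a dictionary. The child array is allocated lazily so that leaf nodes do not pay for it.

```csharp
using System;
using System.Collections.Generic;

public class LowercaseArrayTrie
{
    private const int AlphabetSize = 26; // assumption: words use only 'a'..'z'

    private sealed class Node
    {
        public Node[] Children; // allocated on first use
        public bool IsTerminal; // whether a full word ends here
    }

    private readonly Node _root = new Node();

    public void AddWord(string word)
    {
        var cur = _root;
        foreach (var c in word)
        {
            var slot = c - 'a';
            if (slot < 0 || slot >= AlphabetSize)
                throw new ArgumentException("Only 'a'..'z' is supported in this sketch.", nameof(word));
            if (cur.Children == null)
                cur.Children = new Node[AlphabetSize];
            var next = cur.Children[slot];
            if (next == null)
                cur.Children[slot] = next = new Node();
            cur = next;
        }
        cur.IsTerminal = true;
    }

    // Yields every stored word that is a prefix of searchWord, shortest first.
    public IEnumerable<string> FindWordsMatchingPrefixesOf(string searchWord)
    {
        var cur = _root;
        for (var i = 0; i < searchWord.Length; i++)
        {
            var slot = searchWord[i] - 'a';
            if (slot < 0 || slot >= AlphabetSize || cur.Children == null)
                yield break; // outside the alphabet, or no deeper nodes
            cur = cur.Children[slot];
            if (cur == null)
                yield break;
            if (cur.IsTerminal)
                yield return searchWord.Substring(0, i + 1);
        }
    }
}
```

Whether this actually uses less memory than the dictionary-based nodes depends on how densely the child slots are populated; the main gain is that each step becomes a plain array index.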

Source: a similar question on Stack Overflow about the optimal compare algorithm for finding string matches in a list of strings in C#: https://stackoverflow.com/questions/30578450/
