excel - A more accurate and efficient fuzzy search algorithm


Reposted by: 行者123 · Updated: 2023-12-04 19:50:34

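I have been looking at fuzzy matching/search algorithms on the internet and tried out several solutions.

The only one that gave reasonably accurate results was the one from MrExcel (http://www.mrexcel.com/pc07.shtml). The problem with this approach is that neither the order or relative position of the characters within a word, nor the order of the words themselves, has any effect on the result.

I would like to get better results that take into account the relative position of the words as well as the order of the letters within each word.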

Function FuzzyMatchByWord(ByVal lsPhrase1 As String, ByVal lsPhrase2 As String, Optional lbStripVowels As Boolean = False, Optional lbDiscardExtra As Boolean = False) As Double

'
' Compare two phrases and return a similarity value (between 0 and 100).
'
' Arguments:
'
' 1. Phrase1        String; any text string
' 2. Phrase2        String; any text string
' 3. StripVowels    Optional to strip all vowels from the phrases
' 4. DiscardExtra   Optional to discard any unmatched words
'


'local variables
Dim lsWord1() As String
Dim lsWord2() As String
Dim ldMatch() As Double
Dim ldCur As Double
Dim ldMax As Double
Dim liCnt1 As Integer
Dim liCnt2 As Integer
Dim liCnt3 As Integer
Dim lbMatched() As Boolean
Dim lsNew As String
Dim lsChr As String
Dim lsKeep As String

'set default value as failure
FuzzyMatchByWord = 0

'create list of characters to keep
lsKeep = "BCDFGHJKLMNPQRSTVWXYZ0123456789 "
If Not lbStripVowels Then
    lsKeep = lsKeep & "AEIOU"
End If

'clean up phrases by stripping undesired characters
'phrase1
lsPhrase1 = Trim$(UCase$(lsPhrase1))
lsNew = ""
For liCnt1 = 1 To Len(lsPhrase1)
    lsChr = Mid$(lsPhrase1, liCnt1, 1)
    If InStr(lsKeep, lsChr) <> 0 Then
        lsNew = lsNew & lsChr
    End If
Next
lsPhrase1 = lsNew
lsPhrase1 = Replace(lsPhrase1, "  ", " ")
lsWord1 = Split(lsPhrase1, " ")
If UBound(lsWord1) = -1 Then
    Exit Function
End If
ReDim ldMatch(UBound(lsWord1))
'phrase2
lsPhrase2 = Trim$(UCase$(lsPhrase2))
lsNew = ""
For liCnt1 = 1 To Len(lsPhrase2)
    lsChr = Mid$(lsPhrase2, liCnt1, 1)
    If InStr(lsKeep, lsChr) <> 0 Then
        lsNew = lsNew & lsChr
    End If
Next
lsPhrase2 = lsNew
lsPhrase2 = Replace(lsPhrase2, "  ", " ")
lsWord2 = Split(lsPhrase2, " ")
If UBound(lsWord2) = -1 Then
    Exit Function
End If
ReDim lbMatched(UBound(lsWord2))

'exit if empty
If Trim$(lsPhrase1) = "" Or Trim$(lsPhrase2) = "" Then
    Exit Function
End If

'compare words in each phrase
For liCnt1 = 0 To UBound(lsWord1)
    ldMax = 0
    For liCnt2 = 0 To UBound(lsWord2)
        If Not lbMatched(liCnt2) Then
            ldCur = FuzzyMatch(lsWord1(liCnt1), lsWord2(liCnt2))
            If ldCur > ldMax Then
                liCnt3 = liCnt2
                ldMax = ldCur
            End If
        End If
    Next
    lbMatched(liCnt3) = True
    ldMatch(liCnt1) = ldMax
Next

'discard extra words
ldMax = 0
For liCnt1 = 0 To UBound(ldMatch)
    ldMax = ldMax + ldMatch(liCnt1)
Next
If lbDiscardExtra Then
    liCnt2 = 0
    For liCnt1 = 0 To UBound(lbMatched)
        If lbMatched(liCnt1) Then
            liCnt2 = liCnt2 + 1
        End If
    Next
Else
    liCnt2 = UBound(lsWord2) + 1
End If

'return overall similarity
FuzzyMatchByWord = 100 * (ldMax / liCnt2)


End Function

Function FuzzyMatch(Fstr As String, Sstr As String) As Double

'
' Code sourced from: http://www.mrexcel.com/pc07.shtml
' Credited to: Ed Acosta
' Modified: Joe Stanton
'

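'Each variable needs its own type; "Dim L, L1 As Integer" would leave L as a Variant.
Dim L As Long, L1 As Long, L2 As Long, M As Long, SC As Long, T As Long, R As Long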

L = 0
M = 0
SC = 1

L1 = Len(Fstr)
L2 = Len(Sstr)

Do While L < L1
    L = L + 1
    For T = SC To L1
        If Mid$(Sstr, L, 1) = Mid$(Fstr, T, 1) Then
            M = M + 1
            SC = T
            T = L1 + 1
        End If
    Next T
Loop

If L1 = 0 Then
    FuzzyMatch = 0
Else
    FuzzyMatch = M / L1
End If

End Function

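I am trying to compare the account descriptions in a trial balance against a list of 30,000 past account descriptions, and I want to find the top 5 results for each account.

For example: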

Debug.Print FuzzyMatchByWord("Cash and Cash Equivalents", "Bank and Cash")
Debug.Print FuzzyMatchByWord("Cash and Cash Equivalents", "Cash and Bank")
Debug.Print FuzzyMatchByWord("Cash and Cash Equivalents", "Shack sequential")
Debug.Print FuzzyMatchByWord("Cash and Cash Equivalents", "Sequential shack")

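I would like the relative position of the words within the phrase to have a greater effect on the score, and I would also like the order of the letters to have a greater effect. "Shack sequential" should not score that high when compared with "Cash and Cash Equivalents".

Best answer

When comparing strings, I usually use the Levenshtein distance. You can find an implementation of the algorithm here. You can extend that function with a ratio, which is a good indicator of how "close" two strings are. In the function below, the ratio is (Len(a) + Len(b) - distance) / (Len(a) + Len(b)), so identical strings score 1.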

Function levenshtein(a As String, b As String, Optional ratio As Boolean) As Double

    Dim i As Long, j As Long, cost As Long
    Dim d() As Long
    Dim min1 As Long, min2 As Long, min3 As Long

    If Len(a) = 0 Then
        levenshtein = Len(b)
        Exit Function
    End If

    If Len(b) = 0 Then
        levenshtein = Len(a)
        Exit Function
    End If

    ReDim d(Len(a), Len(b))

    For i = 0 To Len(a)
        d(i, 0) = i
    Next

    For j = 0 To Len(b)
        d(0, j) = j
    Next

    For i = 1 To Len(a)
        For j = 1 To Len(b)
            If Mid(a, i, 1) = Mid(b, j, 1) Then
                cost = 0
            Else
                cost = 1
            End If

            min1 = (d(i - 1, j) + 1)
            min2 = (d(i, j - 1) + 1)
            min3 = (d(i - 1, j - 1) + cost)

            d(i, j) = Application.WorksheetFunction.Min(min1, min2, min3)
        Next
    Next

    If ratio Then
        levenshtein = (Len(a) + Len(b) - d(Len(a), Len(b))) / (Len(a) + Len(b))
    Else
        levenshtein = d(Len(a), Len(b))
    End If

End Function

For example:

Debug.Print levenshtein("Cash and Cash Equivalents", "Bank and Cash", True)
Debug.Print levenshtein("Cash and Cash Equivalents", "Cash and Bank", True)
Debug.Print levenshtein("Cash and Cash Equivalents", "Shack sequential", True)
Debug.Print levenshtein("Cash and Cash Equivalents", "Sequential shack", True)

 0.605263157894737 
 0.631578947368421 
 0.560975609756098 
 0.48780487804878
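
As a cross-check of the ratio arithmetic, here is a minimal Python sketch of the same calculation. It is not part of the original answer: the helper name `top_matches`, the use of `heapq.nlargest`, and the tiny `history` list are assumptions made purely for illustration. It also shows one way such a score could be used to pick the best few candidates, as the question asks for. Unlike the VBA function, empty inputs go through the same code path here.

```python
import heapq


def levenshtein(a: str, b: str, ratio: bool = False) -> float:
    """Edit distance between a and b, or a 0..1 similarity when ratio=True."""
    # Keep only the previous row of the distance matrix.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    dist = prev[len(b)]
    if not ratio:
        return dist
    total = len(a) + len(b)
    # Same formula as the VBA version; two empty strings count as identical.
    return (total - dist) / total if total else 1.0


def top_matches(query: str, candidates: list[str], n: int = 5) -> list[tuple[float, str]]:
    """Return the n candidates with the highest ratio against query (hypothetical helper)."""
    scored = ((levenshtein(query, c, ratio=True), c) for c in candidates)
    return heapq.nlargest(n, scored)


# Illustrative data only: the four phrases from the example above.
history = ["Bank and Cash", "Cash and Bank", "Shack sequential", "Sequential shack"]
for score, text in top_matches("Cash and Cash Equivalents", history, n=2):
    print(f"{score:.4f}  {text}")
# 0.6316  Cash and Bank
# 0.6053  Bank and Cash
```

For the first pair the distance is 15 (three substitutions plus twelve deletions), which gives (38 - 15) / 38, matching the first value printed by the VBA code.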

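Edit

I suppose the string comparisons slow things down considerably. One way to speed this up is to convert the strings to byte arrays and compare the numeric values. This can be done as follows: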

Function levenshtein(a As String, b As String, Optional ratio As Boolean) As Double

    Dim i As Long, j As Long
    Dim k As Long, l As Long
    Dim cost As Long
    Dim d() As Long
    Dim min1 As Long, min2 As Long, min3 As Long
    Dim aByte1() As Byte, aByte2() As Byte

    If Len(a) = 0 Then
        levenshtein = Len(b)
        Exit Function
    End If

    If Len(b) = 0 Then
        levenshtein = Len(a)
        Exit Function
    End If

    ReDim d(Len(a), Len(b))

    For i = 0 To Len(a)
        d(i, 0) = i
    Next

    For j = 0 To Len(b)
        d(0, j) = j
    Next

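    'Assigning a String to a Byte array copies its UTF-16 data: two bytes per character.
    aByte1 = a
    aByte2 = b
    'Step 2 visits the low byte of each character, so this is only reliable for characters below code point 256.
    For i = 0 To UBound(aByte1, 1) Step 2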
        k = Int(i / 2) + 1
        For j = 0 To UBound(aByte2, 1) Step 2
            If aByte1(i) = aByte2(j) Then
                cost = 0
            Else
                cost = 1
            End If
            l = Int(j / 2) + 1
            min1 = (d(k - 1, l) + 1)
            min2 = (d(k, l - 1) + 1)
            min3 = (d(k - 1, l - 1) + cost)

            d(k, l) = Application.WorksheetFunction.Min(min1, min2, min3)
        Next
    Next

    If ratio Then
        levenshtein = (Len(a) + Len(b) - d(Len(a), Len(b))) / (Len(a) + Len(b))
    Else
        levenshtein = d(Len(a), Len(b))
    End If

End Function
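
The byte-array version steps through the arrays two bytes at a time because VBA strings are stored as UTF-16, with the low byte of each character first. The short Python sketch below only illustrates that layout and its one caveat. The helper name `utf16_units` is made up for this example, and Python's `utf-16-le` encoding is used here as a stand-in for the VBA assignment.

```python
def utf16_units(text: str) -> bytes:
    """Encode text as UTF-16 little-endian, the layout a VBA Byte() array receives."""
    return text.encode("utf-16-le")


raw = utf16_units("Cash")
print(list(raw))        # [67, 0, 97, 0, 115, 0, 104, 0]
print(list(raw[0::2]))  # low bytes only: [67, 97, 115, 104]

# Caveat: comparing only the low byte cannot tell 'A' (U+0041) from U+0141,
# because both have 0x41 as their low byte.
a_bytes = utf16_units("A")
l_stroke_bytes = utf16_units("\u0141")
print(a_bytes[0] == l_stroke_bytes[0])  # True
print(a_bytes == l_stroke_bytes)        # False
```

For plain ASCII account descriptions this makes no difference. For text containing characters at or above code point 256, comparing both bytes of each character would avoid false matches.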

Regarding "excel - A more accurate and efficient fuzzy search algorithm", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57489309/
