
python - Algorithm used in Excel Fuzzy Lookup


I am trying to match two sets of company names. I attempted to implement Levenshtein distance in Python, but I run into problems with abbreviated company names and trailing parts such as Pvt and Ltd. I have run the same sets through Excel Fuzzy Lookup and obtained good results. Is there a way to see how Excel Fuzzy Lookup is implemented, so that I can reproduce the same approach in Python?

Best Answer

The following is excerpted from Readme.docx of the Microsoft Fuzzy Lookup Add-In for Excel. I hope this helps.

Advanced Concepts

Fuzzy Lookup technology is based upon a very simple, yet flexible measure of similarity between two records.

Jaccard similarity

Fuzzy Lookup uses Jaccard similarity, which is defined as the size of the set intersection divided by the size of the set union for two sets of objects. For example, the sets {a, b, c} and {a, c, d} have a Jaccard similarity of 2/4 = 0.5 because the intersection is {a, c} and the union is {a, b, c, d}. The more that the two sets have in common, the closer the Jaccard similarity will be to 1.0.
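For reference, plain Jaccard similarity takes only a few lines of Python. This is a minimal sketch for illustration, not the add-in's actual implementation:

def jaccard(a: set, b: set) -> float:
    # Size of the intersection divided by the size of the union.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Reproduces the example above: {a, b, c} vs. {a, c, d} -> 2/4 = 0.5
print(jaccard({"a", "b", "c"}, {"a", "c", "d"}))  # 0.5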

Weighted Jaccard similarity and tokenization of records

With Fuzzy Lookup, you can assign weights to each item in a set and define the weighted Jaccard similarity as the total weight of the intersection divided by the total weight of the union. For the weighted sets {(a, 2), (b, 5), (c, 3)} and {(a, 2), (c, 3), (d, 7)}, the weighted Jaccard similarity is (2 + 3)/(2 + 3 + 5 + 7) = 5/17 ≈ 0.294.
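A sketch of the weighted variant, representing a weighted set as a dict mapping token to weight. Using min/max per shared token is a common generalization and reduces to the readme's formula when, as in the example, shared tokens carry the same weight on both sides:

def weighted_jaccard(a: dict, b: dict) -> float:
    # Total weight of the intersection divided by total weight of the union.
    inter = sum(min(a[t], b[t]) for t in a.keys() & b.keys())
    union = sum(max(a.get(t, 0), b.get(t, 0)) for t in a.keys() | b.keys())
    return inter / union if union else 1.0

a = {"a": 2, "b": 5, "c": 3}
b = {"a": 2, "c": 3, "d": 7}
print(round(weighted_jaccard(a, b), 3))  # (2 + 3) / 17 ≈ 0.294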

Because Jaccard similarity is defined over sets, Fuzzy Lookup must first convert data records to sets before it calculates the Jaccard similarity. Fuzzy Lookup converts the data to sets using a Tokenizer. For example, the record {“Jesper Aaberg”, “4567 Main Street”} might be tokenized into the set {“Jesper”, “Aaberg”, “4567”, “Main”, “Street”}. The default tokenizer is for English text, but one may change the LocaleId property in Configure=>Global Settings to specify tokenizers for other languages.
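A hypothetical tokenizer along these lines can be sketched in Python; the add-in's real Tokenizer is configurable and more sophisticated than this:

import re

def tokenize(record) -> set:
    # Split every field on non-word characters, in the spirit of a
    # simple English whitespace/punctuation tokenizer.
    tokens = set()
    for field in record:
        tokens.update(re.findall(r"\w+", field))
    return tokens

print(tokenize(("Jesper Aaberg", "4567 Main Street")))
# {'Jesper', 'Aaberg', '4567', 'Main', 'Street'}  (set order may vary)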

Token weighting

Because not all tokens are of equal importance, Fuzzy Lookup assigns weights to tokens. Tokens are assigned high weights if they occur infrequently in a sample of records and low weights if they occur frequently. For example, frequent words such as “Corporation” might be given lower weight, while less frequent words such as “Abracadabra” might be given a higher weight. One may override the default token weights by supplying one's own table of token weights.
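One common way to derive such frequency-based weights is an IDF-style formula. The readme does not give the add-in's exact formula, so the log(N/df) weighting below is purely an assumption for illustration:

import math
from collections import Counter

def token_weights(records) -> dict:
    # Count how many records each token appears in, then weight
    # rare tokens higher and frequent tokens lower.
    df = Counter(tok for rec in records for tok in rec)
    n = len(records)
    return {tok: math.log(n / count) for tok, count in df.items()}

records = [{"Acme", "Corporation"},
           {"Abracadabra", "Corporation"},
           {"Globex", "Corporation"}]
print(token_weights(records))
# "Corporation" appears in every record -> weight 0.0; rarer tokens ~1.1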

Transformations

Transformations greatly increase the power of Jaccard similarity by allowing tokens to be converted from one string to another. For instance, one might know that the name “Bob” can be converted to “Robert”; that “USA” is the same as “United States”; or that “Missispi” is a misspelling of “Mississippi”. There are many classes of such transformations that Fuzzy Lookup handles automatically such as spelling mistakes (using Edit Transformations described below), string prefixes, and string merge/split operations. You can also specify a table containing your own custom transformations.

Jaccard similarity under transformations

The Jaccard similarity under transformations is the maximum Jaccard similarity between any two transformations of each set. Given a set of transformation rules, all possible transformations of the set are considered. For example, for the sets {a, b, c} and {a, c, d} and the transformation rules {b=>d, d=>e}, the Jaccard similarity is computed as follows:

Variations of {a, b, c}: {a, b, c}, {a, d, c}
Variations of {a, c, d}: {a, c, d}, {a, c, e}

Maximum Jaccard similarity between all pairs:

J({a, b, c}, {a, c, d}) = 2/4 = 0.5
J({a, b, c}, {a, c, e}) = 2/4 = 0.5
J({a, d, c}, {a, c, d}) = 3/3 = 1.0
J({a, d, c}, {a, c, e}) = 2/4 = 0.5

The maximum is 1.0. Note: Weighted Jaccard similarity under transformations is simply the maximum weighted Jaccard similarity across all pairs of transformed sets.
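A brute-force sketch of this computation, enumerating variants exactly as in the example (each token either stays as-is or is rewritten once by a matching rule); the add-in itself is far more efficient than trying all pairs:

from itertools import product

def variants(tokens, rules):
    # For each token, the options are: keep it, or apply a matching rule.
    options = [[t] + ([rules[t]] if t in rules else []) for t in tokens]
    return {frozenset(choice) for choice in product(*options)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def jaccard_under_transformations(a, b, rules):
    # Maximum Jaccard similarity over all pairs of transformed sets.
    return max(jaccard(x, y)
               for x in variants(a, rules)
               for y in variants(b, rules))

rules = {"b": "d", "d": "e"}
print(jaccard_under_transformations({"a", "b", "c"}, {"a", "c", "d"}, rules))
# 1.0, via the pair {a, d, c} and {a, c, d}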

Edit distance

Edit distance is the total number of character insertions, deletions, or substitutions that it takes to convert one string to another. For example, the edit distance between “misissipi” and “mississippi” is 2 because two character insertions are required. One of the transformation providers that’s included with Fuzzy Lookup is the EditTransformationProvider, which generates specific transformations for each input record and creates a transformation from the token to all words in its dictionary that are within a given edit distance. The normalized edit distance is the edit distance divided by the length of the input string. In the previous example, the normalized edit distance is 2/9 = .222.
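The question mentions Levenshtein distance, which is exactly this measure. A standard dynamic-programming sketch, with the normalization from the readme's example:

def edit_distance(s: str, t: str) -> int:
    # Classic row-by-row Levenshtein DP: minimum number of single-character
    # insertions, deletions, or substitutions to turn s into t.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # delete from s
                            curr[j - 1] + 1,            # insert into s
                            prev[j - 1] + (cs != ct)))  # substitute
        prev = curr
    return prev[-1]

d = edit_distance("misissipi", "mississippi")
print(d, round(d / len("misissipi"), 3))  # 2 0.222 (normalized edit distance)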

Regarding python - Algorithm used in Excel Fuzzy Lookup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52553735/
