
levenshtein-distance - Comparing similarity algorithms

Reposted · Author: 行者123 · Updated: 2023-12-03 01:28:24

I want to use a string similarity function to find corrupted data in my database.

I have come across several of them:

  • Jaro,
  • Jaro–Winkler,
  • Levenshtein (edit distance),
  • Euclidean, and
  • Q-gram.

I would like to know the differences between them and the situations in which each works best.

Accepted answer

Expanding on my wiki-walk comment in the errata, and noting some of the ground-floor literature on the comparability of algorithms that apply to similar problem spaces, let's explore the applicability of these algorithms before determining whether they are numerically comparable.

From Wikipedia, Jaro–Winkler:

In computer science and statistics, the Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995) and mainly[citation needed] used in the area of record linkage (duplicate detection). The higher the Jaro–Winkler distance for two strings is, the more similar the strings are. The Jaro–Winkler distance metric is designed and best suited for short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.
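To make the definition concrete, here is a minimal Python sketch of the standard Jaro and Jaro–Winkler formulation (the function and parameter names are mine; `p=0.1` and a four-character prefix cap are the conventional Winkler defaults):

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity: matches within a sliding window, penalized by transpositions."""
    if s == t:
        return 1.0
    len_s, len_t = len(s), len(t)
    if len_s == 0 or len_t == 0:
        return 0.0
    # characters count as matching only within this distance of each other
    match_dist = max(len_s, len_t) // 2 - 1
    s_matched = [False] * len_s
    t_matched = [False] * len_t
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - match_dist), min(i + match_dist + 1, len_t)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count transpositions: matched characters that appear in a different order
    k = transpositions = 0
    for i in range(len_s):
        if not s_matched[i]:
            continue
        while not t_matched[k]:
            k += 1
        if s[i] != t[k]:
            transpositions += 1
        k += 1
    transpositions //= 2
    return (matches / len_s + matches / len_t
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s: str, t: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro score boosted by the length of the common prefix (up to max_prefix)."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

For the textbook pair "MARTHA"/"MARHTA" this yields a Jaro score of 17/18 ≈ 0.944 and a Jaro–Winkler score of about 0.961, illustrating the prefix boost on near-identical short names.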

Levenshtein distance:

In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance.

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.
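The definition above maps directly onto the classic dynamic-programming recurrence; a minimal sketch (one-row variant, function name mine):

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn s into t."""
    # prev[j] holds the distance between the current prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1,              # deletion from s
                cur[j - 1] + 1,           # insertion into s
                prev[j - 1] + (cs != ct)  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]
```

For example, `levenshtein("kitten", "sitting")` is 3: substitute k→s, substitute e→i, insert g.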

Euclidean distance:

In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, and is given by the Pythagorean formula. By using this formula as distance, Euclidean space (or even any inner product space) becomes a metric space. The associated norm is called the Euclidean norm. Older literature refers to the metric as Pythagorean metric.
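Note that the Euclidean metric is defined on vectors, not strings, so applying it here requires first embedding each string into a vector space. One common (and lossy) choice is a character-frequency vector; this embedding is my illustration, not something the answer prescribes:

```python
import math
from collections import Counter

def euclidean_char_distance(s: str, t: str) -> float:
    """Embed each string as a character-frequency vector,
    then apply the Pythagorean formula to the difference."""
    cs, ct = Counter(s), Counter(t)
    return math.sqrt(sum((cs[ch] - ct[ch]) ** 2 for ch in set(cs) | set(ct)))
```

Because the embedding discards character order, anagrams like "listen" and "silent" come out at distance 0, which is exactly why this metric answers a different question than the edit-based ones.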

Q- or n-gram encoding:

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items in question can be phonemes, syllables, letters, words or base pairs according to the application. n-grams are collected from a text or speech corpus.

The two core advantages of n-gram models (and algorithms that use them) are relative simplicity and the ability to scale up – by simply increasing n a model can be used to store more context with a well-understood space–time tradeoff, enabling small experiments to scale up very efficiently.
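One common way to turn q-grams into a string distance is to compare q-gram count profiles (this is Ukkonen's q-gram distance; the function names are mine):

```python
from collections import Counter

def qgrams(s: str, q: int = 2) -> Counter:
    """Multiset of all contiguous substrings of length q."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def qgram_distance(s: str, t: str, q: int = 2) -> int:
    """Sum of absolute differences between the two q-gram count profiles."""
    gs, gt = qgrams(s, q), qgrams(t, q)
    return sum(abs(gs[g] - gt[g]) for g in set(gs) | set(gt))
```

For "night" vs. "nacht" with q=2, only the bigram "ht" is shared, so the distance is 6 (three unmatched bigrams on each side). Increasing q stores more context per gram, at the space cost the quoted paragraph describes.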

The problem is that these algorithms solve different problems, which have different applicability within the space of all possible algorithms for the longest common subsequence problem, either directly in your data or in grafting a usable metric onto it. In fact, not all of these are even metrics, since some of them do not satisfy the triangle inequality.

Rather than going out of your way to define a dubious scheme for detecting data corruption, do this properly: use checksums and parity bits; don't try to solve a much harder problem when a simpler solution will do.
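As a sketch of that simpler approach: store a checksum alongside each record when it is written, and recompute it on read; any single corrupted byte changes the checksum. The row contents below are invented for illustration:

```python
import zlib

def checksum(row_bytes: bytes) -> int:
    """CRC-32 of the serialized record; cheap to compute and to compare."""
    return zlib.crc32(row_bytes)

# At write time, store the CRC next to the row.
stored_crc = checksum(b"alice,1984-02-29")

# At read time, recompute and compare: a mismatch flags corruption directly,
# with no fuzzy similarity threshold to tune.
assert checksum(b"alice,1984-02-29") == stored_crc   # intact row passes
assert checksum(b"alice,1984-O2-29") != stored_crc   # corrupted byte is caught
```

Unlike a similarity score, this gives a yes/no answer and costs a single pass over the bytes.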

Regarding levenshtein-distance - comparing similarity algorithms, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/9842188/
