gpt4 book ai didi

mysql - Jaro-winkler 函数 : why is the same score matching very similar and very different words?

转载 作者:行者123 更新时间:2023-11-29 03:18:59 29 4
gpt4 key购买 nike

我正在使用 jaro-winkler 模糊匹配来匹配名称。

我正在尝试确定相似性分数的截止范围。如果名称差异太大,我想将它们排除在外以供人工审核。

虽然 .4 以下的名称似乎完全不同,但 .4 范围似乎非常相似。

但后来我遇到了奇怪的异常(exception)情况,其中该范围内的一些名称完全不同,而有些名称仅相差一两个字母(请参见下面的示例)。

有人可以解释在相同的匹配分数范围内匹配的差异很大吗?

   Estrella     ANNELISE    0.42 
Arienna IREANNA 0.43
Tayvia I TAYVIA 0.43
Amanda IZABEL 0.44
Hunter JOSHUA 0.44
Ryder CHARLES 0.45
Luis ELIZABETH 0.45
Sebastian JOSE 0.45
Christopher CHISTOPHE 0.46
Genayunique GENAY-UNI 0.46
Andreeaonn ADREEAONN 0.46
Chistopher CHRISTOPH 0.46
Dazharicon DAZHARION 0.46
Jennavecia JENNACVEC 0.46
Valentiria VALENTINA 0.46
Abel SAMMUEL 0.46
Dezarea MarieDEZAREA 0.47
Alexander ALEXZANDE 0.47

最佳答案

Jaro-Winkler 距离公式偏向于具有共同开头的字符串。例如,Valentina 和 Valentiria

它还有一些不太直观的“规则”(参见 wikipedia)。

您可能应该首先确定您期望的差异类型,然后寻找合适的距离公式。例如,在写作中,“angleworm”和“angelworm”很可能会出错,所以这两个字符串之间的距离应该很小。虽然“there”和“three”不匹配的可能性较小,但“ether”更是如此。对于更长的字谜,Jaro 距离可能完全相同,甚至 Winkler 修正也可能不会生效。

正如您在 this page 中所读到的那样(强调我的)

Beyond the optimization for empty strings and those which are exactly the same, you can see here that I weight the first character even more heavily. This is due to my data being very initial heavy.

To compensate for the frequent use of middle initials I count Jaro-Winkler distance as 80% of the score, while the remaining 20% is fully based on the first character matching. The value of p here was determined by the results of heavy experimentation and hair pulling. Before making this extension initials would frequently align incorrectly.

关于mysql - Jaro-winkler 函数 : why is the same score matching very similar and very different words?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/48406993/

29 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com