
python - Levenshtein distance gives strange values

Reposted. Author: 行者123. Updated: 2023-12-03 14:31:12

Here is a string T:

'men shirt team brienne funny sarcasm shirt features graphic tees mugs babywear much real passion brilliant design detailed illustration strong appreciation things creative br shop thousands designs found across different shirt babywear mugs funny pop culture abstract witty many designs brighten day well day almost anyone else meet ul li quality short sleeve crew neck shirts 100 cotton soft durable comfortable feel fit standard size doubt l xl available li li sustainability label company conceived belief textiles industry start acting lot responsibly made cotton li li clothing printed using state art direct garment equipment crack peel washed li li graphic tee designs professionally printed unique design look great make someone smile funny cute vintage expressive artwork li ul'


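I highlighted part of the string above because it is a preprocessed version of the text and may be hard to read.

I get the following values:

fuzz.partial_ratio('short sleeve', T) gives 50
fuzz.partial_ratio('long sleeve', T) gives 73
fuzz.partial_ratio('dsfsdf sleeve', T) gives 62
fuzz.partial_ratio('sleeve', T) gives 50

I am very confused by this. Shouldn't the first and fourth values be 100? Surely I am missing something, but I can't figure out what.

Edit: here is another example, which I ran after uninstalling the python-Levenshtein library: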

'first succeed way wife told v 2 long sleeve shirt id 1084 first succeed way wife told v 2 long sleeve shirt design printed quality 100 long sleeve cotton shirt sports gray 90 cotton 10 polyester standard long sleeve shirts fashion fit tight fitting style please check size chart listed additional image feel free contact us first sizing questions satisfaction 100 guaranteed shirts usually ship business day ordered noon est next business day ordered noon est long sleeve shirts 100 cotton standard shirt fashion fit combined shipping multiple items'

fuzz.partial_ratio('long sleeve', T) gives 27
fuzz.partial_ratio('short sleeve', T) gives 33
fuzz.partial_ratio('sleeveless', T) gives 40
fuzz.partial_ratio('dsfasd sleeve', T) gives 23

Unfortunately, the problem does not seem to be unique to the python-Levenshtein library.
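The expectation in the question (an exact substring should score 100) can be checked with a brute-force sketch that does not depend on fuzzywuzzy at all. The function name brute_force_partial_ratio is hypothetical, and the scoring uses the standard library's difflib rather than python-Levenshtein, so intermediate scores may differ from fuzzywuzzy's; only the exact-match case is guaranteed to agree.

```python
from difflib import SequenceMatcher


def brute_force_partial_ratio(s1: str, s2: str) -> int:
    """Best similarity (0-100) of the shorter string against every
    same-length window of the longer string (hypothetical helper)."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        return 0
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        # autojunk=False stops difflib from discarding frequent characters.
        score = SequenceMatcher(None, shorter, window, autojunk=False).ratio()
        best = max(best, score)
        if best == 1.0:
            break  # an exact substring was found; nothing can beat it
    return round(100 * best)


# A short excerpt of the first string T from the question.
text = "quality short sleeve crew neck shirts 100 cotton soft durable"
print(brute_force_partial_ratio("short sleeve", text))  # 100
print(brute_force_partial_ratio("sleeve", text))        # 100
```

Trying every window costs far more than what fuzzywuzzy does, but it makes a handy reference when a library result looks suspicious.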

Best answer

There is a very strange and subtle bug somewhere in the fuzzywuzzy library.
If we run the following:

from fuzzywuzzy import fuzz

fuzz.partial_ratio('funny', 'aa aaaaa aaaa aaaaaaa funny aaaaaaa aaaaaaaa aaaaaaa aaaa aaaa aaayaaaa auaa aaaa aaaaaaaa aaaaaaaaa aaaaaa aaaaaaaa aaaaa aaaa aa aaaaaaaaaaa aaaaaa aaaffaaaaaaa aaaaa aaayaaaa auaa funny aaaa aaaaaa')
it returns 0, whereas if we remove one letter from the start of this string:
fuzz.partial_ratio('funny', 'a aaaaa aaaa aaaaaaa funny aaaaaaa aaaaaaaa aaaaaaa aaaa aaaa aaayaaaa auaa aaaa aaaaaaaa aaaaaaaaa aaaaaa aaaaaaaa aaaaa aaaa aa aaaaaaaaaaa aaaaaa aaaffaaaaaaa aaaaa aaayaaaa auaa funny aaaa aaaaaa')
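it returns 100. (Sorry for the long, ugly strings. I tried to reduce them to the simplest possible case, but I can't see the logic that drives this bug.)
There appear to be similar bug reports on GitHub.
Installing python-Levenshtein seems to fix my example above (fuzzywuzzy falls back to difflib when python-Levenshtein is not installed), but it does not change your original example.
With python-Levenshtein installed, I can reduce your example to: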
fuzz.partial_ratio('sleeve', 's l e e v sleeve e ')
which returns 50.
Removing the first letter from the longer string:
fuzz.partial_ratio('sleeve', 'l e e v sleeve e ')
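returns 100.
This gives some hint about what might be going on, but I suspect it would take digging into python-Levenshtein to work it out.
My recommendation? File a bug report. Then find another library for comparing strings. RapidFuzz might be a suitable alternative.
Update:
I think this bug may be related to the use of opcodes from the python-Levenshtein library: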
from Levenshtein import opcodes

opcodes('sleeve', 's l e e v sleeve e ')
which returns:
[('equal', 0, 1, 0, 1),
('insert', 1, 1, 1, 2),
('equal', 1, 2, 2, 3),
('insert', 2, 2, 3, 4),
('equal', 2, 3, 4, 5),
('insert', 3, 3, 5, 6),
('equal', 3, 4, 6, 7),
('insert', 4, 4, 7, 8),
('equal', 4, 5, 8, 9),
('insert', 5, 5, 9, 12),
('equal', 5, 6, 12, 13),
('insert', 6, 6, 13, 19)]
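To see how these opcodes could lead to a score of 50, here is a rough sketch. It rests on two assumptions that the answer does not confirm: that fuzzywuzzy's partial_ratio lines up one candidate window of the longer string per matching block (starting at the block's offset in the longer string minus its offset in the shorter one), and that the ratio behaves like 2 * LCS / (len(a) + len(b)). The helpers lcs_length and indel_ratio are hypothetical stand-ins, not library functions, and the opcodes are copied from the output above rather than recomputed.

```python
# Opcodes copied from the output of
# opcodes('sleeve', 's l e e v sleeve e ') shown above.
OPCODES = [
    ('equal', 0, 1, 0, 1), ('insert', 1, 1, 1, 2),
    ('equal', 1, 2, 2, 3), ('insert', 2, 2, 3, 4),
    ('equal', 2, 3, 4, 5), ('insert', 3, 3, 5, 6),
    ('equal', 3, 4, 6, 7), ('insert', 4, 4, 7, 8),
    ('equal', 4, 5, 8, 9), ('insert', 5, 5, 9, 12),
    ('equal', 5, 6, 12, 13), ('insert', 6, 6, 13, 19),
]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (hypothetical helper)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == other else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Assumed stand-in for the ratio: 2 * LCS / total length."""
    total = len(a) + len(b)
    return 2 * lcs_length(a, b) / total if total else 1.0


shorter = 'sleeve'
longer = 's l e e v sleeve e '

# Assumption: one candidate window per 'equal' block, aligned so the
# block's position in `shorter` lines up with its position in `longer`.
starts = [max(b1 - a1, 0) for tag, a1, a2, b1, b2 in OPCODES if tag == 'equal']
windows = [longer[s:s + len(shorter)] for s in starts]
scores = [indel_ratio(shorter, w) for w in windows]
best = round(100 * max(scores))

print(starts)                # [0, 1, 2, 3, 4, 7]
print(longer.find(shorter))  # 10, which is not among the window starts
print(best)                  # 50
```

Under these assumptions, none of the candidate windows starts at index 10, where 'sleeve' actually sits, so the exact match is never scored.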
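This is clearly not the desired result when used in fuzzywuzzy, even though these are a minimal set of edit operations. In fuzzywuzzy, priority should be given to contiguous blocks, whereas the formal definition of Levenshtein distance does not favor contiguous blocks over non-contiguous ones (at least as I understand it). Note that difflib.SequenceMatcher.get_opcodes() gives a different result.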
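For comparison, the difflib result can be reproduced with the standard library alone. The sketch below uses the same two strings; the variable names are illustrative.

```python
from difflib import SequenceMatcher

shorter = 'sleeve'
longer = 's l e e v sleeve e '

# difflib looks for the longest contiguous matching block first, so the
# whole word 'sleeve' is matched as a single 'equal' run.
matcher = SequenceMatcher(None, shorter, longer)
difflib_opcodes = matcher.get_opcodes()

for op in difflib_opcodes:
    print(op)
# ('insert', 0, 0, 0, 10)
# ('equal', 0, 6, 10, 16)
# ('insert', 6, 6, 16, 19)
```

Here the single 'equal' block points straight at index 10 of the longer string, which is where the exact match starts.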
I suspect it will take some very careful thought to fix this bug and get it right.

For "python - Levenshtein distance gives strange values", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/66738821/
