
python - Removing similar documents in Python

Reposted. Author: 塔克拉玛干. Updated: 2023-11-03 05:31:51

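I have a folder containing subtitles for TV series. I want to get one subtitle file per episode from that folder. My problem is that some subtitles belong to the same episode but have different names, for example: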

/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.S09E02.720p.HDTV.x264-MOMENTUM.HI.srt
/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.902.720p.HDTV.x264.MOMENTUM.srt
/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.9X02.HDTV.XviD-MOMENTUM.HI.srt
/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.S09E02.HDTV.XviD-MOMENTUM.srt

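So they are very similar, but not 100% identical.

How can I remove the duplicate documents and keep only one subtitle for each distinct episode?
I would attach what I have tried, but unfortunately I am quite clueless...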

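Best answer

You can use the cosine similarity between documents.

The assumption is that similar documents will have a high similarity score. You can then apply a threshold: any pair of documents scoring above it is treated as the same document.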
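For reference, this is the standard definition of cosine similarity for two term-count vectors. The symbols a and b are notation introduced here for illustration, not taken from the answer:

```latex
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \, \sqrt{\sum_i b_i^2}}
```

Because word counts are never negative, the value lies between 0 (no words in common) and 1 (identical word proportions).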

For example, if these are your documents:

1. "the child went home today, and his mother waited for him"
2. "My car is big, so said my mother"
3. "The kid went to his house today, while his mama waited for him to come"

Using the code from vpekar's answer, I do the following:

>>> v1 = text_to_vector("the child went home today, and his mother waited for him")
>>> v2 = text_to_vector("My car is big, so said my mother")
>>> v3 = text_to_vector("The kid went to his house today, while his mama waited for him to come")

The cosine similarities between the vectors are:

>>> get_cosine(v1,v2)
0.10660035817780521

>>> get_cosine(v1,v3)
0.48420012470625223

>>> get_cosine(v2,v3)
0.0
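The helpers text_to_vector and get_cosine are not defined in this post. Here is a minimal sketch, assuming they build plain word-count vectors with collections.Counter and a case-sensitive `\w+` tokenizer. That is an assumption about the referenced code, although it does reproduce the three values shown above:

```python
import math
import re
from collections import Counter

# Assumed tokenizer: runs of word characters, case-sensitive.
WORD = re.compile(r"\w+")


def text_to_vector(text):
    """Turn a text into a bag-of-words vector (word -> count)."""
    return Counter(WORD.findall(text))


def get_cosine(vec1, vec2):
    """Cosine similarity between two word-count vectors."""
    common = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in common)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    denominator = norm1 * norm2
    # An empty document has zero norm; treat it as similar to nothing.
    return numerator / denominator if denominator else 0.0
```

Under this tokenizer, v1 and v2 share only the word "mother". Every count in both vectors is 1, with 11 and 8 distinct words respectively, so the similarity is 1 / sqrt(11 * 8), which is about 0.1066.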

So you can clearly see that documents 1 and 3 are the most similar, and are therefore probably subtitles for the same episode. To summarize:

1. You need to make (n choose 2) comparisons, that is, check every possible pair of documents.
2. If the cosine similarity between two documents is higher than a threshold, which you will have to find by trial and error, the subtitles are probably for the same episode and you should remove one of them.
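The two steps above can be sketched as follows. This is only an illustration: the function name find_duplicates, the document names, and the threshold of 0.4 are all hypothetical, and the helper definitions assume a simple case-sensitive word-count vectorizer.

```python
import math
import re
from collections import Counter
from itertools import combinations

WORD = re.compile(r"\w+")  # assumed tokenizer, case-sensitive


def text_to_vector(text):
    return Counter(WORD.findall(text))


def get_cosine(vec1, vec2):
    common = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in common)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    return numerator / (norm1 * norm2) if norm1 and norm2 else 0.0


def find_duplicates(docs, threshold):
    """Return the names of documents to drop.

    docs maps a document name to its text. Every pair is visited once;
    when a pair scores above the threshold, the later document is dropped.
    """
    vectors = {name: text_to_vector(text) for name, text in docs.items()}
    to_remove = set()
    for a, b in combinations(vectors, 2):
        # Only compare documents that are still being kept.
        if a in to_remove or b in to_remove:
            continue
        if get_cosine(vectors[a], vectors[b]) > threshold:
            to_remove.add(b)
    return to_remove


docs = {
    "doc1": "the child went home today, and his mother waited for him",
    "doc2": "My car is big, so said my mother",
    "doc3": "The kid went to his house today, while his mama waited for him to come",
}
removed = find_duplicates(docs, threshold=0.4)  # 0.4 is an arbitrary example value
print(sorted(removed))
```

With real subtitles, the texts would be read from the .srt files, and the threshold would have to be tuned on a few known duplicates, since the right value depends on the data.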

Regarding python - Removing similar documents in Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42903174/
