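I am trying to use difflib.SequenceMatcher to compute the similarity between two files. The two files are almost identical, except that one contains some extra whitespace and blank lines and the other does not. I am using the following for this:
s = difflib.SequenceMatcher(isjunk, text1, text2)
ratio = s.ratio()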
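So the question is how to write the lambda expression for isjunk so that SequenceMatcher ignores all whitespace, blank lines, and so on. I tried passing lambda x: x=="", but the results were not good: for two very similar texts the ratio is very low, which is quite counterintuitive.
For testing purposes, here are two strings you can use: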
What Motivates jwovu to do your Job Well? OK, this is an entry trying to win $100 worth of software development books despite the fact that I don‘t read
programming books. In order to win the prize you have to write an entry and
what motivatesfggmum to do your job well. Hence this post. First motivationmoney. I know, this doesn‘t sound like a great inspiration to many, and saying that money is one of the motivation factors might just blow my chances away.
As if money is a taboo in programming world. I know there are people who can‘t be motivated by money. Mme, on the other hand, am living in a real world,
with house mortgage to pay, myself to feed and bills to cover. So I can‘t really exclude money from my consideration. If I can get a large sum of money for
doing a good job, then definitely boost my morale. I won‘t care whether I am using an old workstation, or forced to share rooms or cubicle with other
people, or have to put up with an annoying boss, or whatever. The fact that at the end of the day I will walk off with a large pile of money itself is enough
for me to overcome all the obstacles, put up with all the hard feelings and hurt egos, tolerate a slow computer and even endure
Here is the other string:
What Motivates You to do your Job Well? OK, this is an entry trying to win $100 worth of software development books, despite the fact that I don't read programming books. In order to win the prize you have to write an entry and describes what motivates you to do your job well. Hence this post.
First motivation, money. I know, this doesn't sound like a great inspiration to many, and saying that money is one of the motivation factors might just blow my chances away. As if money is a taboo in programming world. I know there are people who can't be motivated by money. Kudos to them. Me, on the other hand, am living in a real world, with house mortgage to pay, myself to feed and bills to cover. So I can't really exclude money from my consideration.
If I can get a large sum of money for doing a good job, then thatwill definitely boost my morale. I won't care whether I am using an old workstation, or forced to share rooms or cubicle with other people, or have to put up with an annoying boss, or whatever. The fact that at the end of the day I will walk off with a large pile of money itself is enough for me to overcome all the obstacles, put up with all the hard feelings and hurt egos, tolerate a slow computer and even endure
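I ran the code above with isjunk set to lambda x: x=="", and the ratio was only 0.36.
Best answer
If you treat all whitespace characters as junk, the similarity is better: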
difflib.SequenceMatcher(lambda x: x in " \t\n", doc1, doc2).ratio()
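A minimal sketch of why the original lambda had no effect, plus one alternative that is not in the answer above: when the inputs are strings, isjunk is called with single characters, and a single character is never equal to "". Also, junk elements are only excluded from seeding matches; they still count toward the total length that ratio() divides by. If whitespace should not matter at all, one option is to collapse it before comparing. The helper name whitespace_insensitive_ratio and the sample strings are made up for illustration.

```python
import difflib


def whitespace_insensitive_ratio(a: str, b: str) -> float:
    """Compare two strings after collapsing every run of whitespace to one space."""
    norm_a = " ".join(a.split())
    norm_b = " ".join(b.split())
    # autojunk=False switches off the heuristic that treats very frequent
    # elements as junk once a sequence has 200 or more items.
    matcher = difflib.SequenceMatcher(None, norm_a, norm_b, autojunk=False)
    return matcher.ratio()


# Hypothetical sample inputs: same words, different spacing and line breaks.
text1 = "What motivates   you\n\nto do your job well?"
text2 = "What motivates you to do your job well?"

# The lambda from the question: a one-character string never equals "",
# so no element is ever classified as junk.
never_junk = lambda ch: ch == ""
print(any(never_junk(ch) for ch in text1))          # False
print(whitespace_insensitive_ratio(text1, text2))   # 1.0
```

Normalizing first changes what is being measured (spacing differences are discarded entirely), so it only fits cases where whitespace really carries no meaning.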
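However, difflib is not ideal for this kind of problem: these are two nearly identical documents, but typos and similar noise produce differences for difflib that a human reader would not notice.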
Try reading up on tf-idf, Bayesian probability, vector space models, and w-shingling.
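As a rough illustration of w-shingling, the sketch below breaks each text into overlapping runs of w words and compares the two sets with the Jaccard coefficient. The function names shingles and resemblance, the default w=3, and the whitespace tokenization are assumptions made for this example, not something taken from the answer.

```python
def shingles(text: str, w: int = 3) -> set:
    """Return the set of w-word shingles (as tuples) found in text."""
    words = text.split()  # split() already ignores spaces, tabs and blank lines
    if len(words) < w:
        # Too short for a full shingle: use the whole text as a single one.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a: str, b: str, w: int = 3) -> float:
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


print(resemblance("a b c d", "a  b\n c d"))  # 1.0, only the spacing differs
print(resemblance("a b c d", "a b c e"))     # one shared shingle out of three
```

Because tokens are whole words, a single typo only disturbs the few shingles that contain it rather than shifting the entire alignment.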
I wrote an implementation of tf-idf that applies it to a vector space and uses the dot product as a distance measure to classify documents.
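The linked implementation is not reproduced here. As a stand-in, here is a small standard-library sketch of the general idea: build a tf-idf weighted vector per document, then compare vectors with a normalized dot product (cosine similarity). The function names, the whitespace tokenizer, and the smoothed idf formula are all choices made for this example and may differ from what the author actually wrote.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (an assumption of this sketch) keeps a non-zero
        # weight for terms that occur in every document.
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Dot product of two sparse vectors divided by the product of their lengths."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


# Hypothetical sample documents.
docs = ["money motivates me", "money   motivates\nme", "a slow computer"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # close to 1.0: same words, different spacing
print(cosine(vecs[0], vecs[2]))  # 0.0: no words in common
```

This is a bag-of-words measure: word order and whitespace are ignored, and a misspelled word only loses the weight of that one term.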
About python - Difflib.SequenceMatcher isjunk optional parameter query: how to ignore whitespaces, tabs, empty lines?: a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/147437/