作者热门文章
- iOS/Objective-C 元类和类别
- objective-c - -1001 错误,当 NSURLSession 通过 httpproxy 和/etc/hosts
- java - 使用网络类获取 url 地址
- ios - 推送通知中不播放声音
我正在为我的学校开发一个 MongoDB 项目。我有一个句子集合,我做一个普通的文本搜索来找到集合中最相似的句子,这是基于评分的。
我运行此查询
db.sentences.find({$text: {$search: "any text"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}})
当我查询句子时看看这些结果,
"that kicking a dog causes it pain"
----Matched With
"that kicking a dog causes it pain – is not very controversial."
----Give a Result of:
*score: 2.4*
"This sentence have nothing to do with any other"
----Matched With
"Who is the “He” in this sentence?"
----Give a result of:
*Score: 1.0*
什么是分数值?这是什么意思?如果我想显示只有 70% 及以上相似度的结果怎么办。
我如何解释得分结果以便显示相似性百分比,我使用 C# 来执行此操作,但不要担心实现。我不介意伪代码解决方案!
最佳答案
当您使用 MongoDB 文本索引时,它会为每个匹配的文档生成一个分数。该分数表示您的搜索字符串与文档的匹配程度。分数越高,与搜索文本相似的可能性就越大。分数计算方式:
Step 1: Let the search text = S
Step 2: Break S into tokens (If you are not doing a Phrase search). Let's say T1, T2..Tn. Apply Stemming to each token
Step 3: For every search token, calculate score per index field of text index as follows:
score = (weight * data.freq * coeff * adjustment);
Where :
weight = user Defined Weight for any field. Default is 1 when no weight is specified
data.freq = how frequently the search token appeared in the text
coeff = (0.5 * data.count / numTokens) + 0.5
data.count = Number of matching token
numTokens = Total number of tokens in the text
adjustment = 1 (By default).If the search token is exactly equal to the document field then adjustment = 1.1
Step 4: Final score of document is calculated by adding all tokens scores per text index field
Total Score = score(T1) + score(T2) + .....score(Tn)
如我们所见,分数受以下因素影响:
以下是您的一份文件的推导:
Search String = This sentence have nothing to do with any other
Document = Who is the “He” in this sentence?
Score Calculation:
Step 1: Tokenize search string.Apply Stemming and remove stop words.
Token 1: "sentence"
Token 2: "nothing"
Step 2: For every search token obtained in Step 1, do steps 3-11:
Step 3: Take Sample Document and Remove Stop Words
Input Document: Who is the “He” in this sentence?
Document after stop word removal: "sentence"
Step 4: Apply Stemming
Document in Step 3: "sentence"
After Stemming : "sentence"
Step 5: Calculate data.count per search token
data.count(sentence)= 1
data.count(nothing)= 1
Step 6: Calculate total number of token in document
numTokens = 1
Step 7: Calculate coefficient per search token
coeff = (0.5 * data.count / numTokens) + 0.5
coeff(sentence) = 0.5*(1/1) + 0.5 = 1.0
coeff(nothing) = 0.5*(1/1) + 0.5 = 1.0
Step 8: Calculate adjustment per search token (Adjustment is 1 by default. If the search text match exactly with the raw document only then adjustment = 1.1)
adjustment(sentence) = 1
adjustment(nothing) = 1
Step 9: weight of field (1 is default weight)
weight = 1
Step 10: Calculate frequency of search token in document (data.freq)
For ever ith occurrence, the data frequency = 1/(2^i). All occurrences are summed.
a. Data.freq(sentence)= 1/(2^0) = 1
b. Data.freq(nothing)= 0
Step 11: Calculate score per search token per field:
score = (weight * data.freq * coeff * adjustment);
score(sentence) = (1 * 1 * 1.0 * 1.0) = 1.0
score(nothing) = (1 * 0 * 1.0 * 1.0) = 0
Step 12: Add individual score for every token of search string to get total score
Total score = score(sentence) + score(nothing) = 1.0 + 0.0 = 1.0
同理,可以推导出另一个。
有关更详细的 MongoDB 分析,请查看: Mongo Scoring Algorithm Blog
关于MongoDB 全文搜索分数 "What does Score means?",我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/43041494/
我是一名优秀的程序员,十分优秀!