
elasticsearch - Elasticsearch: how is docFreq calculated?

Reposted. Author: 行者123. Updated: 2023-12-02 22:27:46

I am trying to understand how docFreq is calculated. Is it per index, per mapping, or per field?

With explain set to true, I get the following results from my query.
When the hit is in the ListedName.standard mapping, docFreq is lower, as seen here:

 {
"value" : 16.316673,
"description" : """weight(ListedName.standard:"eagle pointe" in 48) [PerFieldSimilarity], result of:""",
"details" : [
{
"value" : 16.316673,
"description" : "score(doc=48,freq=1.0 = phraseFreq=1.0\n), product of:",
"details" : [
{
"value" : 3.0,
"description" : "boost",
"details" : [ ]
},
{
"value" : 5.4388914,
"description" : "idf(), sum of:",
"details" : [
{
"value" : 1.7870536,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 35.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 211.0,
"description" : "docCount",
"details" : [ ]
}
]
},
{
"value" : 3.651838,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 5.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 211.0,
"description" : "docCount",
"details" : [ ]
}
]
}
]
},
{
"value" : 1.0,
"description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from:",
"details" : [
{
"value" : 1.0,
"description" : "phraseFreq=1.0",
"details" : [ ]
},
{
"value" : 0.0,
"description" : "parameter k1",
"details" : [ ]
},
{
"value" : 0.0,
"description" : "parameter b (norms omitted for field)",
"details" : [ ]
}
]
}
]
}
]
},

Whereas when the match is in the Line1 mapping, docFreq is higher, as seen here:
  {
"value" : 1.1640041,
"description" : """weight(Line1:"eagle pointe" in 148) [PerFieldSimilarity], result of:""",
"details" : [
{
"value" : 1.1640041,
"description" : "score(doc=148,freq=1.0 = phraseFreq=1.0\n), product of:",
"details" : [
{
"value" : 3.0,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.38800138,
"description" : "idf(), sum of:",
"details" : [
{
"value" : 0.18813552,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 171.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 206.0,
"description" : "docCount",
"details" : [ ]
}
]
},
{
"value" : 0.19986586,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 169.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 206.0,
"description" : "docCount",
"details" : [ ]
}
]
}
]
},
{
"value" : 1.0,
"description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from:",
"details" : [
{
"value" : 1.0,
"description" : "phraseFreq=1.0",
"details" : [ ]
},
{
"value" : 0.0,
"description" : "parameter k1",
"details" : [ ]
},
{
"value" : 0.0,
"description" : "parameter b (norms omitted for field)",
"details" : [ ]
}
]
}
]
}
]
}
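
The numbers in both explain outputs can be checked by hand. The sketch below plugs the reported docFreq and docCount values into the idf and tfNorm formulas exactly as they are printed in the explain descriptions. The helper names (`idf`, `tf_norm`, `phrase_score`) are made up for illustration and are not part of any Elasticsearch or Lucene API.

```python
import math


def idf(doc_freq, doc_count):
    """idf as printed in the explain output:
    log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))"""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def tf_norm(freq, k1):
    """tfNorm as printed in the explain output: (freq * (k1 + 1)) / (freq + k1)"""
    return (freq * (k1 + 1)) / (freq + k1)


def phrase_score(boost, term_stats, freq, k1):
    """boost * (sum of the per-term idf values) * tfNorm."""
    idf_sum = sum(idf(df, dc) for df, dc in term_stats)
    return boost * idf_sum * tf_norm(freq, k1)


# (docFreq, docCount) pairs copied from the two explain outputs above.
listed_name = phrase_score(3.0, [(35, 211), (5, 211)], freq=1.0, k1=0.0)
line1 = phrase_score(3.0, [(171, 206), (169, 206)], freq=1.0, k1=0.0)

print(listed_name)
print(line1)
```

Both results agree with the reported scores (16.316673 and 1.1640041) up to floating-point rounding, so the whole score gap between the two fields comes from the idf term, that is, from the different docFreq values.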

Best answer

It should depend on how the scoring model is defined (cf. Similarity); the similarity algorithm can be set on a per-index or per-field basis.

Elasticsearch allows you to configure a scoring algorithm or similarity per field. The similarity setting provides a simple way of choosing a similarity algorithm other than the default BM25, such as TF/IDF.



Now, in the scoring explanation output we can see:
weight(<field>:"eagle pointe" in 48) [PerFieldSimilarity]

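In this situation, docFreq appears to be restricted to the number of documents that contain the term in that field. However, I did not find any extended information about this, and I am not sure about the logic behind it, because it should depend on the similarity class definition itself, not on the fact that a custom one has been set on a specific field.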
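
To make the per-field reading concrete, here is a toy sketch (not Lucene's implementation) that counts docFreq and docCount separately for each field. The documents and the `field_stats` helper are hypothetical, and tokenization is a naive lowercase whitespace split. It only illustrates that the same term can have very different statistics in different fields, which is the pattern visible in the two explain outputs (docCount 211 versus 206, docFreq 35 and 5 versus 171 and 169).

```python
from collections import defaultdict


def field_stats(docs):
    """Count term statistics per field for a list of {field: text} documents."""
    doc_freq = defaultdict(int)   # (field, term) -> docs containing term in that field
    doc_count = defaultdict(int)  # field -> docs that have at least one term in that field
    for doc in docs:
        for field, text in doc.items():
            terms = set(text.lower().split())
            if not terms:
                continue
            doc_count[field] += 1
            for term in terms:
                doc_freq[(field, term)] += 1
    return doc_freq, doc_count


# Hypothetical documents: "eagle" is rare in ListedName but common in Line1.
docs = [
    {"ListedName": "Eagle Pointe", "Line1": "100 Eagle Pointe Drive"},
    {"ListedName": "Oak Ridge", "Line1": "200 Eagle Pointe Drive"},
    {"ListedName": "Maple Court", "Line1": "300 Eagle Street"},
    {"ListedName": "Pine Hollow"},
]
doc_freq, doc_count = field_stats(docs)

for field in ("ListedName", "Line1"):
    print(field, "docFreq(eagle) =", doc_freq[(field, "eagle")],
          "docCount =", doc_count[field])
```

In this toy index a match on "eagle" in ListedName would get a much higher idf than a match in Line1, simply because the term is rarer in that field.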

It is possible to set a default similarity for the whole index and to specify one per field in the mapping settings: see Elasticsearch Reference [7.2] » Index modules » Similarity module
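
As a rough sketch of what such a configuration looks like, the Python dict below mirrors the JSON body of an index-creation request that sets an index-wide default similarity and overrides it on one field. The similarity name `my_bm25`, the `text` field type, and the parameter values are assumptions chosen for illustration; only the field name `Line1` comes from the question. Check the Similarity module page for the options supported by your version.

```python
import json

# Sketch of an index-creation body (assumed names and values, see note above).
index_body = {
    "settings": {
        "index": {
            "similarity": {
                # Similarity used by every field that does not name its own.
                "default": {"type": "BM25"},
                # Hypothetical custom similarity that a field can reference by name.
                "my_bm25": {"type": "BM25", "k1": 1.2, "b": 0.75},
            }
        }
    },
    "mappings": {
        "properties": {
            # Per-field override: this field is scored with "my_bm25".
            "Line1": {"type": "text", "similarity": "my_bm25"},
        }
    },
}

print(json.dumps(index_body, indent=2))
```

Fetching the index's actual settings and mappings and comparing them with a structure like this one shows whether a field overrides the default.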

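You might want to check which similarity is used as the default and whether any field mapping overrides it. For testing, I would try resetting the default to "classic" (tf-idf) and removing any existing overrides for these two fields, then check again whether docFreq stays consistent across fields (it could be a bug).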

cf. Lucene's TFIDFSimilarity

Regarding "elasticsearch - Elasticsearch: how is docFreq calculated?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57024432/
