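elasticsearch - Why doesn't "More Like This" in ElasticSearch respect TF-IDF order for a single term?

Reposted by: 行者123 Updated: 2023-12-02 22:34:36

I've been trying to understand the "More Like This" feature in ElasticSearch. I've read and re-read the documentation, but I can't work out why the following behavior occurs.

Basically, I insert three documents and run a "More Like This" query with max_query_terms=1, expecting the term with the higher TF-IDF to be used, but that doesn't seem to be the case.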

curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
   "message": "dog barks"
}';
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
   "message": "cat fur"
}';
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
   "message": "cat naps"
}';
curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
    "query": {
        "more_like_this" : {
            "like" : ["cat", "dog"],
            "fields" : ["message"],
            "minimum_should_match" : 1,
            "min_term_freq" : 1,
            "min_doc_freq" : 1,
            "max_query_terms" : 1
        }
    }
}';

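Expected output:

The "dog barks" document

Actual output:

The "cat naps" and "cat fur" documents (also, see the note on determinism below)

Explanation of the expected output:

The documentation says: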

Suppose we wanted to find all documents similar to a given input document. Obviously, the input document itself should be its best match for that type of query. And the reason would be mostly, according to Lucene scoring formula, due to the terms with the highest tf-idf. Therefore, the terms of the input document that have the highest tf-idf are good representatives of that document, and could be used within a disjunctive query (or OR) to retrieve similar documents. The MLT query simply extracts the text from the input document, analyzes it, usually using the same analyzer at the field, then selects the top K terms with highest tf-idf to form a disjunctive query of these terms.

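Since I specified max_query_terms = 1, only the term from the input document with the highest TF-IDF score should be used in the disjunctive query. In this case the input document has two terms. Both have the same term frequency in the input document, but cat occurs in twice as many documents in the corpus, so it has a higher document frequency. The TF-IDF score of dog should therefore be higher than that of cat, so I expect the disjunctive query to be just "message":"dog" and the result to be the "dog barks" document.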
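The expectation above can be checked by hand. The sketch below computes a textbook TF-IDF, tf * ln(N / df), for the two input terms against the three indexed messages. The formula is a simplification chosen for illustration (Lucene's own variant smooths the values differently), and the helper names `doc_freq` and `tf_idf` are hypothetical.

```python
import math

# The three indexed messages from the question.
corpus = ["dog barks", "cat fur", "cat naps"]
# Terms of the "like" input, each occurring once.
like_terms = ["cat", "dog"]


def doc_freq(term, docs):
    """Number of documents whose whitespace tokens include the term."""
    return sum(1 for doc in docs if term in doc.split())


def tf_idf(term, input_terms, docs):
    """Textbook tf * ln(N / df); an illustrative stand-in, not Lucene's exact formula."""
    tf = input_terms.count(term)
    return tf * math.log(len(docs) / doc_freq(term, docs))


scores = {term: tf_idf(term, like_terms, corpus) for term in like_terms}
top_term = max(scores, key=scores.get)

for term, score in scores.items():
    print(f"{term}: df={doc_freq(term, corpus)} tf-idf={score:.4f}")
print("top term:", top_term)
```

With statistics taken over the whole corpus, dog (df = 1) outranks cat (df = 2), which is the ordering the question expects.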

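I'd like to understand what is going on here. Any help is much appreciated. :)

Note on determinism

I tried re-running this setup several times. When I run the 4 ES commands above (3 POSTs + the MLT GET) after a curl -XDELETE 'http://localhost:9200/samples' command, sometimes I get "cat naps" and "cat fur", other times I get "cat naps", "cat fur", and "dog barks", and a couple of times I even get only "dog barks".

Full outputs

Earlier I hand-waved and only summarized what the GET query returned. Let me be more precise.

Actual output #1 (happens sometimes):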

{"took":1,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":2,"max_score":0.6931472,"hits":
[{"_index":"samples","_type":"_doc","_id":"UHAoI3IBapDWjHWvsQ0_","_score":0.6931472,"_source":{
   "message": "cat fur"
}},{"_index":"samples","_type":"_doc","_id":"UXAoI3IBapDWjHWvsQ1c","_score":0.2876821,"_source":{
   "message": "cat naps"
}}]}}

Actual output #2 (happens sometimes):

{"took":2,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":3,"max_score":0.2876821,"hits":
[{"_index":"samples","_type":"_doc","_id":"VHAtI3IBapDWjHWvvA0B","_score":0.2876821,"_source":{
   "message": "cat fur"
}},{"_index":"samples","_type":"_doc","_id":"U3AtI3IBapDWjHWvuw3l","_score":0.2876821,"_source":{
   "message": "dog barks"
}},{"_index":"samples","_type":"_doc","_id":"VXAtI3IBapDWjHWvvA0V","_score":0.2876821,"_source":{
   "message": "cat naps"
}}]}}

Actual output #3 (happens least often of the three):

{"took":1,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":1,"max_score":0.9808292,"hits":
[{"_index":"samples","_type":"_doc","_id":"WXAzI3IBapDWjHWvbQ3s","_score":0.9808292,"_source":{
   "message": "dog barks"
}}]}}
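The three distinct _score values above can be reproduced arithmetically. The sketch below assumes the index uses BM25 with k1 = 1.2 and b = 0.75, and that each hit is scored with the document counts of its own shard; the post states neither, so the shard layouts in the comments are an inference from the numbers. The function names are hypothetical.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """BM25 idf: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Single-term BM25 score with the (k1 + 1) numerator factor."""
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return bm25_idf(doc_count, doc_freq) * norm


# Every message has two tokens, so doc_len == avg_doc_len and the tf part is 1.
# Keys are the observed scores; values are the assumed (N, df) on the shard.
observed = {
    0.2876821: (1, 1),  # matching doc is alone on its shard
    0.6931472: (2, 1),  # shard holds the matching doc plus one other doc
    0.9808292: (3, 1),  # all three docs on one shard, one of them matching
}

for score, (n_docs, df) in observed.items():
    computed = bm25_score(tf=1, doc_len=2, avg_doc_len=2, doc_count=n_docs, doc_freq=df)
    print(f"N={n_docs} df={df}: computed={computed:.7f} observed={score}")
```

Under these assumptions the scores line up with shard-local counts of 1, 2, and 3 documents, which suggests the three documents land on different shards from one run to the next.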

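Tried spaced-out inserts and more MLT runs

Maybe elasticsearch was in some odd "processing state" and needed a little time between documents. So I gave ES some time between inserting the documents and running the GET command.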

filename="testEsOutput-10-incremental.txt"
amount=10
echo "Test-10-incremental"
for i in {1..10}
do
    curl -XDELETE 'http://localhost:9200/samples';
    sleep $amount
    curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
       "message": "dog barks"
    }';
    sleep $amount
    curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
       "message": "cat fur"
    }';
    sleep $amount
    curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
       "message": "cat naps"
    }';
    sleep $amount

    curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
        "query": {
            "more_like_this" : {
                "like" : ["cat", "dog"],
                "fields" : ["message"],
                "minimum_should_match" : 1,
                "min_term_freq" : 1,
                "min_doc_freq" : 1,
                "max_query_terms" : 1
            }
        }
    }' >> "$filename"
    # printf interprets \n, whereas bash's builtin echo writes it literally
    printf '\n----\n\n' >> "$filename"
    printf '%s\n\n' '----' >> "$filename"
done
echo "Done!"

However, this did not seem to affect the nondeterministic output in any meaningful way.

Tried `search_type=dfs_query_then_fetch`

Following this SO post about ES nondeterminism, I tried adding the dfs_query_then_fetch option, i.e.

curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/?search_type=dfs_query_then_fetch' -d '{
        "query": {
            "more_like_this" : {
                "like" : ["cat", "dog"],
                "fields" : ["message"],
                "minimum_should_match" : 1,
                "min_term_freq" : 1,
                "min_doc_freq" : 1,
                "max_query_terms" : 1
            }
        }
    }'

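However, the results are still not deterministic, and they still vary among the three outputs above.

Additional notes

I tried looking at additional debugging information via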

curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_validate/query?rewrite=true' -d '{
    "query": {
        "more_like_this" : {
            "like" : ["cat", "dog"],
            "fields" : ["message"],
            "minimum_should_match" : 1,
            "min_term_freq" : 1,
            "min_doc_freq" : 1,
            "max_query_terms" : 1
        }
    }
}';

but this sometimes outputs

{"_shards":{"total":1,"successful":1,"failed":0},"valid":true,"explanations":
[{"index":"samples","valid":true,"explanation":"message:cat"}]}

and other times

{"_shards":{"total":1,"successful":1,"failed":0},"valid":true,"explanations":
[{"index":"samples","valid":true,"explanation":"like:[cat, dog]"}]}

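so even this output is not deterministic (across back-to-back runs).

Note: tested on ElasticSearch 6.8.8, both locally and in an online REPL. I also tested with an actual document as the input, e.g.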

curl -XPUT --header 'Content-Type: application/json' http://localhost:9200/samples/_doc/72 -d '{
   "message" : "dog cat"
}';
curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
    "query": {
        "more_like_this" : {
            "like" : {
                "_id" : "72"
            }
            ,
            "fields" : ["message"],
            "minimum_should_match" : 1,
            "min_term_freq" : 1,
            "min_doc_freq" : 1,
            "max_query_terms" : 1
        }
    }
}';

but got the same "cat naps" and "cat fur" documents.

Best answer

OK, after a lot of debugging, I tried restricting the index to a single shard, i.e.

curl -XPUT --header 'Content-Type: application/json' 'http://localhost:9200/samples' -d '{
    "settings" : {
        "index" : {
            "number_of_shards" : 1, 
            "number_of_replicas" : 0 
        }
    }
}';

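When I did this, I got only the "dog barks" document 100% of the time.

It seems that even with the search_type=dfs_query_then_fetch option (on a multi-shard index), ES still doesn't do a fully accurate job. I'm not sure what other options I could use to force accurate behavior. Maybe someone else can answer with more insight.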
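One way to picture why a single shard helps is a toy model in which each shard picks its one query term from its own document frequencies. This is an assumption for illustration, not something the post or the quoted documentation confirms; the function names, the shard layouts, and the alphabetical tie-break are all hypothetical.

```python
def select_term(like_terms, shard_docs, min_doc_freq=1):
    """Pick one query term using only this shard's document frequencies.

    Rarer terms win (higher idf); ties fall back to alphabetical order,
    which is an arbitrary choice made for this sketch.
    """
    freqs = {}
    for term in like_terms:
        df = sum(1 for doc in shard_docs if term in doc.split())
        if df >= min_doc_freq:
            freqs[term] = df
    if not freqs:
        return None
    return min(freqs, key=lambda t: (freqs[t], t))


def search(like_terms, shards):
    """Run term selection and matching independently on every shard."""
    hits = []
    for shard_docs in shards:
        term = select_term(like_terms, shard_docs)
        if term is not None:
            hits.extend(doc for doc in shard_docs if term in doc.split())
    return sorted(hits)


like = ["cat", "dog"]

one_shard = [["dog barks", "cat fur", "cat naps"]]
spread_out = [["dog barks"], ["cat fur"], ["cat naps"], [], []]
mixed = [["dog barks", "cat fur"], ["cat naps"], [], [], []]

print(search(like, one_shard))   # single shard: global statistics
print(search(like, spread_out))  # each document alone on a shard
print(search(like, mixed))       # tie between cat and dog on the first shard
```

In this model the single-shard layout returns only "dog barks", while the two five-shard layouts return the other two result sets seen in the question. If term selection really is shard-local, an option that only changes how scores are computed would not be expected to change which term each shard picks.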

Regarding elasticsearch - Why doesn't "More Like This" in ElasticSearch respect TF-IDF order for a single term?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/61849718/
