elasticsearch - Elasticsearch match_phrase query for exact substring search

Reposted. Author: 行者123. Updated: 2023-12-02 23:44:21

I am using a match_phrase query to search for full-text matches.

However, it does not work the way I expected.

Query:

POST /_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "browsing_url": "/critical-illness"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

Result:
"hits" : [
  {
    "_source" : {
      "browsing_url" : "https://www.google.com/url?q=https://industrytoday.co.uk/market-research-industry-today/global-critical-illness-commercial-insurance-market-to-witness-a-pronounce-growth-during-2020-2025&usg=afqjcneelu0qvjfusnfjjte1wx0gorqv5q"
    }
  },
  {
    "_source" : {
      "browsing_url" : "https://www.google.com/search?q=critical+illness"
    }
  },
  {
    "_source" : {
      "browsing_url" : "https://www.google.com/search?q=critical+illness&tbm=nws"
    }
  },
  {
    "_source" : {
      "browsing_url" : "https://www.google.com/search?q=do+i+have+a+critical+illness+-insurance%3f"
    }
  },
  {
    "_source" : {
      "browsing_url" : "https://www.google.com/search?q=do+i+have+a+critical+illness%3f"
    }
  }
]

Expected:
I only want results where the given string is an exact substring of the field value. For example:

https://www.example.com/critical-illness OR
https://www.example.com/critical-illness-insurance


Mapping:
"browsing_url": {
  "type": "text",
  "norms": false,
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}


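The results are not what I expected. I want only documents in which the search string /critical-illness appears exactly as given, as a substring of the stored text.

Best answer

You are seeing unexpected results because both your search query and the field itself are run through an analyzer. The analyzer breaks the text into a list of individual terms that can be searched. Here is an example using the _analyze endpoint: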

GET _analyze
{
  "analyzer": "standard",
  "text": "example.com/critical-illness"
}

{
  "tokens" : [
    {
      "token" : "example.com",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "critical",
      "start_offset" : 12,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "illness",
      "start_offset" : 21,
      "end_offset" : 28,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}

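So although the real value of your document is example.com/critical-illness, behind the scenes Elasticsearch only uses this list of tokens for matching. Because you are using match_phrase, the same applies to your search query: the phrase you pass in is tokenized as well. The end result is that Elasticsearch tries to match the token list ["critical", "illness"] against the document's token list. Any URL whose tokens contain critical immediately followed by illness therefore matches, regardless of whether the original separator was /, - or +.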
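
To make the effect concrete, here is a minimal Python sketch of the matching described above. It is only a rough approximation: rough_standard_tokens is a hypothetical helper that mimics the standard analyzer with a regular expression (the real analyzer follows Unicode text segmentation rules), and phrase_matches simply checks for adjacent tokens in order, ignoring scoring and slop.

```python
import re


def rough_standard_tokens(text):
    """Rough stand-in for the standard analyzer (an assumption, not the real
    implementation): lowercase, keep runs of letters/digits, and keep dots
    that sit between such runs (so "example.com" stays one token)."""
    return re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower())


def phrase_matches(query, field_value):
    """True if the query tokens appear consecutively and in order
    inside the field tokens, which is the core idea of match_phrase."""
    q = rough_standard_tokens(query)
    f = rough_standard_tokens(field_value)
    return any(f[i:i + len(q)] == q for i in range(len(f) - len(q) + 1))


if __name__ == "__main__":
    print(rough_standard_tokens("/critical-illness"))
    for url in (
        "https://www.google.com/search?q=critical+illness",
        "https://www.example.com/critical-illness",
        "https://www.example.com/illness-critical",
    ):
        print(url, phrase_matches("/critical-illness", url))
```

Under these assumptions the leading / and the hyphen disappear during analysis, which is why the Google search URLs from the question match the phrase.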

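In most cases the standard analyzer does a good job of dropping characters you do not need. In your case, however, you care about characters such as / because you want to match against them. One way to solve this is to use a different analyzer, for example one built on the path_hierarchy tokenizer with reverse enabled (a reversed path hierarchy analyzer). Below is an example of how to configure this analyzer and use it for the browsing_url field: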
PUT /browse_history
{
  "settings": {
    "analysis": {
      "analyzer": {
        "url_analyzer": {
          "tokenizer": "url_tokenizer"
        }
      },
      "tokenizer": {
        "url_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "/",
          "reverse": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "browsing_url": {
        "type": "text",
        "norms": false,
        "analyzer": "url_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

Now, if you analyze a URL, you will see that the URL path stays intact:
GET browse_history/_analyze
{
  "analyzer": "url_analyzer",
  "text": "example.com/critical-illness?src=blah"
}

{
  "tokens" : [
    {
      "token" : "example.com/critical-illness?src=blah",
      "start_offset" : 0,
      "end_offset" : 37,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "critical-illness?src=blah",
      "start_offset" : 12,
      "end_offset" : 37,
      "type" : "word",
      "position" : 0
    }
  ]
}
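
The tokens above follow a simple pattern: the full text, plus every suffix that starts right after a delimiter. The following Python sketch reproduces that pattern for illustration only. reverse_path_tokens is a hypothetical helper, not part of any Elasticsearch client, and it ignores offsets, positions and the tokenizer's other options.

```python
def reverse_path_tokens(text, delimiter="/"):
    """Approximate the path_hierarchy tokenizer with reverse=true:
    emit the whole text and every non-empty suffix that begins
    immediately after an occurrence of the delimiter."""
    starts = [0] + [
        i + len(delimiter)
        for i in range(len(text))
        if text.startswith(delimiter, i)
    ]
    return [text[s:] for s in starts if s < len(text)]


if __name__ == "__main__":
    for token in reverse_path_tokens("example.com/critical-illness?src=blah"):
        print(token)
    for token in reverse_path_tokens("https://www.example.com/critical-illness"):
        print(token)
```

For the first input the sketch yields the same two tokens as the _analyze response. For a full URL it also yields a token that begins at the last path segment, which is what the prefix search relies on.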

This lets you run a match_phrase_prefix query to find all documents whose URL contains a path segment starting with critical-illness:
POST /browse_history/_search
{
  "query": {
    "match_phrase_prefix": {
      "browsing_url": "critical-illness"
    }
  }
}

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.7896894,
    "hits" : [
      {
        "_index" : "browse_history",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.7896894,
        "_source" : {
          "browsing_url" : "https://www.example.com/critical-illness"
        }
      }
    ]
  }
}
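
As a rough mental model of why this query returns the expected documents, the sketch below treats the search as "does any indexed token start with the query text". This is a simplification under stated assumptions: suffix_tokens and prefix_hit are hypothetical helpers, the query is assumed to analyze to a single token, and the real query additionally limits how many terms the prefix expands to (max_expansions, 50 by default).

```python
def suffix_tokens(url, delimiter="/"):
    """Tokens a reversed path hierarchy tokenizer would roughly produce:
    the full value plus each non-empty suffix after a delimiter."""
    parts = url.split(delimiter)
    candidates = (delimiter.join(parts[i:]) for i in range(len(parts)))
    return [token for token in candidates if token]


def prefix_hit(query, url):
    """Simplified match_phrase_prefix for a single-token query:
    true when any indexed token starts with the query text."""
    return any(token.startswith(query) for token in suffix_tokens(url))


if __name__ == "__main__":
    for url in (
        "https://www.example.com/critical-illness",
        "https://www.example.com/critical-illness-insurance",
        "https://www.google.com/search?q=critical+illness",
    ):
        print(url, prefix_hit("critical-illness", url))
```

In this model the two example.com URLs from the question match, while the Google search URL does not, because none of its suffix tokens begins with critical-illness.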

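Edit:

Before this revision, the answer used the keyword field with a regexp query. That is a very expensive query, because the leading .* means every term in the field has to be examined.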
POST /browse_history/_search
{
  "query": {
    "regexp": {
      "browsing_url.keyword": ".*/critical-illness"
    }
  }
}
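
One detail worth checking before relying on this pattern: Elasticsearch regexp queries are anchored, so the pattern has to match the entire keyword value. The Python sketch below imitates that with re.fullmatch. It is only an illustration, since Python's regex syntax is not identical to Lucene's, although it agrees for a pattern this simple; anchored_match is a hypothetical helper name.

```python
import re


def anchored_match(pattern, value):
    """Imitate the anchored behaviour of a regexp query:
    the pattern must cover the whole value, not just part of it."""
    return re.fullmatch(pattern, value) is not None


if __name__ == "__main__":
    urls = (
        "https://www.example.com/critical-illness",
        "https://www.example.com/critical-illness-insurance",
    )
    for pattern in (".*/critical-illness", ".*/critical-illness.*"):
        for url in urls:
            print(pattern, url, anchored_match(pattern, url))
```

Under this model, ".*/critical-illness" only accepts values that end in /critical-illness. A trailing .* would be needed to also accept https://www.example.com/critical-illness-insurance. Note as well that the mapping sets ignore_above to 256, so longer URLs are not indexed in the keyword sub-field at all.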

Regarding "elasticsearch - Elasticsearch match_phrase query for exact substring search", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62460687/
