
elasticsearch - How do I modify the standard analyzer to include #?

Reposted. Author: 行者123. Updated: 2023-12-02 22:44:43

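Some characters, such as #, are treated as delimiters, so they never match in a query. Which custom analyzer configuration comes closest to the standard analyzer while still allowing these characters to be matched?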

Best answer

1) The simplest way is to use the whitespace tokenizer together with the lowercase filter.

curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase&pretty' -d 'new year #celebration vegas'

This gives you
{
"tokens" : [ {
"token" : "new",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
}, {
"token" : "year",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 2
}, {
"token" : "#celebration",
"start_offset" : 9,
"end_offset" : 21,
"type" : "word",
"position" : 3
}, {
"token" : "vegas",
"start_offset" : 22,
"end_offset" : 27,
"type" : "word",
"position" : 4
} ]
}
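
To make the effect of option 1 concrete, here is a minimal local sketch in Python. It does not call Elasticsearch; it only imitates "split on whitespace, then lowercase" and contrasts it with a crude tokenizer that treats every non-word character as a delimiter. The function names are hypothetical, and the regex is only a rough stand-in for how the standard tokenizer drops #.

```python
import re

def whitespace_lowercase(text):
    """Split on whitespace only, then lowercase each token."""
    return [token.lower() for token in text.split()]

def word_chars_only(text):
    """Crude stand-in for a tokenizer that treats '#' and other punctuation as delimiters."""
    return [token.lower() for token in re.findall(r"\w+", text)]

text = "New Year #Celebration Vegas"
print(whitespace_lowercase(text))  # ['new', 'year', '#celebration', 'vegas']
print(word_chars_only(text))       # ['new', 'year', 'celebration', 'vegas']
```

The trade-off is that splitting on whitespace keeps all punctuation, not just #, so a token such as "vegas!" would be indexed with its exclamation mark. Note also that the curl call above uses the older query-string form of _analyze; from Elasticsearch 6.0 onward the same parameters are passed in a JSON request body instead.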

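2) If you only want to keep a few special characters, you can map them with a char filter, so that your text is converted into something else before tokenization happens. This stays closer to the standard analyzer. For example, you can create the index like this: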
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"special_analyzer": {
"char_filter": [
"special_mapping"
],
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding"
]
}
},
"char_filter": {
"special_mapping": {
"type": "mapping",
"mappings": [
"#=>hashtag\\u0020"
]
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"tweet": {
"type": "string",
"analyzer": "special_analyzer"
}
}
}
}
}
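
The doubled backslash in the mapping rule is easy to misread, so the following Python sketch shows what the rule looks like after JSON decoding. It assumes, as the answer implies, that the mapping char filter itself interprets the \u0020 escape as a space; parse_rule is a hypothetical helper that imitates that step and is not part of any Elasticsearch client.

```python
import json
import re

# The "mappings" array exactly as written in the request body above.
body_fragment = r'["#=>hashtag\\u0020"]'

rule = json.loads(body_fragment)[0]
# JSON collapses the doubled backslash to a single one; the escape is NOT decoded yet.
print(rule)       # #=>hashtag\u0020
print(len(rule))  # 16

def parse_rule(rule):
    """Hypothetical helper: split on '=>' and decode backslash-u escapes on both sides."""
    def decode(s):
        return re.sub(r"\\u([0-9a-fA-F]{4})", lambda m: chr(int(m.group(1), 16)), s)
    source, target = rule.split("=>", 1)
    return decode(source), decode(target)

print(parse_rule(rule))  # ('#', 'hashtag ')
```

In other words, the filter receives the six literal characters \u0020 and turns them into the trailing space that separates "hashtag" from the word that follows it.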

Now, running
curl -XPOST 'localhost:9200/my_index/_analyze?analyzer=special_analyzer&pretty' -d 'new year #celebration vegas'
the custom analyzer produces the following tokens:
{
"tokens" : [ {
"token" : "new",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "year",
"start_offset" : 4,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "hashtag",
"start_offset" : 9,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 3
}, {
"token" : "celebration",
"start_offset" : 10,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 4
}, {
"token" : "vegas",
"start_offset" : 22,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 5
} ]
}

So you can search like this:
GET my_index/_search
{
"query": {
"match": {
"tweet": "#celebration"
}
}
}

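You will also be able to search for just celebration, because the mapping uses the unicode escape \\u0020 for a space; otherwise we would always have to search with the #.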
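
The role of that trailing space can be checked with a small Python sketch. It is a simplified model rather than Elasticsearch itself: analyze is a hypothetical stand-in for special_analyzer (char mapping, split on non-word characters, lowercase; asciifolding is omitted), and matches models the default match query as "any shared token is a hit".

```python
import re

def analyze(text, replacement):
    """Rough stand-in for special_analyzer: map '#', split on non-word runs, lowercase."""
    mapped = text.replace("#", replacement)                  # char filter step
    return [t.lower() for t in re.findall(r"\w+", mapped)]   # tokenizer + lowercase

def matches(doc, query, replacement):
    """Model the default match query as OR: any token in common counts as a hit."""
    return bool(set(analyze(doc, replacement)) & set(analyze(query, replacement)))

doc = "new year #celebration vegas"

# With the trailing space, '#celebration' becomes two tokens.
print(analyze(doc, "hashtag "))                  # ['new', 'year', 'hashtag', 'celebration', 'vegas']
print(matches(doc, "celebration", "hashtag "))   # True

# Without it, the tag fuses into one token and the bare word no longer matches.
print(analyze(doc, "hashtag"))                   # ['new', 'year', 'hashtagcelebration', 'vegas']
print(matches(doc, "celebration", "hashtag"))    # False
print(matches(doc, "#celebration", "hashtag"))   # True
```

Under the same OR model, a query for "#celebration" also shares the "hashtag" token with any document that contains a different hashtag. If that is too loose, setting "operator": "and" on the match query or using match_phrase are possible ways to tighten it.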

Hope this helps!!

Regarding "elasticsearch - How do I modify the standard analyzer to include #?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/34754057/
