python - Filters not working in Elasticsearch

Reposted. Author: 行者123. Updated: 2023-12-02 22:40:41

I have the following mapping and settings for an index:

def init_index():
    ES_CLIENT.indices.create(
        index="social_media",
        body={
            "settings": {
                "index": {
                    "number_of_shards": 3,
                    "number_of_replicas": 0
                },
                "analysis": {
                    "analyzer": {
                        "my_english": {
                            "type": "standard",
                            "tokenizer": "whitespace",
                            "filter": [
                                "lowercase",
                                "asciifolding",
                                "cust_stop",
                                "my_snow"
                            ]
                        },
                        "my_english_shingle": {
                            "type": "standard",
                            "tokenizer": "whitespace",
                            "filter": [
                                "lowercase",
                                "asciifolding",
                                "cust_stop",
                                "my_snow",
                                "shingle_filter"
                            ]
                        }
                    },
                    "filter": {
                        "cust_stop": {
                            "type": "stop",
                            "stopwords_path": "stoplist.txt",
                        },
                        "shingle_filter": {
                            "type": "shingle",
                            "min_shingle_size": 2,
                            "max_shingle_size": 2,
                            "output_unigrams": True
                        },
                        "my_snow": {
                            "type": "snowball",
                            "language": "English"
                        }
                    }
                }
            }
        }
    )

    press_mapping = {
        "tweet": {
            "dynamic": "strict",
            "properties": {
                "_id": {
                    "type": "string",
                    "store": True,
                    "index": "not_analyzed"
                },
                "text": {
                    "type": "multi_field",
                    "fields": {
                        "text": {
                            "include_in_all": False,
                            "type": "string",
                            "store": False,
                            "index": "not_analyzed"
                        },
                        "_analyzed": {
                            "type": "string",
                            "store": True,
                            "index": "analyzed",
                            "term_vector": "with_positions_offsets",
                            "analyzer": "my_english"
                        },
                        "_analyzed_shingles": {
                            "type": "string",
                            "store": True,
                            "index": "analyzed",
                            "term_vector": "with_positions_offsets",
                            "analyzer": "my_english_shingle"
                        }
                    }
                }
            }
        }
    }

    constants.ES_CLIENT.indices.put_mapping(
        index="social_media",
        doc_type="tweet",
        body=press_mapping
    )

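I noticed that none of the filters other than lowercase are running. The term vectors of the two analyzers are identical, because shingle_filter is not working either.
GET /social_media/_analyze?analyzer=my_english_shingle&text=complaining when should drop when, stem complaining to complain, and return the shingle complain _, but instead it gives me: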
{
  "tokens": [
    {
      "token": "complaining",
      "start_offset": 0,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "when",
      "start_offset": 12,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
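
To compare analyzer output at a glance, the response above can be reduced to its token strings and token types. This is only a minimal sketch: it assumes the response has already been parsed into a Python dict, and `token_strings` and `token_types` are hypothetical helper names, not part of any Elasticsearch client.

```python
import json

# The _analyze response shown above, parsed from its JSON text.
RESPONSE = json.loads("""
{
  "tokens": [
    {"token": "complaining", "start_offset": 0, "end_offset": 11,
     "type": "<ALPHANUM>", "position": 1},
    {"token": "when", "start_offset": 12, "end_offset": 16,
     "type": "<ALPHANUM>", "position": 2}
  ]
}
""")


def token_strings(response):
    """Return the token texts of an _analyze response, in order."""
    return [entry["token"] for entry in response.get("tokens", [])]


def token_types(response):
    """Return the distinct token types of an _analyze response, sorted."""
    return sorted({entry["type"] for entry in response.get("tokens", [])})


print(token_strings(RESPONSE))  # neither stemmed nor stop-word filtered
print(token_types(RESPONSE))
```

Neither the stop filter nor the stemmer has touched the tokens, and every token has the type `<ALPHANUM>` rather than `shingle` or `word`.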

What could be the reason?

Best answer

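Because you are trying to define new custom analyzers rather than new standard analyzers, you need to change the type of both analyzers from standard to custom. The standard analyzer does not actually take any of the settings you are passing. Personally I would prefer ES to throw an exception in this case, but it simply creates a new standard analyzer with no customization and ignores everything else you passed (try removing lowercase from the analyzer and re-running the analysis: the output will still be lowercased!):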

"analyzer": {
  "my_english": {
    "type": "custom", // <--- CUSTOM
    "tokenizer": "whitespace",
    "filter": [
      "lowercase",
      "asciifolding",
      "stop",
      "my_snow"
    ]
  },
  "my_english_shingle": {
    "type": "custom", // <--- CUSTOM
    "tokenizer": "whitespace",
    "filter": [
      "lowercase",
      "asciifolding",
      "stop",
      "my_snow",
      "shingle_filter"
    ]
  }
}
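
For the Python code in the question, the same fix could be sketched as below. This is an illustration under assumptions, not a definitive implementation: `build_analysis` and `analyzers_with_ignored_chain` are hypothetical helpers, the built-in `stop` filter stands in for the question's `cust_stop` as in the snippet above, and the check is only a heuristic that flags analyzers which declare a tokenizer or filter chain without being explicitly of type custom.

```python
def build_analysis(analyzer_type="custom"):
    """Build the "analysis" section of the index settings.

    analyzer_type is a parameter only so that the broken ("standard")
    and fixed ("custom") variants can be compared.
    """
    base_filters = ["lowercase", "asciifolding", "stop", "my_snow"]
    return {
        "analyzer": {
            "my_english": {
                "type": analyzer_type,
                "tokenizer": "whitespace",
                "filter": list(base_filters),
            },
            "my_english_shingle": {
                "type": analyzer_type,
                "tokenizer": "whitespace",
                "filter": base_filters + ["shingle_filter"],
            },
        },
        "filter": {
            "shingle_filter": {
                "type": "shingle",
                "min_shingle_size": 2,
                "max_shingle_size": 2,
                "output_unigrams": True,
            },
            "my_snow": {"type": "snowball", "language": "English"},
        },
    }


def analyzers_with_ignored_chain(analysis):
    """Names of analyzers that declare a tokenizer/filter chain
    but are not explicitly of type "custom" (heuristic check)."""
    return sorted(
        name
        for name, spec in analysis.get("analyzer", {}).items()
        if spec.get("type") != "custom"
        and ("tokenizer" in spec or "filter" in spec)
    )


print(analyzers_with_ignored_chain(build_analysis("standard")))
print(analyzers_with_ignored_chain(build_analysis("custom")))
```

The returned dict would take the place of the "analysis" value in the body passed to `ES_CLIENT.indices.create` in the question.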

With this query (I changed the query text and replaced the custom stop filter with stop, since I don't have your file), GET /social_media/_analyze?analyzer=my_english_shingle&text=COMPLAINING TEST returns:
{
  "tokens": [
    {
      "token": "complain",
      "start_offset": 0,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "complain test",
      "start_offset": 0,
      "end_offset": 16,
      "type": "shingle",
      "position": 1
    },
    {
      "token": "test",
      "start_offset": 12,
      "end_offset": 16,
      "type": "shingle",
      "position": 2
    }
  ]
}
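
The token order in this response follows from how a two-word shingle filter with output_unigrams enabled combines adjacent tokens. The sketch below imitates that behaviour in plain Python for illustration only. It is not how Elasticsearch or Lucene implements the filter, `shingles` is a hypothetical function, and it does not model the filler token (`_`) that the real filter inserts where a stop word was removed.

```python
def shingles(tokens, size=2, output_unigrams=True, separator=" "):
    """Combine adjacent tokens into shingles (word n-grams) of a fixed size."""
    result = []
    for i, token in enumerate(tokens):
        if output_unigrams:
            # Emit the single token first, as the response above does.
            result.append(token)
        window = tokens[i:i + size]
        if len(window) == size:
            # Only emit a shingle when a full window of tokens is available.
            result.append(separator.join(window))
    return result


# Tokens after lowercasing and stemming "COMPLAINING TEST".
print(shingles(["complain", "test"]))
# → ['complain', 'complain test', 'test']
```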

I'm also not sure which ES version you are on, but I had to convert the boolean values True and False to lowercase (true and false).
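
This difference comes down to Python literals versus JSON. Inside a Python dict, as in the question, `True` is the correct spelling, and it becomes `true` when the dict is serialized. A raw JSON body needs the lowercase form. The sketch below assumes the answer's requests were sent as raw JSON, which the source does not state, and uses only the standard library:

```python
import json

# The shingle filter definition from the question, as a Python dict.
shingle_filter = {
    "type": "shingle",
    "min_shingle_size": 2,
    "max_shingle_size": 2,
    "output_unigrams": True,  # Python boolean literal
}

# Serializing turns the Python boolean into a JSON boolean.
as_json = json.dumps(shingle_filter, sort_keys=True)
print(as_json)
# → {"max_shingle_size": 2, "min_shingle_size": 2, "output_unigrams": true, "type": "shingle"}
```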

Regarding python - Filters not working in Elasticsearch, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31827046/
