
html - Elasticsearch : Strip HTML tags before indexing docs with html_strip filter not working

Reposted. Author: 技术小花猫. Updated: 2023-10-29 12:15:51

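Given that I have specified the html_strip char filter in my custom analyzer

When I index a document with HTML content

Then I expect the HTML to be stripped from the indexed content

And the document returned on retrieval from the index should not contain HTML

Actual: the indexed document contains HTML, and the retrieved document contains HTML.

I have tried specifying the analyzer as index_analyzer, as one would expect, and, out of desperation, as search_analyzer and analyzer. None of them seems to have any effect on the document being indexed or retrieved.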

Test: indexing a document against the html_strip-analyzed fields:

Request: sample POST of a document with HTML content

POST /html_poc_v2/html_poc_type/02
{
  "description": "Description <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
  "title": "Title <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
  "body": "Body <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>"
}

Expected: the indexed data has been parsed by the HTML analyzer. Actual: the data is indexed with the HTML intact.

Response

{
  "_index": "html_poc_v2", "_type": "html_poc_type", "_id": "02", ...
  "_source": {
    "description": "Description <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
    "title": "Title <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
    "body": "Body <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>"
  }
}

Settings and document mapping

PUT /html_poc_v2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_html_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    },
    "mappings": {
      "html_poc_type": {
        "properties": {
          "body": {
            "type": "string",
            "analyzer": "my_html_analyzer"
          },
          "description": {
            "type": "string",
            "analyzer": "my_html_analyzer"
          },
          "title": {
            "type": "string",
            "search_analyser": "my_html_analyzer"
          },
          "urlTitle": {
            "type": "string"
          }
        }
      }
    }
  }
}

Test to prove that the custom analyzer works perfectly:

Request

GET /html_poc_v2/_analyze?analyzer=my_html_analyzer
{<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>}

Response

{
  "tokens": [
    {
      "token": "Some",… "position": 1
    },
    {
      "token": "déjà",… "position": 2
    },
    {
      "token": "vu",… "position": 3
    },
    {
      "token": "website",… "position": 4
    }
  ]
}

Under the hood

Further proof, via an inline script, that my HTML analyzer must be getting skipped

Request

GET /html_poc_v2/html_poc_type/_search?pretty=true
{
  "query" : {
    "match_all" : { }
  },
  "script_fields": {
    "terms" : {
      "script": "doc[field].values",
      "params": {
        "field": "title"
      }
    }
  }
}

Response

{ …
  "hits": { ..
    "hits": [
      {
        "_index": "html_poc_v2",
        "_type": "html_poc_type",

        "fields": {
          "terms": [
            [
              "a",
              "agrave",
              "d",
              "eacute",
              "href",
              "http",
              "j",
              "p",
              "some",
              "somedomain.com",
              "title",
              "vu",
              "website"
            ]
          ]
        }
      }
    ]
  }
}
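
The term list above is what one would expect if the title had been tokenized with no HTML stripping at all. As a rough check, the sketch below approximates the standard analyzer with a regular expression plus lowercasing. The regex and the function name rough_standard_terms are assumptions made for illustration, not Lucene's actual tokenizer.

```python
import re

# Rough stand-in for the standard tokenizer plus lowercasing (an assumption,
# not Lucene's real implementation): runs of word characters, with dots
# allowed between them so that "somedomain.com" stays a single token.
TOKEN_RE = re.compile(r"\w+(?:\.\w+)*")


def rough_standard_terms(text):
    """Return the sorted, de-duplicated, lowercased terms found in `text`."""
    return sorted({match.group(0).lower() for match in TOKEN_RE.finditer(text)})


# The title value from the sample document, with its markup left in place.
title = 'Title <p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>'
terms = rough_standard_terms(title)
print(terms)
```

Tag names, attribute names, and fragments of the character entities all end up as terms, which is consistent with the field having been analyzed without the html_strip char filter.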

This is similar to this question: Why HTML tag is searchable even if it was filtered in elastic search

I have also read this great documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html

ES version: 1.7.2

Please help.

Best answer

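You are confusing the "_source" field in the response with what is actually analyzed and indexed. It looks like you expect the _source field in the response to return the analyzed document. That is not the case.

From the documentation: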

The _source field contains the original JSON document body that was passed at index time. The _source field itself is not indexed (and thus is not searchable), but it is stored so that it can be returned when executing fetch requests, like get or search.
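
To make the distinction concrete, here is a toy in-memory model of the two separate stores. It is only a sketch: ToyIndex, analyze, and the crude regex-based tag stripping are hypothetical names and simplifications, not how Elasticsearch is implemented, and the sample body is a simplified variant of the one in the question.

```python
import re
from html import unescape


def analyze(text):
    """Toy analysis chain: strip tags, decode entities, split into lowercase words."""
    without_tags = re.sub(r"<[^>]*>", " ", text)  # crude stand-in for html_strip
    return [token.lower() for token in re.findall(r"\w+", unescape(without_tags))]


class ToyIndex:
    """Hypothetical model: the original document and the indexed terms live apart."""

    def __init__(self):
        self.sources = {}   # doc id -> original document, stored verbatim
        self.postings = {}  # term -> set of doc ids, built from analyzed text

    def index(self, doc_id, doc):
        self.sources[doc_id] = doc  # the stored source never passes through the analyzer
        for value in doc.values():
            for term in analyze(value):
                self.postings.setdefault(term, set()).add(doc_id)

    def get(self, doc_id):
        return {"_id": doc_id, "_source": self.sources[doc_id]}

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


toy = ToyIndex()
toy.index("02", {"body": "Body <p>Some d&eacute;j&agrave; vu <a href='http://somedomain.com'>website</a>"})

print(toy.get("02")["_source"]["body"])  # the original HTML, unchanged
print(toy.search("déjà"))                # matched through the analyzed terms
print(toy.search("href"))                # markup never reached the term store
```

In this model the analyzer only decides which terms are searchable; whatever was submitted is what comes back in _source.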

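Ideally, in the case above, where you want to format the source data for presentation, that should be done on the client side.

That said, one way to achieve the above use case is to use script fields together with the keyword tokenizer, as follows: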

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_html_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ]
        },
        "parsed_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "my_html_analyzer",
          "fields": {
            "parsed": {
              "type": "string",
              "analyzer": "parsed_analyzer"
            }
          }
        }
      }
    }
  }
}


PUT test/test/1
{
  "body" : "Title <p> Some d&eacute;j&agrave; vu <a href='http://somedomain.com'> website </a> <span> this is inline </span></p> "
}

GET test/_search
{
  "query" : {
    "match_all" : { }
  },
  "script_fields": {
    "terms" : {
      "script": "doc[field].values",
      "params": {
        "field": "body.parsed"
      }
    }
  }
}

Result:

{
  "_index": "test",
  "_type": "test",
  "_id": "1",
  "_score": 1,
  "fields": {
    "terms": [
      "Title \n Some déjà vu website this is inline \n "
    ]
  }
}

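Note: I think the above is a bad idea, because stripping HTML tags can easily be done on the client side, where you have far more control over formatting than you get by relying on a workaround like this. More importantly, doing it on the client side may also perform better.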
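
As an illustration of the client-side route, here is a minimal sketch that uses only Python's standard library. The names TextExtractor and strip_html are hypothetical, and the whitespace handling is one possible choice rather than a reproduction of what html_strip emits.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and discards tags; the parser decodes character entities."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    """Return the text content of `markup` with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text the parser is still buffering
    return " ".join("".join(parser.chunks).split())


# The body value from the sample document in the answer.
body = ("Title <p> Some d&eacute;j&agrave; vu <a href='http://somedomain.com'> "
        "website </a> <span> this is inline </span></p> ")
print(strip_html(body))
```

Applied to the _source returned by a get or search request, this leaves the index untouched and keeps formatting decisions in the client. Note that this sketch also keeps the text inside script and style elements, so real markup may need extra filtering.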

The original question, "html - Elasticsearch : Strip HTML tags before indexing docs with html_strip filter not working", is on Stack Overflow: https://stackoverflow.com/questions/37351900/
