
solr - Multi-word synonyms in Solr

Reposted. Author: 行者123. Updated: 2023-12-04 12:54:25

I am trying to implement multi-word synonyms in Solr, specifically rules of this type:

msc divina => divina

So if a user enters "msc divina", Solr should return only the results for "divina".

The definition in schema.xml looks like this:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100" 
autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.SynonymFilterFactory"
synonyms="synonyms_de.txt"
ignoreCase="true"
expand="false" />
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords_de.txt"
enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="0"
splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.KeywordMarkerFilterFactory"
protected="protwords_de.txt" />
<filter class="solr.SnowballPorterFilterFactory" language="German2" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords_de.txt"
enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.KeywordMarkerFilterFactory"
protected="protwords_de.txt" />
<filter class="solr.SnowballPorterFilterFactory" language="German2" />
</analyzer>
</fieldType>

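It does not work. If I add the synonym filter to the query analyzer, a search for "msc divina" returns every match for "msc" and for "divina".

How can I solve this?

Best answer

Starting with Solr 6.4, you need to use solr.SynonymGraphFilterFactory for multi-word synonyms. The Solr Reference Guide describes it as follows: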

This filter maps single- or multi-token synonyms, producing a fully correct graph output. This filter is a replacement for the Synonym Filter, which produces incorrect graphs for multi-token synonyms.

If you use this filter during indexing, you must follow it with a Flatten Graph Filter to squash tokens on top of one another like the Synonym Filter, because the indexer can’t directly consume a graph. To get fully correct positional queries when your synonym replacements are multiple tokens, you should instead apply synonyms using this filter at query time.
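
Applied to the text_de field type from the question, the query-time advice quoted above might look roughly like the sketch below. It is an adaptation, not a tested configuration: it reuses the file names from the question, puts the graph filter first in the chain, and leaves out the enablePositionIncrements attribute on the assumption that the Solr versions that ship SynonymGraphFilterFactory no longer accept it.

```xml
<!-- Sketch only: query-side analyzer for the text_de field type from the question. -->
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <!-- Graph-aware synonym filter; no FlattenGraphFilterFactory is needed at query time. -->
  <filter class="solr.SynonymGraphFilterFactory"
          synonyms="synonyms_de.txt"
          ignoreCase="true"
          expand="false" />
  <!-- Remaining filters are copied from the question. enablePositionIncrements is
       omitted here (assumption: newer Solr versions reject it). -->
  <filter class="solr.StopFilterFactory"
          ignoreCase="true"
          words="stopwords_de.txt" />
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="0"
          catenateNumbers="0"
          catenateAll="0"
          splitOnCaseChange="1" />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.KeywordMarkerFilterFactory"
          protected="protwords_de.txt" />
  <filter class="solr.SnowballPorterFilterFactory" language="German2" />
</analyzer>
```

With the rule msc divina => divina in synonyms_de.txt and expand="false", the intent is that the two-token input is rewritten to the single token divina before the later filters run.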



Example of an index-time analyzer:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/>
<filter class="solr.FlattenGraphFilterFactory"/> <!-- required on index analyzers after graph filters -->
</analyzer>

Because the token stream is now a graph, it contains the proper arcs for multi-word synonyms from a file such as:
fast → speedy
wi fi → wifi
wi fi network → hotspot


In this case, multi-word synonyms will work correctly.
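
Whether a multi-word rule can match at query time also depends on the query parser handing the whole phrase to the analyzer instead of splitting it on whitespace first, which is the likely cause of the "every match for msc and divina" behaviour described in the question. A minimal solrconfig.xml sketch, assuming a Solr version whose query parsers support the sow (split on whitespace) parameter; the handler name and the choice of edismax are illustrative assumptions, not taken from the question:

```xml
<!-- solrconfig.xml sketch: handler name and parser choice are assumptions. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Do not pre-split the query on whitespace, so that "msc divina"
         reaches the field's query analyzer as one piece of text. -->
    <str name="sow">false</str>
  </lst>
</requestHandler>
```

The same parameter can be passed per request (for example sow=false in the query string) if changing the handler defaults is not desirable.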

Reference: McCandless blog post - http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

A similar question about multi-word synonyms in Solr can be found on Stack Overflow: https://stackoverflow.com/questions/19927537/
