
Solr search with and without hyphens

Reposted. Author: 行者123. Updated: 2023-12-01 03:44:07

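I'm having trouble getting relevant search results for a word that can be written with or without a hyphen. I created two documents, one containing "wifi" and one containing "wi-fi" in the "text" field.

Searching for "wifi" returns both documents, which is what I want. Searching for "wi-fi" returns only the document containing "wi-fi".

Here is my configuration: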

<field name="text" type="text" indexed="true" stored="true" omitNorms="true" />

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

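Here is the output of the analysis tool: https://www.evernote.com/shard/s7/sh/f1bab83a-7fd5-4bf3-9e67-239ea0c71441/98b1103577638734fb9335f755591b82/deep/0/Solr-Admin-(jeanfrancoiscote.egzakt.com).png

Here is the debug output for the query "wi-fi". I don't understand why it doesn't find both documents: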
<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="debugQuery">true</str>
<str name="indent">true</str>
<str name="q">wi-fi</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<int name="id">1869</int>
<str name="route">@sujet_simple?sujet_id=1869&amp;slug=wi-fi</str>
<str name="name">Wi-fi</str>
<str name="text">&lt;p&gt;
Wi-fi&lt;/p&gt;
</str>
<long name="_version_">1493472450933948416</long></doc>
</result>
<lst name="debug">
<str name="rawquerystring">wi-fi</str>
<str name="querystring">wi-fi</str>
<str name="parsedquery">MultiPhraseQuery(text:"(wi-fi wi) (fi wifi)")</str>
<str name="parsedquery_toString">text:"(wi-fi wi) (fi wifi)"</str>
<lst name="explain">
<str name="1869">
30.33298 = (MATCH) weight(text:"(wi-fi wi) (fi wifi)" in 0) [DefaultSimilarity], result of:
30.33298 = score(doc=0,freq=1.0 = phraseFreq=1.0
), product of:
0.99999994 = queryWeight, product of:
30.332981 = idf(), sum of:
7.684612 = idf(docFreq=1, maxDocs=1600)
7.684612 = idf(docFreq=1, maxDocs=1600)
7.684612 = idf(docFreq=1, maxDocs=1600)
7.2791467 = idf(docFreq=2, maxDocs=1600)
0.032967415 = queryNorm
30.332981 = fieldWeight in 0, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = phraseFreq=1.0
30.332981 = idf(), sum of:
7.684612 = idf(docFreq=1, maxDocs=1600)
7.684612 = idf(docFreq=1, maxDocs=1600)
7.684612 = idf(docFreq=1, maxDocs=1600)
7.2791467 = idf(docFreq=2, maxDocs=1600)
1.0 = fieldNorm(doc=0)
</str>
</lst>
<str name="QParser">LuceneQParser</str>
<lst name="timing">
<double name="time">1.0</double>
<lst name="prepare">
<double name="time">0.0</double>
<lst name="query">
<double name="time">0.0</double>
</lst>
<lst name="facet">
<double name="time">0.0</double>
</lst>
<lst name="mlt">
<double name="time">0.0</double>
</lst>
<lst name="highlight">
<double name="time">0.0</double>
</lst>
<lst name="stats">
<double name="time">0.0</double>
</lst>
<lst name="debug">
<double name="time">0.0</double>
</lst>
</lst>
<lst name="process">
<double name="time">1.0</double>
<lst name="query">
<double name="time">0.0</double>
</lst>
<lst name="facet">
<double name="time">0.0</double>
</lst>
<lst name="mlt">
<double name="time">0.0</double>
</lst>
<lst name="highlight">
<double name="time">0.0</double>
</lst>
<lst name="stats">
<double name="time">0.0</double>
</lst>
<lst name="debug">
<double name="time">1.0</double>
</lst>
</lst>
</lst>
</lst>
</response>

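Thanks for your help.

Best answer

You need to adjust the analysis side of your schema. debugQuery=true and the Solr analysis tool are your friends for tracking down this kind of problem.

With your configuration, searching for wifi produces the following query: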

wifi
"parsedquery_toString": "text:wifi",

And for wi-fi:
wi-fi
"parsedquery_toString": "text:\"(wi-fi wi) (fi wifi)\"",

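The analysis side of our configuration generates non-matching terms for wi-fi. The parsed query is a phrase: a document must contain wi-fi or wi at one position, immediately followed by fi or wifi at the next. A document whose only token is wifi cannot satisfy that phrase, so it is not returned.

If we change the filter on the query side of the analysis so that it does not generate word parts: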
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
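
To show where that line would sit, here is a sketch of the query analyzer from the question with only generateWordParts changed. It assumes the change is made in the query analyzer alone and that the index analyzer stays exactly as posted; the answer itself shows only the single filter line, and a Solr core reload would be needed for the change to take effect.

```xml
<!-- Sketch: query analyzer of the "text" field type, copied from the question.
     Only generateWordParts differs (1 -> 0); everything else is unchanged.
     Assumption: the index analyzer is left as in the question. -->
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory" />
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <!-- generateWordParts="0": do not emit the separate parts "wi" and "fi" at query time;
       preserveOriginal and catenateWords still emit "wi-fi" and "wifi" -->
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```

Because the index side still produces wi-fi, wi, fi and wifi for a hyphenated document, no reindexing should be required under this assumption.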

we get the following query for wifi:
"parsedquery_toString": "text:wifi",

and for wi-fi:
"parsedquery_toString": "text:wi-fi text:wifi"

These match the index terms that the analysis tool shows for wi-fi and wifi, respectively:
wi-fi, wi, fi, wifi
wifi

Note: text is the default field in this example.
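
The parsed queries carry the text: prefix even though q names no field, which is what a default field does. One common way to set it is a df default on the request handler in solrconfig.xml. The fragment below is a sketch under that assumption: neither the question nor the answer shows solrconfig.xml, and older setups may declare the default in schema.xml with defaultSearchField instead.

```xml
<!-- Sketch (assumed, not shown in the source): make "text" the default
     search field for the /select handler in solrconfig.xml. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- df = default field used when the query does not name one -->
    <str name="df">text</str>
  </lst>
</requestHandler>
```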

For more on Solr search with and without hyphens, see the similar question on Stack Overflow: https://stackoverflow.com/questions/28592005/
