gpt4 book ai didi

search - Solr 1.4 和 EdgeNGrams 的奇怪结果 - 有些子串匹配,有些不匹配

转载 作者:行者123 更新时间:2023-12-01 04:09:46 26 4
gpt4 key购买 nike

编辑 3 :我现在使用的解决方法是从我的查询和索引字段中去除除字母、数字和空格以外的任何内容。这会产生所需的行为,但它在很大程度上是一种解决方法而不是真正的解决方案,而且我仍然想了解 Solr 为什么正在做它正在做的事情......所以仍然对答案感兴趣,如果有人有答案的话。 结束编辑 3

我有一个名为“TT-14B”的文档,由 Solr 1.4(通过 Django/Haystack)索引。当我查询 content_auto “tt-1”或“tt 14”或“tt 14b”字段我取回文件;当我查询“tt-14”或“tt-14b”时,我没有得到任何结果。我对 Haystack 生成的 Solr 架构进行了一些编辑以尝试解决此问题,但无济于事。使用analyze.jsp,在我看来我应该匹配“tt-14”;我当然应该为“tt-14b”买一个。 ( 编辑 :哦,将默认运算符从 AND 更改为 OR 无济于事。)

有人可以帮助我理解为什么这不起作用吗?谢谢。

...

结果

QUERY  | WORKS
=======|======
tt | yes
tt- | yes
tt-1 | yes
tt-14 | no
tt-14b | no
tt 14 | yes
tt 14b | yes

编辑 2

得到了一些比较奇怪的结果,可能有助于调试问题。在这种情况下,测试文档是“abc'def”。
QUERY   | WORKS
========|======
abc | yes
abc'd | yes
abc'de | no
abc'def | no

显然,相同的模式,但我不明白是什么原因造成的。

结束编辑 2

schema.xml 相关部分(完整文件如下)
<fieldType name="edge_ngram" class="solr.TextField" positionIncrementGap="1">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnNumerics="0" preserveOriginal="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15" side="front" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnNumerics="0" preserveOriginal="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
</analyzer>
</fieldType>

schema.xml(完整)
<?xml version="1.0" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

<schema name="default" version="1.1">
<types>
<fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>

<!-- Numeric field types that manipulate the value into
a string value that isn't human-readable in its internal form,
but with a lexicographic ordering the same as the numeric ordering,
so that range queries work correctly. -->
<fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/>

<fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnNumerics="0" preserveOriginal="1" catenateWords="1" catenateNumbers="1" catenateAll="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnNumerics="0" preserveOriginal="1" catenateWords="0" catenateNumbers="0" catenateAll="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>

<fieldType name="ngram" class="solr.TextField" >
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

<fieldType name="edge_ngram" class="solr.TextField" positionIncrementGap="1">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnNumerics="0" preserveOriginal="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15" side="front" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnNumerics="0" preserveOriginal="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
</analyzer>
</fieldType>
</types>

<fields>
<!-- general -->
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
<field name="django_ct" type="string" indexed="true" stored="true" multiValued="false" />
<field name="django_id" type="string" indexed="true" stored="true" multiValued="false" />

<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_l" type="slong" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_f" type="sfloat" indexed="true" stored="true"/>
<dynamicField name="*_d" type="sdouble" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>


<field name="modelname_exact" type="string" indexed="true" stored="true" multiValued="false" />

<field name="modelname" type="text" indexed="true" stored="true" multiValued="false" />

<field name="name" type="text" indexed="true" stored="true" multiValued="false" />

<field name="text" type="text" indexed="true" stored="true" multiValued="false" />

<field name="name_exact" type="string" indexed="true" stored="true" multiValued="false" />

<field name="content_auto" type="edge_ngram" indexed="true" stored="true" multiValued="true" />

</fields>

<!-- field to use to determine and enforce document uniqueness. -->
<uniqueKey>id</uniqueKey>

<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<defaultSearchField>text</defaultSearchField>

<!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
<solrQueryParser defaultOperator="AND" />
</schema>

最佳答案

每个案例的/admin/analysis.jsp 的屏幕截图会很有趣。

有没有原因,为什么positionIncrementGap="1"设置为1?
tt-14btt 14b由于 whitspace 标记器,处理方式不同。
这意味着:tt-14b 是一项,只要 WordDelimiterFilterFactory不会触发,而 tt 14b从一开始是2个术语。positionIncrementGap即使没有邻居,但在下一个“n”位置,您也可以将不同的术语视为一个短语。所以尽量升positionIncrementGap .

顺便说一句:我在您的 schema.xml 上注意到的第一个是查询时缺少的“EdgeNGramFilterFactory”。哪个应该没问题。但是也有可以理解的原因,为什么将“查询和索引时间的相同过滤器”作为最佳实践来处理。
这取决于每种特殊情况,但在查询时激活此过滤器将是一个尝试。

关于search - Solr 1.4 和 EdgeNGrams 的奇怪结果 - 有些子串匹配,有些不匹配,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/6938226/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com