
lucene - Luke reveals unknown term values for a numeric field in the index

Reposted. Author: 行者123. Last updated: 2023-12-01 01:16:19

We use Lucene.net for indexing. One of the fields we index is a numeric field with values 1 to 6, plus 9999 meaning "not set".

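When we explore the index with Luke, we see terms that we do not recognize. The index contains 38673 documents in total, and Luke shows the following top-ranking terms for this field: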

Term | Rank  | Field | Text | Text (decoded as numeric-int)
1 | 38673 | Axis | x | 0
2 | 38673 | Axis | p | 0
3 | 38673 | Axis | t | 0
4 | 38673 | Axis | | | 0
5 | 19421 | Axis | l | 0
6 | 19421 | Axis | h | 0
7 | 19421 | Axis | d@ | 0
8 | 19252 | Axis | ` N | 9999
9 | 19252 | Axis | l | 8192
10 | 19252 | Axis | h ' | 9984
11 | 19252 | Axis | d@ p | 9984
12 | 18209 | Axis | ` | 4
13 | 950 | Axis | ` | 1
14 | 116 | Axis | ` | 5
15 | 102 | Axis | ` | 6
16 | 26 | Axis | ` | 3
17 | 18 | Axis | ` | 2

We see the same pattern for other numeric fields.

Where do the unknown values come from?

Best answer

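A NumericField is indexed as a trie structure. The terms you see are part of that structure, but if you query for them directly, no results are returned.

Try indexing your NumericField with a precisionStep of Int32.MaxValue and these values will disappear.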

NumericField documentation

... Within Lucene, each numeric value is indexed as a trie structure, where each term is logically assigned to larger and larger pre-defined brackets (which are simply lower-precision representations of the value). The step size between each successive bracket is called the precisionStep, measured in bits. Smaller precisionStep values result in larger number of brackets, which consumes more disk space in the index but may result in faster range search performance. The default value, 4, was selected for a reasonable tradeoff of disk space consumption versus performance. You can use the expert constructor NumericField(String,int,Field.Store,boolean) if you'd like to change the value. Note that you must also specify a congruent value when creating NumericRangeQuery or NumericRangeFilter. For low cardinality fields larger precision steps are good. If the cardinality is < 100, it is fair to use Integer.MAX_VALUE, which produces one term per value. ...
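To make the bracket idea concrete, here is a minimal sketch in plain Java with no Lucene dependency. It lists the lower-precision values that a non-negative int would produce for a given precisionStep. It assumes the layout used by Lucene's NumericUtils, where a term at a given shift drops that many low-order bits and its first character is 0x60 plus the shift. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class TrieBrackets {
    // Assumption: the first character of an int term is 0x60 + shift,
    // as in Lucene's NumericUtils.SHIFT_START_INT.
    public static final int SHIFT_START_INT = 0x60;

    /** Value a tool like Luke would decode for the term at this shift (non-negative ints only). */
    public static int bracketValue(int value, int shift) {
        return (value >>> shift) << shift; // clear the lowest `shift` bits
    }

    /** Prefix character that marks the precision level of a term. */
    public static char prefixChar(int shift) {
        return (char) (SHIFT_START_INT + shift);
    }

    /** One decoded value per trie term, from full precision down to the coarsest bracket. */
    public static List<Integer> bracketValues(int value, int precisionStep) {
        List<Integer> out = new ArrayList<>();
        for (int shift = 0; shift < 32; shift += precisionStep) {
            out.add(bracketValue(value, shift));
        }
        return out;
    }

    public static void main(String[] args) {
        int value = 9999;
        for (int shift = 0; shift < 32; shift += 4) {
            System.out.println("shift=" + shift
                    + " prefix='" + prefixChar(shift) + "'"
                    + " decoded=" + bracketValue(value, shift));
        }
    }
}
```

For 9999 with a step of 4 this gives 9999, 9984, 9984, 8192 and then 0 for the four coarsest levels, and the prefix characters come out as `, d, h, l, p, t, x and |. That lines up with the Text and decoded columns in the Luke table from the question.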



More details on the available precision steps, from the NumericRangeQuery documentation:

Good values for precisionStep are depending on usage and data type:

• The default for all data types is 4, which is used, when no precisionStep is given.

• Ideal value in most cases for 64 bit data types (long, double) is 6 or 8.

• Ideal value in most cases for 32 bit data types (int, float) is 4.

• For low cardinality fields larger precision steps are good. If the cardinality is < 100, it is fair to use Integer.MAX_VALUE (see below).

• Steps ≥64 for long/double and ≥32 for int/float produces one token per value in the index and querying is as slow as a conventional TermRangeQuery. But it can be used to produce fields, that are solely used for sorting (in this case simply use Integer.MAX_VALUE as precisionStep). Using NumericFields for sorting is ideal, because building the field cache is much faster than with text-only numbers. These fields have one term per value and therefore also work with term enumeration for building distinct lists (e.g. facets / preselected values to search for). Sorting is also possible with range query optimized fields using one of the above precisionSteps.
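The effect of precisionStep on index size can be estimated with a small calculation. The sketch below, in plain Java with a hypothetical helper name, counts how many terms one value produces, assuming one term per shift level starting at 0 and increasing by precisionStep until the bit width of the type is reached.

```java
public class TermsPerValue {
    /**
     * Number of trie terms written for a single value.
     * valueBits is 32 for int/float and 64 for long/double.
     */
    public static int termsPerValue(int valueBits, int precisionStep) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        int count = 0;
        // One term per shift level: 0, step, 2*step, ... below valueBits.
        for (int shift = 0; shift < valueBits; shift += precisionStep) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("int,  step 4:         " + termsPerValue(32, 4));
        System.out.println("long, step 4:         " + termsPerValue(64, 4));
        System.out.println("long, step 8:         " + termsPerValue(64, 8));
        System.out.println("int,  step MAX_VALUE: " + termsPerValue(32, Integer.MAX_VALUE));
    }
}
```

With the default step of 4 an int field gets 8 terms per value, while any step of 32 or more for int (64 or more for long) collapses to a single term, which is why Integer.MAX_VALUE makes the extra terms disappear.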



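Edit

A small sample. In Luke, the resulting index shows terms with values such as 8192, 9984, 1792 and so on, but a range query that would include them returns no results: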

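// Namespaces this snippet needs (Lucene.Net 2.9-style API):
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

// No precisionStep given, so the default (4) applies and each value is indexed as several trie terms.
NumericField number = new NumericField("number", Field.Store.YES, true);
Field regular = new Field("normal", "", Field.Store.YES, Field.Index.ANALYZED);

// The last argument (true) creates a new index, overwriting any existing one.
IndexWriter iw = new IndexWriter(FSDirectory.GetDirectory("C:\\temp\\testnum"), new StandardAnalyzer(), true);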

Document doc = new Document();
doc.Add(number);
doc.Add(regular);

number.SetIntValue(1);
regular.SetValue("one");
iw.AddDocument(doc);

number.SetIntValue(2);
regular.SetValue("one");
iw.AddDocument(doc);

number.SetIntValue(13);
regular.SetValue("one");
iw.AddDocument(doc);

number.SetIntValue(2000);
regular.SetValue("one");
iw.AddDocument(doc);

number.SetIntValue(9999);
regular.SetValue("one");
iw.AddDocument(doc);

iw.Commit();

IndexSearcher searcher = new IndexSearcher(iw.GetReader());

NumericRangeQuery rangeQ = NumericRangeQuery.NewIntRange("number", 1, 2, true, true);
var docs = searcher.Search(rangeQ);
Console.WriteLine(docs.Length().ToString()); // prints 2

rangeQ = NumericRangeQuery.NewIntRange("number", 13, 13, true, true);
docs = searcher.Search(rangeQ);
Console.WriteLine(docs.Length().ToString()); // prints 1

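// 9999 lies just outside this range, so nothing matches, even though Luke
// shows terms for this field that decode to 9984.
rangeQ = NumericRangeQuery.NewIntRange("number", 9000, 9998, true, true);
docs = searcher.Search(rangeQ);
Console.WriteLine(docs.Length().ToString()); // prints 0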

Console.ReadLine();
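Why the last query finds nothing although a term decoding to 9984 exists: a lower-precision term stands for a whole block of values, and a range query can only use it when that block lies entirely inside the requested range. The following plain Java sketch illustrates that containment check for non-negative values. It is a simplification under that assumption, not Lucene's actual range-splitting code, and all names are hypothetical.

```java
public class BracketCoverage {
    /** Smallest value covered by the term that `value` produces at `shift`. */
    public static long bracketStart(long value, int shift) {
        return (value >> shift) << shift;
    }

    /** Largest value covered by the term that `value` produces at `shift`. */
    public static long bracketEnd(long value, int shift) {
        return bracketStart(value, shift) + (1L << shift) - 1;
    }

    /** True if the whole bracket fits inside the inclusive range [lower, upper]. */
    public static boolean usableFor(long value, int shift, long lower, long upper) {
        return lower <= bracketStart(value, shift) && bracketEnd(value, shift) <= upper;
    }

    public static void main(String[] args) {
        long value = 9999;
        for (int shift = 0; shift <= 12; shift += 4) {
            System.out.println("shift=" + shift
                    + " covers " + bracketStart(value, shift) + ".." + bracketEnd(value, shift)
                    + " usable for 9000..9998: " + usableFor(value, shift, 9000, 9998));
        }
    }
}
```

Every bracket that the document with value 9999 belongs to also contains 9999 itself, so none of them fits inside 9000..9998 and the query returns no hits.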

For more on "lucene - Luke reveals unknown term values for a numeric field in the index", see the similar question on Stack Overflow: https://stackoverflow.com/questions/11883066/
