
java - Why is the Lucene index so large?

Reposted. Author: 行者123. Updated: 2023-12-01 20:56:22

I store documents in a Lucene instance as follows:

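Document doc = new Document();
// StringField: indexed as a single untokenized term; Store.YES also keeps the value as a stored field.
doc.add(new StringField("title", processor.title, Field.Store.YES));
doc.add(new StringField("annotation", processor.annotation, Field.Store.YES));
// TextField: tokenized for full-text search; Store.NO means the original text is not kept as a stored field.
doc.add(new TextField("text", processor.text, Field.Store.NO));
w.addDocument(doc); // w is presumably an IndexWriter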

I don't need the full text stored in the index; all I need is to be able to search the documents.

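The problem is that the resulting index is almost the same size as the original document set. That seems quite strange to me, as it should only store word frequencies. Why does this happen?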

Best answer

It seems quite strange to me as it should only store word frequencies.

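I think you misunderstand what is stored and how it is stored. The Lucene documentation explains the index file formats in detail. Quoting from the overview section: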

Each segment index maintains the following:

  • Field names. This contains the set of field names used in the index.

  • Stored Field values. This contains, for each document, a list of attribute-value pairs, where the attributes are field names. These are used to store auxiliary information about the document, such as its title, url, or an identifier to access a database. The set of stored fields are what is returned for each hit when searching. This is keyed by document number.

  • Term dictionary. A dictionary containing all of the terms used in all of the indexed fields of all of the documents. The dictionary also contains the number of documents which contain the term, and pointers to the term's frequency and proximity data.

  • Term Frequency data. For each term in the dictionary, the numbers of all the documents that contain that term, and the frequency of the term in that document if omitTf is false.

  • Term Proximity data. For each term in the dictionary, the positions that the term occurs in each document. Note that this will not exist if all fields in all documents set omitTf to true.

  • Normalization factors. For each field in each document, a value is stored that is multiplied into the score for hits on that field.

  • Term Vectors. For each field in each document, the term vector (sometimes called document vector) may be stored. A term vector consists of term text and term frequency. To add Term Vectors to your index see the Field constructors

  • Deleted documents. An optional file indicating which documents are deleted.

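Some of the above are optional and may not be present in your index. However, a minimal index will contain the "field names", "stored field values", "term dictionary" and "term frequency data".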

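Some of these data structures scale with the number of distinct words in your corpus. Others scale with the number of documents, or with the number of unique words per document.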
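To make these scaling factors concrete, here is a minimal sketch of a toy in-memory inverted index in plain Java. It is not Lucene's implementation or on-disk format; the class name `ToyInvertedIndex` and its methods are hypothetical and exist only for illustration. It counts three quantities that loosely correspond to the term dictionary, the term frequency data and the term proximity data described in the quote.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index: term -> (docId -> positions). Illustration only, not Lucene's format. */
public class ToyInvertedIndex {
    // The keys play the role of a term dictionary; the values hold per-document positions.
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Lowercases the text, splits it on non-letter/non-digit runs, and records each token position. */
    public void addDocument(int docId, String text) {
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) continue; // split() can yield a leading empty string
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
    }

    /** Distinct terms: grows with the vocabulary of the corpus. */
    public int distinctTerms() {
        return postings.size();
    }

    /** (term, document) pairs: one frequency entry per unique word per document. */
    public int termDocPairs() {
        return postings.values().stream().mapToInt(Map::size).sum();
    }

    /** Position entries: one per token occurrence, so proportional to total text length. */
    public int positionEntries() {
        return postings.values().stream()
                .flatMap(perDoc -> perDoc.values().stream())
                .mapToInt(List::size)
                .sum();
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument(0, "the quick brown fox jumps over the lazy dog");
        index.addDocument(1, "the dog barks");
        System.out.println("distinct terms:   " + index.distinctTerms());
        System.out.println("term-doc pairs:   " + index.termDocPairs());
        System.out.println("position entries: " + index.positionEntries());
    }
}
```

For the two sample documents this reports 9 distinct terms, 11 term-document pairs and 12 position entries. The position count equals the total number of tokens, which is one reason an index that records positions can approach the size of the text it was built from, even when the text itself is not stored.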

If you are populating the index with a single (relatively) small document, then some of these scaling factors will work against you.

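Finally, the physical representation of the index segments is designed and optimized primarily for fast search, not for minimal storage space. That affects the "information density" and hence the storage space actually used.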
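One way to see where the space actually goes is to total the files in the index directory by extension. The sketch below uses only the Java standard library; the class name `IndexSizeByExtension` is hypothetical, and it assumes the index lives in an ordinary filesystem directory. Which extension corresponds to which structure depends on the Lucene version and codec, so check the file formats documentation for your version. If the segments are packed into compound files, the breakdown will be much coarser.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

/** Sums the sizes of the regular files in one directory, grouped by file extension. */
public class IndexSizeByExtension {

    public static Map<String, Long> sizeByExtension(Path dir) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (Stream<Path> files = Files.list(dir)) {
            for (Path file : (Iterable<Path>) files::iterator) {
                if (!Files.isRegularFile(file)) continue; // skip subdirectories
                String name = file.getFileName().toString();
                int dot = name.lastIndexOf('.');
                String ext = dot >= 0 ? name.substring(dot + 1) : "(none)";
                totals.merge(ext, Files.size(file), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the index directory; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        sizeByExtension(dir).forEach((ext, bytes) ->
                System.out.printf("%-8s %,d bytes%n", ext, bytes));
    }
}
```

Running it against the index directory and comparing the totals with the size of the source documents shows which structures dominate in a given index.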

Regarding "java - Why is the Lucene index so large?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42308630/
