java - How to get term frequency and doc frequency from an index generated by Lucene (version 5.3)

Reposted. Author: 行者123. Updated: 2023-11-30 10:52:23

I am trying to get the term frequency and document frequency from index files generated by Lucene (5.3). My implementation is shown below:

    private static void showIndex(String iNDEX_DIR2) throws IOException {
        // TODO Auto-generated method stub
        System.out.println("INDEX_DIR:" + iNDEX_DIR2);
        IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(iNDEX_DIR2).toPath()));

        int num_doc = reader.numDocs();
        System.out.println("number of docs: "+String.valueOf(num_doc));
        for(int docNum=0; docNum<num_doc; docNum++){
            Document doc = reader.document(docNum);
            System.out.println("Processing file:"+doc.get("id"));

            System.out.println("doc is null? "+ String.valueOf(doc==null));
            Terms termVector = reader.getTermVector(docNum, "content");
            TermsEnum itr = termVector.iterator();
            BytesRef term = null;

            while((term = itr.next()) != null){
                try{
                    String termText = term.utf8ToString();
                    Term termInstance = new Term("contents",term);
                    long termFreq = reader.totalTermFreq(termInstance);
                    long docCount = reader.docFreq(termInstance);

                    System.out.println("term: "+termText+", termFreq = "+termFreq+", docCount = "+docCount);
                }catch(Exception e){
                    System.out.println(e);
                }
            }
        }
    }

When I run the code snippet, I get this output:

INDEX_DIR:F:\Information Retrieval\project\TEST\INDEX
number of docs: 4
Processing file:null
doc is null? false
Exception in thread "main" java.lang.NullPointerException
at IndexManager.showIndex

However, the output shows that the document is not null.

Can someone help me solve this problem? Thank you very much!

Best answer

My guess is that the NPE is thrown at:

TermsEnum itr = termVector.iterator();

IndexReader.getTermVector returns null if the field was not indexed with term vectors. TextField, for example, does not store them.
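To illustrate the null return described above, here is a minimal sketch of the read loop with a guard around getTermVector. It assumes Lucene 5.3 on the classpath; the class name TermStatsDump and the helper collectTermStats are hypothetical names introduced for this example, not part of Lucene. The sketch also passes a single field name to both getTermVector and the Term lookup, because the question's code reads the vector from "content" but builds the Term on "contents", which may or may not be intentional.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class TermStatsDump {

    /** Hypothetical helper: returns one line per (document, term) for the given field. */
    static List<String> collectTermStats(IndexReader reader, String field) throws IOException {
        List<String> lines = new ArrayList<>();
        // maxDoc() bounds the doc IDs; numDocs() can be smaller once documents are deleted.
        for (int docNum = 0; docNum < reader.maxDoc(); docNum++) {
            Terms termVector = reader.getTermVector(docNum, field);
            if (termVector == null) {
                // No term vector was stored for this field in this document.
                lines.add("doc " + docNum + ": no term vector for field '" + field + "'");
                continue;
            }
            TermsEnum itr = termVector.iterator();
            BytesRef term;
            while ((term = itr.next()) != null) {
                // Look the term up in the same field the vector came from.
                Term termInstance = new Term(field, term);
                long termFreq = reader.totalTermFreq(termInstance); // occurrences across the index
                int docCount = reader.docFreq(termInstance);        // documents containing the term
                lines.add("term: " + term.utf8ToString()
                        + ", termFreq = " + termFreq + ", docCount = " + docCount);
            }
        }
        return lines;
    }
}
```

With this guard, a field indexed without term vectors produces a readable message instead of a NullPointerException, which makes the underlying indexing problem easier to spot.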

You enable term vectors for a field through its FieldType. If you need a TextField with term vectors, you can pass the TextField's FieldType to the FieldType constructor to create a mutable copy of it, for example:

FieldType myFieldType = new FieldType(TextField.TYPE_STORED);
myFieldType.setStoreTermVectors(true);

doc.add(new Field("contents", fieldContents, myFieldType));
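As a rough end-to-end check of the snippet above, the following sketch builds a one-document in-memory index and asks the reader whether each field has a term vector. It assumes Lucene 5.3 with lucene-core and lucene-analyzers-common on the classpath. The class name TermVectorIndexDemo, the buildIndex helper, the "title" field and the sample text are all made up for illustration.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class TermVectorIndexDemo {

    /** Hypothetical helper: indexes one document into an in-memory directory. */
    static Directory buildIndex() throws IOException {
        Directory dir = new RAMDirectory();

        // Mutable copy of TextField's type, with term vectors switched on.
        FieldType withVectors = new FieldType(TextField.TYPE_STORED);
        withVectors.setStoreTermVectors(true);
        withVectors.freeze(); // prevent further changes once configured

        try (IndexWriter writer =
                new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // "contents" carries term vectors; "title" is a plain TextField without them.
            doc.add(new Field("contents", "lucene stores term vectors on request", withVectors));
            doc.add(new TextField("title", "plain text field", Field.Store.YES));
            writer.addDocument(doc);
        } // closing the writer commits the document
        return dir;
    }

    public static void main(String[] args) throws IOException {
        try (IndexReader reader = DirectoryReader.open(buildIndex())) {
            System.out.println("contents has vector: "
                    + (reader.getTermVector(0, "contents") != null));
            System.out.println("title has vector: "
                    + (reader.getTermVector(0, "title") != null));
        }
    }
}
```

Term vectors are written at indexing time, so an index that was built without them has to be re-indexed with the new FieldType before getTermVector will return anything for that field.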

Regarding "java - How to get term frequency and doc frequency from an index generated by Lucene (version 5.3)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34368711/
