
java - Adding a weight to documents in Lucene 8

Reposted. Author: 行者123. Updated: 2023-11-30 01:44:22

I am currently building a small search engine for university using Lucene 8. I had built it before, but without applying any weights to the documents.

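I now need to add each document's PageRank as its weight, and I have already computed the PageRank values. How can I add a weight to a Document object (not to the query terms) in Lucene 8? I have looked at many solutions online, but they only work for older versions of Lucene.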

Here is my (updated) code, which builds a Document object from a File object:

public static Document getDocument(File f) throws FileNotFoundException, IOException {
    Document d = new Document();

    //adding a field
    FieldType contentType = new FieldType();
    contentType.setStored(true);
    contentType.setTokenized(true);
    contentType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    contentType.setStoreTermVectors(true);

    String fileContents = String.join(" ", Files.readAllLines(f.toPath(), StandardCharsets.UTF_8));
    d.add(new Field("content", fileContents, contentType));

    //adding other fields, then...

    //the boost coefficient (updated):
    double coef = 1.0 + ranks.get(path);
    d.add(new DoubleDocValuesField("boost", coef));

    return d;
}
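
A side note on the coefficient line above: it unboxes `ranks.get(path)` directly, which would throw a NullPointerException for any path without a PageRank entry, assuming `ranks` is a `Map<String, Double>` (the question does not show its type). A minimal sketch of a null-safe lookup follows; the class name, method name, and sample paths are hypothetical and not part of the asker's code.

```java
import java.util.HashMap;
import java.util.Map;

final class BoostCoefficients {
    private BoostCoefficients() {}

    /**
     * Returns 1.0 + PageRank for the given path, or the neutral value 1.0
     * when the map has no (or a null) entry for that path.
     */
    static double coefficientFor(Map<String, Double> ranks, String path) {
        Double rank = ranks.get(path);           // may be null for unknown paths
        return 1.0 + (rank == null ? 0.0 : rank);
    }

    public static void main(String[] args) {
        // Hypothetical PageRank values, for illustration only.
        Map<String, Double> ranks = new HashMap<>();
        ranks.put("docs/a.html", 0.25);

        System.out.println(coefficientFor(ranks, "docs/a.html"));
        System.out.println(coefficientFor(ranks, "docs/missing.html"));
    }
}
```

Falling back to 1.0 keeps an unranked document neutral if the coefficient is later used as a multiplier on the text score.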

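The problem with my current approach is that I would need a CustomScoreQuery object to search the documents, but that class is not available in Lucene 8. I also do not want to downgrade to Lucene 7 now, after having written all my code against Lucene 8.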


Edit:

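After some (lengthy) research, I added a DoubleDocValuesField holding the boost to each document (see the updated code above), and I now search with a FunctionScoreQuery as suggested by @EricLavault. However, every document's score is now exactly equal to its boost, regardless of the query! How do I fix this? Here is my search function: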

public static TopDocs search(String query, IndexSearcher searcher, String outputFile) {
    try {
        Query q_temp = buildQuery(query); //the original query, was working fine alone

        Query q = new FunctionScoreQuery(q_temp, DoubleValuesSource.fromDoubleField("boost")); //the new query
        q = q.rewrite(DirectoryReader.open(bm25IndexDir));
        TopDocs results = searcher.search(q, 10);

        ScoreDoc[] filterScoreDosArray = results.scoreDocs;
        for (int i = 0; i < filterScoreDosArray.length; ++i) {
            int docId = filterScoreDosArray[i].doc;
            Document d = searcher.doc(docId);

            //here, when printing, I see that the document's score is the same as its "boost" value. WHY??
            System.out.println((i + 1) + ". " + d.get("path") + " Score: " + filterScoreDosArray[i].score);
        }

        return results;
    }
    catch(Exception e) {
        e.printStackTrace();
        return null;
    }
}

//function that builds the query, working fine
public static Query buildQuery(String query) {
    try {
        PhraseQuery.Builder builder = new PhraseQuery.Builder();
        TokenStream tokenStream = new EnglishAnalyzer().tokenStream("content", query);
        tokenStream.reset();

        while (tokenStream.incrementToken()) {
            CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            builder.add(new Term("content", charTermAttribute.toString()));
        }

        tokenStream.end();
        tokenStream.close();
        builder.setSlop(1000);
        PhraseQuery q = builder.build();

        return q;
    }
    catch(Exception e) {
        e.printStackTrace();
        return null;
    }
}

Best answer

Starting with Lucene 6.5.0:

Index-time boosts are deprecated. As a replacement, index-time scoring factors should be indexed into a doc value field and combined at query time using eg. FunctionScoreQuery. (Adrien Grand)

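The recommendation is to stop using index-time boosts and instead encode scoring factors (i.e. length normalization factors) into doc values fields. (See LUCENE-6819.)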
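
The symptom described in the question (every score equals the stored boost) is consistent with the doc value replacing the text score instead of being combined with it. The sketch below is a plain-Java model of those two behaviours, not Lucene code; the class name, method names, and sample numbers are hypothetical and chosen only to show how the ranking changes.

```java
final class ScoreCombination {
    private ScoreCombination() {}

    /** Models a function score that uses only the stored value; the text score is discarded. */
    static double replaceWithValue(double textScore, double boost) {
        return boost;
    }

    /** Models combining at query time; the text score is scaled by the stored value. */
    static double multiplyByValue(double textScore, double boost) {
        return textScore * boost;
    }

    public static void main(String[] args) {
        // Hypothetical data: doc A matches the query well, doc B matches it poorly
        // but carries a slightly larger stored boost.
        String[] names = {"A", "B"};
        double[] textScores = {4.0, 0.5};
        double[] boosts = {1.25, 1.5};

        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i]
                    + " replaced=" + replaceWithValue(textScores[i], boosts[i])
                    + " multiplied=" + multiplyByValue(textScores[i], boosts[i]));
        }
    }
}
```

With replacement, B outranks A for every query because only the boost counts; with multiplication, A stays ahead because its text score still matters. In the Lucene 8 API, the multiplying behaviour is what `FunctionScoreQuery.boostByValue(q_temp, DoubleValuesSource.fromDoubleField("boost"))` provides, whereas the two-argument `FunctionScoreQuery` constructor used in the question takes the score from the value source alone.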

Regarding "java - Adding a weight to documents in Lucene 8", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58701267/
