
java - Lucene: terms closer to the beginning of the title are more important

Reposted. Author: 搜寻专家. Updated: 2023-10-30 21:29:23

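I understand how to boost a field at index time or at query time. But how can I boost the score of a match whose term is closer to the beginning of the title?

Example: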

Query = "lucene"

Doc1 title = "Lucene: Homepage"
Doc2 title = "I have a question about lucene?"

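I want the first document to score higher because "lucene" is closer to the beginning (ignoring term frequency for now).

I know how to use a SpanQuery to specify the proximity between terms, but I am not sure how to make use of a term's position within the field.

I am using Lucene 4.1 with Java.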

Best answer

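I would use a SpanFirstQuery, which matches terms near the beginning of a field. Like all span queries, it relies on term positions, which are enabled by default when indexing in Lucene.

Let's test it in isolation: you just provide your SpanTermQuery and the maximum position at which the term may be found (1 in my example).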

SpanTermQuery spanTermQuery = new SpanTermQuery(new Term("title", "lucene"));
SpanFirstQuery spanFirstQuery = new SpanFirstQuery(spanTermQuery, 1);

Given your two documents, and assuming they were analyzed with the StandardAnalyzer, this query will find only the first one, titled "Lucene: Homepage".
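To make the role of the second constructor argument concrete, here is a small plain-Java sketch that does not use Lucene at all. It assumes a simplified tokenizer (lowercase, split on non-alphanumeric characters) as a stand-in for StandardAnalyzer, and it models a term at zero-based position p as a span covering [p, p + 1), which passes when p + 1 <= end. The class and method names are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SpanFirstSketch {

    // Simplified stand-in for an analyzer (assumption): lowercase the text
    // and split it on anything that is not an ASCII letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // True if the term occurs at a zero-based position p with p + 1 <= end,
    // i.e. the one-token span [p, p + 1) ends no later than `end`.
    static boolean matchesWithin(String title, String term, int end) {
        List<String> tokens = tokenize(title);
        for (int p = 0; p < tokens.size() && p + 1 <= end; p++) {
            if (tokens.get(p).equals(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchesWithin("Lucene: Homepage", "lucene", 1));
        System.out.println(matchesWithin("I have a question about lucene?", "lucene", 1));
        System.out.println(matchesWithin("I have a question about lucene?", "lucene", 6));
    }
}
```

With end = 1 only the very first token can match, so the first title passes and the second does not. Raising end to 6 lets the second title match too, since "lucene" is its sixth token.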

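Now we can combine the SpanFirstQuery above with a normal text query so that the span query only influences the score. This is easy to do with a BooleanQuery: add the span query as a SHOULD clause, like this: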

Term term = new Term("title", "lucene");
TermQuery termQuery = new TermQuery(term);
SpanFirstQuery spanFirstQuery = new SpanFirstQuery(new SpanTermQuery(term), 1);
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(new BooleanClause(termQuery, BooleanClause.Occur.MUST));
booleanQuery.add(new BooleanClause(spanFirstQuery, BooleanClause.Occur.SHOULD));

There may be other ways to achieve the same goal, perhaps with a CustomScoreQuery or with custom scoring code, but this seems to me the simplest approach.
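One thing to keep in mind is that the SHOULD clause is all-or-nothing: a term either falls within the first `end` positions or earns no bonus at all. If a graded preference is wanted, a custom scoring route would need some function of the first match position. The plain-Java sketch below only illustrates what such a function could look like. The 1 / (1 + position) decay, the simplified tokenizer, and all names are assumptions made for this example. None of it is Lucene API, and it does not show how to wire the function into a CustomScoreQuery.

```java
import java.util.Locale;

class PositionBoostSketch {

    // Zero-based position of the first occurrence of `term`, or -1 if absent.
    // Tokenization is a simplified assumption: lowercase, split on non-alphanumerics.
    static int firstPosition(String title, String term) {
        int position = 0;
        for (String t : title.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (t.isEmpty()) {
                continue; // skip the empty token produced by a leading separator
            }
            if (t.equals(term)) {
                return position;
            }
            position++;
        }
        return -1;
    }

    // Hypothetical decay: 1.0 for the first token, shrinking as the match moves right.
    static double positionBoost(int position) {
        return position < 0 ? 0.0 : 1.0 / (1.0 + position);
    }

    public static void main(String[] args) {
        String[] titles = {"Lucene: Homepage", "I have a question about lucene?"};
        for (String title : titles) {
            int p = firstPosition(title, "lucene");
            System.out.println(title + " -> position " + p + ", boost " + positionBoost(p));
        }
    }
}
```

Here the first title gets the full boost because the term is at position 0, while the second title, with the term at position 5, gets one sixth of it. A different decay curve could be substituted without changing the idea.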

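The code I used to test it prints the following output (scores included), running the TermQuery on its own first, then the SpanFirstQuery on its own, and finally the combined BooleanQuery: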

------ TermQuery --------
Total hits: 2
title: I have a question about lucene - score: 0.26010898
title: Lucene: I have a really hard question about it - score: 0.22295055
------ SpanFirstQuery --------
Total hits: 1
title: Lucene: I have a really hard question about it - score: 0.15764984
------ BooleanQuery: TermQuery (MUST) + SpanFirstQuery (SHOULD) --------
Total hits: 2
title: Lucene: I have a really hard question about it - score: 0.26912516
title: I have a question about lucene - score: 0.09196242
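The combined scores can be reproduced from the two standalone runs with a little arithmetic. The sketch below rests on assumptions inferred from the numbers rather than stated in the answer: that Lucene 4.1's default TF-IDF similarity is in use, that a BooleanQuery sums the scores of its matching clauses and multiplies by a coordination factor (matching clauses divided by total clauses), and that the query norm shrinks by 1/sqrt(2) because both clauses carry the same term weight. The class name is hypothetical.

```java
import java.util.Locale;

class CombinedScoreSketch {

    // Scores printed by the standalone TermQuery and SpanFirstQuery runs.
    static final double TERM_DOC_A = 0.26010898; // "I have a question about lucene"
    static final double TERM_DOC_B = 0.22295055; // "Lucene: I have a really hard question about it"
    static final double SPAN_DOC_B = 0.15764984; // only this document matches the span query

    // Sum of matching clause scores, scaled by coord and by the query-norm change.
    // Assumes every clause has the same weight, so the norm ratio is 1/sqrt(total).
    static double combined(double sumOfClauseScores, int matched, int total) {
        double coord = (double) matched / total;
        double normRatio = 1.0 / Math.sqrt(total);
        return sumOfClauseScores * coord * normRatio;
    }

    public static void main(String[] args) {
        // Document B matches both clauses, document A only the MUST clause.
        System.out.printf(Locale.ROOT, "%.6f%n", combined(TERM_DOC_B + SPAN_DOC_B, 2, 2));
        System.out.printf(Locale.ROOT, "%.6f%n", combined(TERM_DOC_A, 1, 2));
    }
}
```

This prints 0.269125 and 0.091962, which agree with the BooleanQuery scores above to the precision shown. It also explains why the document that misses the SHOULD clause drops so sharply: it loses the span score and is also halved by the coordination factor.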

The complete code follows:

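import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// The class name is arbitrary; the original answer shows only the methods.
public class SpanFirstDemo {

    public static void main(String[] args) throws Exception {
        // Note: the index is written to ./data on every run, so running this
        // twice without clearing the directory adds the documents again.
        Directory directory = FSDirectory.open(new File("data"));

        index(directory);

        IndexReader indexReader = DirectoryReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        Term term = new Term("title", "lucene");

        System.out.println("------ TermQuery --------");
        TermQuery termQuery = new TermQuery(term);
        search(indexSearcher, termQuery);

        // Matches only when the term is the first token of the field.
        System.out.println("------ SpanFirstQuery --------");
        SpanFirstQuery spanFirstQuery = new SpanFirstQuery(new SpanTermQuery(term), 1);
        search(indexSearcher, spanFirstQuery);

        // MUST decides which documents match; SHOULD only adds to the score.
        System.out.println("------ BooleanQuery: TermQuery (MUST) + SpanFirstQuery (SHOULD) --------");
        BooleanQuery booleanQuery = new BooleanQuery();
        booleanQuery.add(new BooleanClause(termQuery, BooleanClause.Occur.MUST));
        booleanQuery.add(new BooleanClause(spanFirstQuery, BooleanClause.Occur.SHOULD));
        search(indexSearcher, booleanQuery);

        indexReader.close();
    }

    private static void index(Directory directory) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_41, new StandardAnalyzer(Version.LUCENE_41));

        IndexWriter writer = new IndexWriter(directory, config);

        // Span queries need term positions, so index them explicitly.
        FieldType titleFieldType = new FieldType();
        titleFieldType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        titleFieldType.setIndexed(true);
        titleFieldType.setStored(true);

        Document document = new Document();
        document.add(new Field("title", "I have a question about lucene", titleFieldType));
        writer.addDocument(document);

        document = new Document();
        document.add(new Field("title", "Lucene: I have a really hard question about it", titleFieldType));
        writer.addDocument(document);

        writer.close();
    }

    private static void search(IndexSearcher indexSearcher, Query query) throws Exception {
        TopDocs topDocs = indexSearcher.search(query, 10);

        System.out.println("Total hits: " + topDocs.totalHits);

        // Print every stored field of each hit together with its score.
        for (ScoreDoc hit : topDocs.scoreDocs) {
            Document result = indexSearcher.doc(hit.doc);
            for (IndexableField field : result) {
                System.out.println(field.name() + ": " + field.stringValue() + " - score: " + hit.score);
            }
        }
    }
}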

This question and answer (java - Lucene: terms closer to the beginning of the title are more important) come from a similar question on Stack Overflow: https://stackoverflow.com/questions/15155001/
