
Lucene custom scoring for numeric fields

Reposted. Author: 行者123. Updated: 2023-12-02 20:41:30

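In addition to standard term search with tf-idf similarity over a text content field, I would like to score documents on the "similarity" of a numeric field. This similarity depends on the distance between the value in the query and the value in the document (for example, a Gaussian with m = [user input], s = 0.5).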

That is, suppose the documents represent people, and each person document has two fields:

  • description (full text)
  • age (numeric)

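I want to find documents with a query like

description:(x y z) age:30

where age is not a filter but part of the score (for a 30-year-old the multiplier would be 1.0, for a 25-year-old 0.8, and so on).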
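
The multiplier described above can be sketched in plain Java, independent of Lucene. The class name `AgeGaussian`, the method `multiplier`, and the width of 7.5 years are assumptions made for illustration only; 7.5 is chosen because it gives roughly 0.8 for a five-year gap, in line with the figures in the question.

```java
public class AgeGaussian {
    /**
     * Unnormalized Gaussian: exactly 1.0 when value == peak,
     * falling toward 0 as the distance from the peak grows.
     */
    public static double multiplier(double value, double peak, double sigma) {
        double d = value - peak;
        return Math.exp(-(d * d) / (2.0 * sigma * sigma));
    }

    public static void main(String[] args) {
        double peak = 30.0;  // age given in the query
        double sigma = 7.5;  // assumed width, not a value from the question
        for (int age : new int[] {30, 25, 20}) {
            System.out.printf("age %d -> multiplier %.3f%n", age, multiplier(age, peak, sigma));
        }
    }
}
```

A smaller `sigma` makes the score fall off faster around the peak, so the width effectively controls how strictly age has to match.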

Can this be achieved in a reasonable way?

EDIT: In the end I found that this can be done by wrapping a ValueSourceQuery and a TermQuery in a CustomScoreQuery. See my solution below.

EDIT 2: Since Lucene versions change quickly, I just want to add that this was tested on Lucene 3.0 (Java).

Best answer

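OK, here is a (somewhat lengthy) proof of concept as a complete JUnit test. I have not tested its efficiency on large indexes, but from what I have read it should perform well after warm-up, provided there is enough RAM available to cache the numeric field.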

package tests;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.IntFieldSource;
import org.apache.lucene.search.function.ValueSourceQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

import junit.framework.TestCase;

public class AgeAndContentScoreQueryTest extends TestCase
{
    /** Multiplies the content (tf-idf) score by a Gaussian-shaped age relevance. */
    public class AgeAndContentScoreQuery extends CustomScoreQuery
    {
        protected float peakX;
        protected float sigma;

        public AgeAndContentScoreQuery(Query subQuery, ValueSourceQuery valSrcQuery, float peakX, float sigma) {
            super(subQuery, valSrcQuery);
            this.setStrict(true); // do not normalize score values from ValueSourceQuery!
            this.peakX = peakX;   // age for which the age-relevance is best
            this.sigma = sigma;   // standard deviation of the age Gaussian, in years
        }

        @Override
        public float customScore(int doc, float subQueryScore, float valSrcScore) {
            // subQueryScore is the tf-idf score from the content query
            float contentScore = subQueryScore;

            // valSrcScore is the value of the date-of-birth field, represented as a float;
            // convert it to an age, then to a Gaussian-like age relevance score
            float x = (2011 - valSrcScore); // age; the reference year 2011 is hard-coded for this demo
            float d = x - peakX;            // distance from the preferred age
            // the denominator must be parenthesized: "/ 2*sigma*sigma" divides by 2, then multiplies by sigma^2
            float ageScore = (float) Math.exp(-(d * d) / (2 * sigma * sigma));

            float finalScore = ageScore * contentScore;

            System.out.println("#contentScore: " + contentScore);
            System.out.println("#dobValue: " + (int) valSrcScore);
            System.out.println("#ageScore: " + ageScore);
            System.out.println("#finalScore: " + finalScore);
            System.out.println("+++++++++++++++++");

            return finalScore;
        }
    }

    protected Directory directory;
    protected Analyzer analyzer = new WhitespaceAnalyzer();
    protected String fieldNameContent = "content";
    protected String fieldNameDOB = "dob";

    @Override
    protected void setUp() throws Exception
    {
        directory = new RAMDirectory();
        analyzer = new WhitespaceAnalyzer();

        // indexed documents
        String[] contents = {"foo baz1", "foo baz2 baz3", "baz4"};
        int[] dobs = {1991, 1981, 1987}; // date of birth

        IndexWriter writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        for (int i = 0; i < contents.length; i++)
        {
            Document doc = new Document();
            doc.add(new Field(fieldNameContent, contents[i], Field.Store.YES, Field.Index.ANALYZED)); // store & index
            doc.add(new NumericField(fieldNameDOB, Field.Store.YES, true).setIntValue(dobs[i])); // store & index
            writer.addDocument(doc);
        }
        writer.close();
    }

    public void testSearch() throws Exception
    {
        String inputTextQuery = "foo bar";
        float peak = 27.0f;   // preferred age
        float sigma = 10.0f;  // standard deviation in years

        QueryParser parser = new QueryParser(Version.LUCENE_30, fieldNameContent, analyzer);
        Query contentQuery = parser.parse(inputTextQuery);

        ValueSourceQuery dobQuery = new ValueSourceQuery(new IntFieldSource(fieldNameDOB));
        // or: FieldScoreQuery dobQuery = new FieldScoreQuery(fieldNameDOB, Type.INT);

        CustomScoreQuery finalQuery = new AgeAndContentScoreQuery(contentQuery, dobQuery, peak, sigma);

        IndexSearcher searcher = new IndexSearcher(directory);
        TopDocs docs = searcher.search(finalQuery, 10);

        System.out.println("\nDocuments found:\n");
        for (ScoreDoc match : docs.scoreDocs)
        {
            Document d = searcher.doc(match.doc);
            System.out.println("CONTENT: " + d.get(fieldNameContent));
            System.out.println("D.O.B.: " + d.get(fieldNameDOB));
            System.out.println("SCORE: " + match.score);
            System.out.println("-----------------");
        }
        searcher.close();
    }
}
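
One detail worth checking when writing a Gaussian exponent in Java: `/` and `*` have the same precedence and associate left to right, so the denominator 2σ² needs its own parentheses. A minimal sketch of the difference, where the class `GaussianExponent` and both method names are hypothetical and exist only for this illustration:

```java
public class GaussianExponent {
    /** Parsed as ((dSquared / 2) * sigma) * sigma, which is NOT dSquared / (2 * sigma^2). */
    public static double withoutParentheses(double dSquared, double sigma) {
        return dSquared / 2 * sigma * sigma;
    }

    /** The intended Gaussian exponent term: dSquared / (2 * sigma^2). */
    public static double withParentheses(double dSquared, double sigma) {
        return dSquared / (2 * sigma * sigma);
    }

    public static void main(String[] args) {
        double dSquared = 9.0; // squared distance from the peak, e.g. an age gap of 3 years
        double sigma = 10.0;   // assumed width, for illustration only
        System.out.println("without parentheses: " + withoutParentheses(dSquared, sigma));
        System.out.println("with parentheses:    " + withParentheses(dSquared, sigma));
    }
}
```

With these inputs the two expressions differ by a factor of sigma to the fourth power, so the unparenthesized form makes the score widen as `sigma` shrinks instead of narrowing.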

Regarding Lucene custom scoring for numeric fields, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/5924937/
