gpt4 book ai didi

lucene - 获取lucene中两个文档之间的余弦相似度

转载 作者:行者123 更新时间:2023-12-02 23:14:13 28 4
gpt4 key购买 nike

我已经在 Lucene 中建立了一个索引。我想在不指定查询的情况下,只是为了获得索引中两个文档之间的分数(余弦相似度或另一个距离?)。

例如,我从之前打开的 IndexReader 中获取 id 为 2 和 4 的文档。文档 d1 = ir.document(2);文档 d2 = ir.document(4);

如何获得这两个文档之间的余弦相似度?

谢谢

最佳答案

正如 Julia 指出的那样Sujit Pal's example非常有用但 Lucene 4 API 有了实质性的改变。这是为 Lucene 4 重写的版本。

import java.io.IOException;
import java.util.*;

import org.apache.commons.math3.linear.*;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.*;

public class CosineDocumentSimilarity {

public static final String CONTENT = "Content";

private final Set<String> terms = new HashSet<>();
private final RealVector v1;
private final RealVector v2;

CosineDocumentSimilarity(String s1, String s2) throws IOException {
Directory directory = createIndex(s1, s2);
IndexReader reader = DirectoryReader.open(directory);
Map<String, Integer> f1 = getTermFrequencies(reader, 0);
Map<String, Integer> f2 = getTermFrequencies(reader, 1);
reader.close();
v1 = toRealVector(f1);
v2 = toRealVector(f2);
}

Directory createIndex(String s1, String s2) throws IOException {
Directory directory = new RAMDirectory();
Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_CURRENT);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_CURRENT,
analyzer);
IndexWriter writer = new IndexWriter(directory, iwc);
addDocument(writer, s1);
addDocument(writer, s2);
writer.close();
return directory;
}

/* Indexed, tokenized, stored. */
public static final FieldType TYPE_STORED = new FieldType();

static {
TYPE_STORED.setIndexed(true);
TYPE_STORED.setTokenized(true);
TYPE_STORED.setStored(true);
TYPE_STORED.setStoreTermVectors(true);
TYPE_STORED.setStoreTermVectorPositions(true);
TYPE_STORED.freeze();
}

void addDocument(IndexWriter writer, String content) throws IOException {
Document doc = new Document();
Field field = new Field(CONTENT, content, TYPE_STORED);
doc.add(field);
writer.addDocument(doc);
}

double getCosineSimilarity() {
return (v1.dotProduct(v2)) / (v1.getNorm() * v2.getNorm());
}

public static double getCosineSimilarity(String s1, String s2)
throws IOException {
return new CosineDocumentSimilarity(s1, s2).getCosineSimilarity();
}

Map<String, Integer> getTermFrequencies(IndexReader reader, int docId)
throws IOException {
Terms vector = reader.getTermVector(docId, CONTENT);
TermsEnum termsEnum = null;
termsEnum = vector.iterator(termsEnum);
Map<String, Integer> frequencies = new HashMap<>();
BytesRef text = null;
while ((text = termsEnum.next()) != null) {
String term = text.utf8ToString();
int freq = (int) termsEnum.totalTermFreq();
frequencies.put(term, freq);
terms.add(term);
}
return frequencies;
}

RealVector toRealVector(Map<String, Integer> map) {
RealVector vector = new ArrayRealVector(terms.size());
int i = 0;
for (String term : terms) {
int value = map.containsKey(term) ? map.get(term) : 0;
vector.setEntry(i++, value);
}
return (RealVector) vector.mapDivide(vector.getL1Norm());
}
}

关于lucene - 获取lucene中两个文档之间的余弦相似度,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/1844194/

28 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com