
java - Calculating the percentage of a matching string in Lucene

Reposted. Author: 行者123. Updated: 2023-11-30 06:57:20

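I am using Lucene in my Java application to find matching strings in an index. I already take the top 5 documents from all the hits, but I would like to check or calculate what percentage of the original string is covered by each matching string. Is this possible in Lucene? Does Lucene provide a method for it? For example: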

    original string = I am good.
    matching string = am good.
    % of matching = 95

Best answer

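What do you mean by matching percentage? If you want to know how many words of the original text appear in the result document (in your case, 2 of 3 words), you can use term vectors to do the job: get the term vector for the field and document, iterate over its terms, and check whether each term occurs in the text you are looking for. Alternatively, if storage is not a concern, you can store the string, retrieve the full content, and do the math yourself. Current Lucene uses the vector space model to compute scores (this changes to BM25 as of version 6.x) and gives you the match score through ScoreDoc. Note that ScoreDoc returns a decimal score, not a percentage; use it if that is sufficient.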
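The word-overlap idea above can be sketched without Lucene at all. This is a minimal sketch, assuming whitespace tokenization with lowercasing (so punctuation stays attached to words, much as with WhitespaceAnalyzer); the class name TermOverlap and the method percentMatched are hypothetical names chosen for illustration, not part of Lucene.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class TermOverlap {

    // Splits on whitespace and lowercases; punctuation stays attached to the word.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Percentage of distinct words in 'original' that also occur in 'matched'.
    static double percentMatched(String original, String matched) {
        Set<String> originalTerms = tokens(original);
        if (originalTerms.isEmpty()) {
            return 0.0;
        }
        Set<String> matchedTerms = tokens(matched);
        int found = 0;
        for (String term : originalTerms) {
            if (matchedTerms.contains(term)) {
                found++;
            }
        }
        return 100.0 * found / originalTerms.size();
    }

    public static void main(String[] args) {
        // "am" and "good." are found, "i" is not: 2 of 3 words.
        System.out.println(percentMatched("I am good.", "am good."));
    }
}
```

For the strings in the question this gives 2 of 3 words, roughly 66.7%, not the 95 shown there, so which definition of "percentage" is wanted still has to be decided.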

If this does not answer your question, please provide more details, with samples, on how you want the calculation done.

Hope this helps.

PS: I wrote a simple test, so you can look at it and modify it to suit your needs:

package org.query;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.junit.Before;
import org.junit.Test;

import java.util.HashSet;
import java.util.Set;

/**
 * Created by ekamolid on 11/2/2015.
 */
public class LevenshteinTest {
    private RAMDirectory directory;
    private IndexSearcher searcher;
    private IndexReader reader;
    private Analyzer analyzer;

    @Before
    public void setUp() throws Exception {
        directory = new RAMDirectory();

        analyzer = new WhitespaceAnalyzer();
        IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyzer));

        // Index the field with term vectors so the terms of each hit can be read back later.
        Document doc = new Document();
        FieldType fieldType = new FieldType();
        fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
        fieldType.setStoreTermVectors(true);
        doc.add(new Field("f", "the quick brown fox jumps over the lazy dog", fieldType));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new Field("f", "the quick red fox jumps over the sleepy cat", fieldType));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new Field("f", "quiick caar went xyztz dog", fieldType));
        writer.addDocument(doc);

        writer.close();

        reader = DirectoryReader.open(directory);
        searcher = new IndexSearcher(reader);
    }

    // Case-insensitive Levenshtein distance.
    // Code is taken from http://rosettacode.org/wiki/Levenshtein_distance#Java
    public static int distance(String a, String b) {
        a = a.toLowerCase();
        b = b.toLowerCase();
        // i == 0
        int[] costs = new int[b.length() + 1];
        for (int j = 0; j < costs.length; j++)
            costs[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            // j == 0; nw = lev(i - 1, j)
            costs[0] = i;
            int nw = i - 1;
            for (int j = 1; j <= b.length(); j++) {
                int cj = Math.min(1 + Math.min(costs[j], costs[j - 1]), a.charAt(i - 1) == b.charAt(j - 1) ? nw : nw + 1);
                nw = costs[j];
                costs[j] = cj;
            }
        }
        return costs[b.length()];
    }


    @Test
    public void test1() throws Exception {
        String s = "quick caar dog";
        TokenStream tokenStream = analyzer.tokenStream("abc", s);
        TermToBytesRefAttribute termAttribute = tokenStream.getAttribute(TermToBytesRefAttribute.class);
        Set<String> stringSet = new HashSet<>();
        tokenStream.reset();
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        // Build one fuzzy clause per query term and remember the terms for later comparison.
        while (tokenStream.incrementToken()) {
            stringSet.add(termAttribute.getBytesRef().utf8ToString());
            Query query = new FuzzyQuery(new Term("f", termAttribute.getBytesRef().utf8ToString()), 2); //search only 2 edits
            builder.add(query, BooleanClause.Occur.SHOULD);
        }
        tokenStream.end();
        tokenStream.close();
        TopDocs hits = searcher.search(builder.build(), 10);
        int exactMatch = 0;
        int match1 = 0;
        int match2 = 0;
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            exactMatch = match1 = match2 = 0;
            // Walk the term vector of the hit and compare each term with the query terms.
            Terms terms = reader.getTermVector(scoreDoc.doc, "f");
            TermsEnum termsEnum = terms.iterator();
            while (true) {
                BytesRef bytesRef = termsEnum.next();
                if (bytesRef == null) {
                    break;
                }
                String str = bytesRef.utf8ToString();
                if (stringSet.contains(str)) {
                    exactMatch++;
                    continue;
                }
                for (String s1 : stringSet) {
                    int distance = distance(s1, str);
                    if (distance <= 1) {
                        match1++;
                    } else if (distance <= 2) {
                        match2++;
                    }
                }
            }
            System.out.print(" doc=" + scoreDoc.doc);
            System.out.print(" exactMatch=" + exactMatch);
            System.out.print(" match1=" + match1);
            System.out.println(" match2=" + match2);
        }
    }
}

The output I got is below. It was produced by the code as first posted, whose last print statement printed match1 under the match2 label, so the match2 column repeats the match1 value:

    doc=2 exactMatch=2 match1=1 match2=1
    doc=1 exactMatch=1 match1=0 match2=0
    doc=0 exactMatch=2 match1=0 match2=0

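This is working code that tells you how many terms match exactly, how many differ by 1 character, and how many differ by 2 characters. You can put your own logic there to calculate a percentage from those numbers. It may be somewhat slow because you iterate over the terms of each document, but as long as you limit the results to a fixed number (10 in the example), it should not be too slow.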
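One possible way to turn the three counters into a percentage is a weighted sum divided by the number of query terms. This is only a sketch under stated assumptions: the weights (1.0 for an exact match, 0.75 for one edit, 0.5 for two edits), the cap at 100, and the names MatchPercent and percent are all hypothetical choices, not something Lucene or the answer prescribes.

```java
public class MatchPercent {

    // Hypothetical weights: tune these to reflect how much a near match should count.
    static final double WEIGHT_EXACT = 1.0;
    static final double WEIGHT_ONE_EDIT = 0.75;
    static final double WEIGHT_TWO_EDITS = 0.5;

    // queryTerms is the number of distinct terms in the query string.
    static double percent(int queryTerms, int exactMatch, int match1, int match2) {
        if (queryTerms <= 0) {
            return 0.0;
        }
        double weighted = WEIGHT_EXACT * exactMatch
                + WEIGHT_ONE_EDIT * match1
                + WEIGHT_TWO_EDITS * match2;
        // The counters count (query term, document term) pairs, so they can add up
        // to more than queryTerms; cap the result at 100.
        return Math.min(100.0, 100.0 * weighted / queryTerms);
    }

    public static void main(String[] args) {
        // Illustrative counts: 3 query terms, 2 exact matches, 1 one-edit match.
        System.out.println(percent(3, 2, 1, 0));
    }
}
```

With these assumed weights, 2 exact matches and 1 one-edit match out of 3 query terms come to roughly 91.7%. The cap is needed because the loop in the test counts every close pair, not at most one per query term.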

Regarding "java - Calculating the percentage of a matching string in Lucene", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33522280/
