
elasticsearch - Getting field values in an Elasticsearch plugin for the documents returned in the results

Reposted. Author: 行者123. Updated: 2023-12-02 22:36:37

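My requirement is to search Elasticsearch for documents using fuzzy matching and then score each hit by comparing the document's value with the input string. For example, if the query returns 3 documents (doc: 1, 2, 3) and the constant input value is "Star Wars", the comparisons should be: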

doc:1, MovieName:"Star Wars" (compare ('Star Wars','Star Wars'))
doc:2, MovieName:"Starr Warz" (compare ('Star Wars','Starr Warz'))
doc:3, MovieName:"The Star Wars" (compare ('Star Wars','The Star Wars'))
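
The post does not show how MovieScorer.calculateScore compares the two strings. As a minimal sketch only, assuming a normalized Levenshtein similarity would be acceptable, the comparison could look like the following. The class NameSimilarity and its methods are hypothetical names introduced for illustration; they are not part of the asker's plugin.

```java
// Hypothetical stand-in for MovieScorer.calculateScore; the original post does not show it.
final class NameSimilarity {
    private NameSimilarity() {}

    /** Classic dynamic-programming Levenshtein edit distance, using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]: 1.0 for identical strings, lower as more edits are needed. */
    static float similarity(String input, String candidate) {
        int longest = Math.max(input.length(), candidate.length());
        if (longest == 0) {
            return 1.0f; // two empty strings are identical
        }
        return 1.0f - (float) editDistance(input, candidate) / longest;
    }
}
```

With this sketch, "Star Wars" against "Star Wars" scores 1.0, against "Starr Warz" 0.8 (2 edits over 10 characters), and against "The Star Wars" about 0.69 (4 edits over 13 characters). The comparison is case-sensitive; lowercasing both inputs first would be a reasonable variation.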

I found the following Elasticsearch rescore plugin example and implemented it to achieve the above.
https://github.com/elastic/elasticsearch/blob/6.2/plugins/examples/rescore/src/main/java/org/elasticsearch/example/rescore/ExampleRescoreBuilder.java

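I can pass the input "Star Wars" to the plugin and access it there, but I am having trouble getting the value of the MovieName field for the documents returned in the results (topDocs).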

My query:
GET movie-idx/_search?
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "fields": [
              "MovieName"
            ],
            "query": "Star Wars",
            "minimum_should_match": "61%",
            "fuzziness": 1,
            "_name": "fuzzy"
          }
        }
      ]
    }
  },
  "rescore": {
    "calculateMovieScore": {
      "MovieName": "Star Wars"
    }
  }
}

My rescorer class looks like this:
private static class DocsRescorer implements Rescorer {
    private static final DocsRescorer INSTANCE = new DocsRescorer();

    @Override
    public TopDocs rescore(TopDocs topDocs, IndexSearcher searcher, RescoreContext rescoreContext) throws IOException {
        DocRescoreContext context = (DocRescoreContext) rescoreContext;
        int end = Math.min(topDocs.scoreDocs.length, rescoreContext.getWindowSize());

        MovieScorer movieScorer = new MovieScorerBuilder()
            .withInputName(context.MovieName)
            .build();

        for (int i = 0; i < end; i++) {
            String name = <get MovieName values from actual document returned by topdocs>
            float score = movieScorer.calculateScore(name);
            topDocs.scoreDocs[i].score = score;
        }

        // Drop hits below the threshold, then order by score descending, ties by doc id ascending.
        List<ScoreDoc> scoreDocList = Stream.of(topDocs.scoreDocs)
            .filter((a) -> a.score >= context.threshold)
            .sorted((a, b) -> {
                if (a.score > b.score) {
                    return -1;
                }
                if (a.score < b.score) {
                    return 1;
                }
                // Safe because doc ids >= 0
                return a.doc - b.doc;
            })
            .collect(Collectors.toList());
        ScoreDoc[] scoreDocs = scoreDocList.toArray(new ScoreDoc[scoreDocList.size()]);
        topDocs.scoreDocs = scoreDocs;
        return topDocs;
    }

    @Override
    public Explanation explain(int topLevelDocId, IndexSearcher searcher, RescoreContext rescoreContext,
                               Explanation sourceExplanation) throws IOException {
        DocRescoreContext context = (DocRescoreContext) rescoreContext;
        // Note that this is inaccurate because it ignores factor field
        return Explanation.match(context.factor, "test", singletonList(sourceExplanation));
    }

    @Override
    public void extractTerms(IndexSearcher searcher, RescoreContext rescoreContext, Set<Term> termsSet) {
        // Since we don't use queries there are no terms to extract.
    }
}
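
The threshold filter and final ordering in rescore(...) can be tried out in isolation. The sketch below uses a hypothetical Hit class as a stand-in for Lucene's ScoreDoc so that it runs without Elasticsearch on the classpath. It illustrates the same ordering (score descending, ties broken by ascending doc id) with a Comparator chain; it is not the plugin's code.

```java
import java.util.Arrays;
import java.util.Comparator;

final class RescoreOrdering {
    private RescoreOrdering() {}

    /** Hypothetical minimal stand-in for Lucene's ScoreDoc: a doc id plus a mutable score. */
    static final class Hit {
        final int doc;
        float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Drops hits below the threshold, then orders by score descending, ties by doc id ascending. */
    static Hit[] filterAndSort(Hit[] hits, float threshold) {
        return Arrays.stream(hits)
            .filter(h -> h.score >= threshold)
            .sorted(Comparator.comparingDouble((Hit h) -> h.score)
                .reversed()
                .thenComparingInt(h -> h.doc))
            .toArray(Hit[]::new);
    }
}
```

For example, hits (doc 1, 1.0), (doc 2, 0.8), (doc 3, 0.69) and (doc 0, 0.8) with a threshold of 0.7 come back in the order doc 1, doc 0, doc 2, and doc 3 is dropped.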

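My understanding is that the plugin code executes once, receives the topDocs produced by the initial query (the fuzzy search in this case), and that the for (int i = 0; i < end; i++) loop visits each returned document. The part I need help with is this line: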
String name = <get MovieName value from actual document returned by topdocs>

Best answer

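I know it has been more than 2 years, but I ran into the same problem and found a solution, so I am posting it here. This was done for a Rescorer plugin on ES 7.8.0. The example I based it on is the grouping plugin.
It is a pile of code that I do not fully understand, but the main principle is that you need an IFD (IndexFieldData<?>) instance for the field you want to read. In my case I only needed the _id of each hit. It goes like this: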

  • 1) Prepare the IFD up front and pass it to the RescoreContext: in your class that extends RescoreContext, add a member that holds the IFD; call it "idField" (it is used in section 3).

    // In your RescorerBuilder subclass:
    @Override
    public RescoreContext innerBuildContext(int windowSize, QueryShardContext queryShardContext) throws IOException {
        return new MyRescoreContext(windowSize, queryShardContext.getForField(queryShardContext.fieldMapper("_id")));
    }

  • 2) Next, in the Rescorer itself (method rescore(...)):

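    2.1) First sort the hits by scoreDoc.doc:
    ScoreDoc[] hits = topDocs.scoreDocs;
    // This sorts topDocs.scoreDocs in place, so re-sort by score before returning.
    Arrays.sort(hits, Comparator.comparingInt((d) -> d.doc));
    2.2) Do the black magic (code I do not understand):
    // hit.doc is an index-wide doc id, while doc values are read per segment (leaf).
    // With the hits in ascending doc order, the leaves can be walked once, front to back.
    List<LeafReaderContext> readerContexts = searcher.getIndexReader().leaves();
    int currentReaderIx = -1;
    int currentReaderEndDoc = 0; // first doc id past the current segment
    LeafReaderContext currentReaderContext = null;

    // "end" is the window bound, as in the question: Math.min(hits.length, rescoreContext.getWindowSize())
    for (int i = 0; i < end; i++) {
        ScoreDoc hit = hits[i];

        // find segment that contains current document
        while (hit.doc >= currentReaderEndDoc) {
            currentReaderIx++;
            currentReaderContext = readerContexts.get(currentReaderIx);
            currentReaderEndDoc = currentReaderContext.docBase + currentReaderContext.reader().maxDoc();
        }

        // segment-local doc id, which is what the per-segment doc values expect
        int docId = hit.doc - currentReaderContext.docBase;

        // code from section 3 goes here //
    }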
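  • 3) Now, with this magic "docId", you can read the value from the IFD inside the for loop:
    // rescoreContext is assumed here to be the cast RescoreContext subclass that holds idField
    SortedBinaryDocValues values = rescoreContext.idField.load(currentReaderContext).getBytesValues();
    // advanceExact returns false when the document has no value for the field
    if (values.advanceExact(docId)) {
        String id = values.nextValue().utf8ToString();
    }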

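    In your case, instead of the _id field, get the IFD for the field you need, and inside the for loop build a HashMap from docId to the string value.
    Then use this map in the same for loop that applies the scores.
    Hope this helps everyone! This technique is completely undocumented and not explained anywhere!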
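
The segment walk from step 2.2 and the map from the last paragraph can be sketched without Lucene on the classpath. In the sketch below, Segment is a hypothetical stand-in for LeafReaderContext (its docBase field mirrors the real one, and the length of its values array plays the role of reader().maxDoc()). The map is keyed by the index-wide doc id (hit.doc) rather than the segment-local docId; that is an assumption on my part, made because local ids repeat across segments. As far as I know, Lucene also provides ReaderUtil.subIndex for finding the leaf that contains a given doc id, which could replace the hand-written while loop.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SegmentLookupSketch {
    private SegmentLookupSketch() {}

    /** Hypothetical stand-in for a LeafReaderContext plus the doc values of one field. */
    static final class Segment {
        final int docBase;     // index-wide id of the segment's first document
        final String[] values; // field value per segment-local doc id

        Segment(int docBase, String... values) {
            this.docBase = docBase;
            this.values = values;
        }
    }

    /** Returns index-wide doc id -> field value for the given hits. */
    static Map<Integer, String> loadValues(List<Segment> segments, int[] globalDocIds) {
        int[] sorted = globalDocIds.clone();
        Arrays.sort(sorted); // ascending ids let us walk the segments once, front to back

        Map<Integer, String> valuesByDoc = new HashMap<>();
        int segmentIx = -1;
        int segmentEndDoc = 0; // first index-wide id past the current segment
        Segment current = null;
        for (int globalDoc : sorted) {
            // advance until the current segment contains globalDoc
            while (globalDoc >= segmentEndDoc) {
                segmentIx++;
                current = segments.get(segmentIx);
                segmentEndDoc = current.docBase + current.values.length;
            }
            int localDoc = globalDoc - current.docBase; // segment-local id
            valuesByDoc.put(globalDoc, current.values[localDoc]);
        }
        return valuesByDoc;
    }
}
```

In the scoring loop the lookup then becomes valuesByDoc.get(topDocs.scoreDocs[i].doc), which works regardless of the order the hits are in at that point.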

Regarding "elasticsearch - Getting field values in an Elasticsearch plugin for the documents returned in the results", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49915098/
