
lucene - SpanNotQuery gives unexpected results (the exclude is ignored)

Reposted. Author: 行者123. Updated: 2023-12-02 22:20:42

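We have run into a problem using SpanNotQuery in Elasticsearch. It looks like the exclude part of the query is being ignored.

To reproduce the problem, I created a set of documents:

  1. fiets kopen
  2. fiets lopen
  3. harrie kopen
  4. harrie lopen
  5. harrie fiets
  6. kopen lopen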

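A SpanTermQuery for harrie returns documents (3, 4, 5).

A SpanTermQuery for kopen returns documents (1, 3, 6).

Now I want to combine these in a SpanNotQuery where the include is 'harrie' and the exclude is 'kopen'.

I expected the result to be (4, 5), but it is (3, 4, 5).

We have to use SpanQueries; this is only a small part of what we are struggling with.

I created a unit test using only Lucene to demonstrate the problem: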

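    import java.io.IOException;

    import org.apache.lucene.analysis.core.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexableField;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.spans.SpanNotQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;
    import org.junit.Test;

    public class LuceneTest {

        @Test
        public void test() throws Exception {
            RAMDirectory ram = new RAMDirectory();
            createAndFillIndex(ram);

            DirectoryReader directoryReader = DirectoryReader.open(ram);
            IndexSearcher searcher = new IndexSearcher(directoryReader);

            // Include spans matching "harrie", exclude spans matching "kopen".
            SpanQuery include = new SpanTermQuery(new Term("dummy", "harrie"));
            SpanQuery exclude = new SpanTermQuery(new Term("dummy", "kopen"));
            Query spanNot = new SpanNotQuery(include, exclude);

            // Print every hit as "<doc id>: <stored text>".
            TopDocs search = searcher.search(spanNot, 100);
            for (ScoreDoc scoreDoc : search.scoreDocs) {
                Document result = searcher.doc(scoreDoc.doc);
                String dummy = result.get("dummy");
                System.out.println(scoreDoc.doc + ": " + dummy);
            }
        }

        private void createAndFillIndex(RAMDirectory ram) throws IOException {
            IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_47, new SimpleAnalyzer(Version.LUCENE_47));
            IndexWriter writer = new IndexWriter(ram, conf);

            // "nul" occupies doc id 0 so that the ids below line up with the list above.
            add(writer, "nul"); //0
            add(writer, "fiets kopen"); //1
            add(writer, "fiets lopen"); //2
            add(writer, "harrie kopen"); //3
            add(writer, "harrie lopen"); //4
            add(writer, "harrie fiets"); //5
            add(writer, "kopen lopen"); //6

            writer.close();
        }

        private void add(IndexWriter writer, String value) throws IOException {
            Document doc = new Document();
            IndexableField f = new TextField("dummy", value, Field.Store.YES);
            doc.add(f);
            writer.addDocument(doc);
        }

    }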

Does anyone know what we are doing wrong?

Thanks!

Best answer

The documentation gives a hint here. It matches:

    spans from include which have no overlap with spans from exclude



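We are dealing with spans, not whole documents. The span matched by a simple term query, though, is just that single term. In each of the three matching documents in your example, the matched span is harrie, which does not overlap the kopen term in any of them.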
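To make the overlap rule concrete, here is a minimal plain-Java sketch of it applied to the documents from the question. It is a simplified model and not Lucene code: the class `SpanOverlapSketch` and its helpers are hypothetical names, spans are modelled as half-open ranges of token positions, and tokenization is a plain whitespace split.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of span matching; this is NOT Lucene code.
// A span is a half-open range [start, end) of token positions.
class SpanOverlapSketch {

    // Every occurrence of `term` becomes a one-token span {pos, pos + 1}.
    static List<int[]> termSpans(String text, String term) {
        List<int[]> spans = new ArrayList<>();
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                spans.add(new int[] {i, i + 1});
            }
        }
        return spans;
    }

    // Two half-open ranges overlap when each starts before the other ends.
    static boolean overlaps(int[] a, int[] b) {
        return a[0] < b[1] && b[0] < a[1];
    }

    // Keep only the include spans that overlap no exclude span.
    static List<int[]> spanNot(List<int[]> include, List<int[]> exclude) {
        List<int[]> kept = new ArrayList<>();
        for (int[] in : include) {
            boolean hit = false;
            for (int[] ex : exclude) {
                if (overlaps(in, ex)) {
                    hit = true;
                    break;
                }
            }
            if (!hit) {
                kept.add(in);
            }
        }
        return kept;
    }

    // A document matches when at least one include span survives.
    static boolean matches(String text, String includeTerm, String excludeTerm) {
        return !spanNot(termSpans(text, includeTerm), termSpans(text, excludeTerm)).isEmpty();
    }

    public static void main(String[] args) {
        String[] docs = {"nul", "fiets kopen", "fiets lopen", "harrie kopen",
                         "harrie lopen", "harrie fiets", "kopen lopen"};
        for (int id = 0; id < docs.length; id++) {
            if (matches(docs[id], "harrie", "kopen")) {
                System.out.println(id + ": " + docs[id]);
            }
        }
    }
}
```

In "harrie kopen" the include span covers position 0 and the exclude span covers position 1, so the two ranges never overlap and document 3 survives, which is the behaviour reported in the question.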

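An example that shows how it works may be more useful. You should be able to copy and paste the following snippets into your example (thanks for the MCVE, by the way!). Let's try this query: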

    SpanQuery include = new SpanTermQuery(new Term("dummy", "harrie"));
    SpanQuery exclude = new SpanTermQuery(new Term("dummy", "kopen"));
    SpanQuery matchterm = new SpanTermQuery(new Term("dummy", "match"));

    SpanQuery[] clauses = {include, matchterm};

    SpanQuery nearQuery = new SpanNearQuery(clauses, 2, true);

    Query spanNot = new SpanNotQuery(nearQuery, exclude);

Against these documents:

    add(writer, "harrie kopen match"); //1
    add(writer, "harrie match kopen"); //2
    add(writer, "harrie other stuff match kopen"); //3

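You should see 2 hits.
  • Document 1: nearQuery matches the span "harrie kopen match". That span contains "kopen" (that is, it overlaps a span matching exclude), so the SpanNotQuery eliminates it.
  • Document 2: nearQuery matches the span "harrie match". The document contains "kopen", but not within the matched span, so the document still matches.
  • Document 3: nearQuery matches the span "harrie other stuff match". Again, the document contains "kopen", but not within the matched span, so it passes.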
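The three cases above can be traced with a small plain-Java model. This is only a sketch under stated assumptions and not how Lucene enumerates spans internally: `NearNotSketch` and its methods are hypothetical names, the near-span is modelled as "first term, then second term, with at most `slop` tokens in between", and because the excluded query is a single term, "overlaps an exclude span" reduces to "the span contains that term".

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of an ordered near-span wrapped in a span-not; NOT Lucene code.
class NearNotSketch {

    // Spans [start, end) that begin at `first` and end just after a later `second`,
    // with at most `slop` tokens between the two terms.
    static List<int[]> orderedNear(String[] tokens, String first, String second, int slop) {
        List<int[]> spans = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) {
                continue;
            }
            for (int j = i + 1; j < tokens.length && j - i - 1 <= slop; j++) {
                if (tokens[j].equals(second)) {
                    spans.add(new int[] {i, j + 1});
                }
            }
        }
        return spans;
    }

    // True when some near-span contains no occurrence of `excluded`.
    static boolean matches(String text, String first, String second, int slop, String excluded) {
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int[] span : orderedNear(tokens, first, second, slop)) {
            boolean clean = true;
            for (int p = span[0]; p < span[1]; p++) {
                if (tokens[p].equals(excluded)) {
                    clean = false;
                    break;
                }
            }
            if (clean) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] docs = {"harrie kopen match", "harrie match kopen",
                         "harrie other stuff match kopen"};
        for (int i = 0; i < docs.length; i++) {
            System.out.println("document " + (i + 1) + " matches: "
                    + matches(docs[i], "harrie", "match", 2, "kopen"));
        }
    }
}
```

Under this model only document 1 is rejected, because it is the only one where "kopen" falls between "harrie" and "match", inside the matched span.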

If you want the negation to apply to the whole document rather than just the matched spans, use a BooleanQuery instead:

    SpanQuery include = new SpanTermQuery(new Term("dummy", "harrie"));
    SpanQuery exclude = new SpanTermQuery(new Term("dummy", "kopen"));
    // Declare as BooleanQuery: the Query base class has no add() method.
    BooleanQuery query = new BooleanQuery();
    query.add(new BooleanClause(include, BooleanClause.Occur.MUST));
    query.add(new BooleanClause(exclude, BooleanClause.Occur.MUST_NOT));
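For contrast with the span-level behaviour, document-level MUST / MUST_NOT semantics can be modelled in a few lines of plain Java. This is again a hypothetical sketch rather than Lucene code: `DocumentLevelNotSketch` is an invented name, and each document is reduced to the set of its whitespace-separated, lowercased terms.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Simplified model of document-level MUST / MUST_NOT; this is NOT Lucene code.
class DocumentLevelNotSketch {

    // A document matches when it contains `must` anywhere and `mustNot` nowhere.
    static boolean matches(String text, String must, String mustNot) {
        Set<String> terms = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\s+")));
        return terms.contains(must) && !terms.contains(mustNot);
    }

    public static void main(String[] args) {
        String[] docs = {"nul", "fiets kopen", "fiets lopen", "harrie kopen",
                         "harrie lopen", "harrie fiets", "kopen lopen"};
        for (int id = 0; id < docs.length; id++) {
            if (matches(docs[id], "harrie", "kopen")) {
                System.out.println(id + ": " + docs[id]);
            }
        }
    }
}
```

Here the position of "kopen" no longer matters, so "harrie kopen" is dropped and only documents 4 and 5 remain, which is the result the question expected.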

For this question (lucene - SpanNotQuery gives unexpected results, the exclude is ignored), the corresponding Stack Overflow question is: https://stackoverflow.com/questions/24260103/
