java - PhraseQuery not working in Apache Lucene 7.2.1

Reposted by: 行者123. Last updated: 2023-11-30 06:45:39

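I am new to Apache Lucene and am using v7.2.1. I need to run phrase searches over a very large file. To learn how phrase search works in Lucene, I first wrote a small sample that uses PhraseQuery, but it does not work. My code is as follows: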

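import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneExample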
{

  private static final String INDEX_DIR = "myIndexDir";
  // function to create index writer
  private static IndexWriter createWriter() throws IOException
  {
    FSDirectory dir = FSDirectory.open(Paths.get(INDEX_DIR));
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    IndexWriter writer = new IndexWriter(dir, config);
    return writer;
  }

  // function to create the index document.
  private static Document createDocument(Integer id, String source, String target)
  {
    Document document = new Document();
    document.add(new StringField("id", id.toString() , Store.YES));
    document.add(new TextField("source", source , Store.YES));
    document.add(new TextField("target", target , Store.YES));
    return document;
  }

  // function to do index search by source
  private static TopDocs searchBySource(String source, IndexSearcher searcher) throws Exception
  {
    // phrase query build
    PhraseQuery.Builder builder = new PhraseQuery.Builder();
    String[] words = source.split(" ");
    int ii = 0;
    for (String word : words) {
        builder.add(new Term("source", word), ii);
        ii = ii + 1;
    }
    PhraseQuery query = builder.build();
    System.out.println(query);
    // phrase search
    TopDocs hits = searcher.search(query, 10);
    return hits;
  }

  public static void main(String[] args) throws Exception 
  {
    // TODO Auto-generated method stub
    // create index writer
    IndexWriter writer = createWriter();
    //create documents object
    List<Document> documents = new ArrayList<>();

    String src = "Negotiation Skills are focused on resolving differences for the benefit of an individual or a group , or to satisfy various interests.";
    String tgt = "Modified target : Negotiation Skills are focused on resolving differences for the benefit of an individual or a group, or to satisfy various interests.";
    Document d1 = createDocument(1, src, tgt);
    documents.add(d1);

    src = "This point may benefit all of the participating entities, or just a single party, some of them, or all of them.";
    tgt = "Modified target : This point may benefit all of the participating entities, or just a single party, some of them, or all of them.";
    Document d2 = createDocument(2, src, tgt);
    documents.add(d2);

    writer.deleteAll();

    // adding documents to index writer
    writer.addDocuments(documents);
    writer.commit();
    writer.close();

    // for index searching

    Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
    IndexReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);

    //Search by source
    TopDocs foundDocs = searchBySource("benefit of an individual", searcher);
    System.out.println("Total Results count :: " + foundDocs.totalHits);
  }

}

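When I search for the string "benefit of an individual" as shown above, the total result count is 0, even though the phrase occurs in document 1. Any help resolving this would be greatly appreciated. Thanks in advance.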

Best answer

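Let's start with a summary:

At index time, you use StandardAnalyzer with its English stop words.
At query time, you use your own analysis, with no stop-word removal and no special-character removal.

As a rule, use the same analysis chain at index time and at query time.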
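
To make the mismatch concrete, here is a minimal plain-Java sketch. It does not use Lucene: the `analyze` method is a toy stand-in for an analyzer, and the `STOP_WORDS` list is a hypothetical, abbreviated list picked for this illustration. It only shows why a query built with `split(" ")` asks for terms that an analyzer with stop-word removal would never have put into the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration only; this is NOT Lucene's StandardAnalyzer.
class AnalysisMismatchDemo {

    // Hypothetical, abbreviated stop-word list (an assumption for this sketch).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "for", "of", "or", "the", "to"));

    // Index-time style analysis: lowercase, split on non-alphanumerics, drop stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Query-time style used in the question: a plain split on spaces, no analysis.
    static List<String> splitOnSpaces(String text) {
        return Arrays.asList(text.split(" "));
    }

    public static void main(String[] args) {
        String query = "benefit of an individual";
        System.out.println("analyzed: " + analyze(query));       // [benefit, individual]
        System.out.println("split:    " + splitOnSpaces(query)); // [benefit, of, an, individual]
    }
}
```

The split version contains "of" and "an", which the analyzed version drops. A phrase query that requires those two terms cannot match a field whose indexed terms never include them.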

Here is a simplified, "correct" example of query processing:

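  // Additional imports: org.apache.lucene.analysis.Analyzer, org.apache.lucene.analysis.TokenStream,
  // org.apache.lucene.analysis.tokenattributes.CharTermAttribute
  // Search the "source" field, analyzing the query text the same way the field was indexed
  private static TopDocs searchBySource(String source, IndexSearcher searcher) throws Exception {
    PhraseQuery.Builder builder = new PhraseQuery.Builder();
    // try-with-resources closes the token stream first, then the analyzer
    try (Analyzer analyzer = new StandardAnalyzer();
         TokenStream tokenStream = analyzer.tokenStream("source", source)) {
      CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
      tokenStream.reset();
      while (tokenStream.incrementToken()) {
        // No explicit position is given, so each term is placed right after the previous one
        builder.add(new Term("source", termAttribute.toString()));
      }
      tokenStream.end();
    }
    // Removed stop words leave position gaps in the index; a slop of 2 tolerates them for this query
    builder.setSlop(2);
    PhraseQuery query = builder.build();
    System.out.println(query);
    // phrase search
    return searcher.search(query, 10);
  }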

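To keep things simple, you can remove stop words from StandardAnalyzer by using the constructor that takes an empty stop-word set (for example, new StandardAnalyzer(CharArraySet.EMPTY_SET)) at both index time and query time; everything will then behave as you expect. You can read more about stop words and phrase queries here.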

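All the trouble with phrase queries starts with stop words. Under the hood, Lucene records the position of every term in a dedicated index structure (term positions), and a removed stop word still leaves a gap in the position sequence. This is useful in some cases, for example to tell "the goal" apart from "goal". A phrase query takes these term positions into account. For example, take the text "black and white", where "and" is a stop word. The Lucene index will then hold two terms: "black" at position 1 and "white" at position 3. The naive phrase query "black white" should not match anything, because it does not allow a gap between term positions. There are two possible strategies for building a correct query:

"black ? white": use a special placeholder token for each stop word. This matches both "black and white" and "black or white".
"black white"~1: allow a gap between the term positions. This also matches "black or white". With a larger slop, reversed-order text such as "white and black" can match as well.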

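The two strategies can be compared with a small plain-Java model of position-based matching. This is a simplified sketch, not Lucene's matcher: it assumes every query term occurs exactly once in the document, and the names `requiredSlop` and `matches` are made up for this illustration. The idea is that each term's document position minus its position inside the phrase should be the same for all terms; the spread between the largest and smallest difference is the slop needed.

```java
// Simplified model of sloppy phrase matching; names are hypothetical, not Lucene API.
class SloppyPhraseDemo {

    // docPositions[i]: position of query term i in the document.
    // queryOffsets[i]: position of query term i inside the phrase query.
    // Returns the smallest slop under which this model reports a match.
    static int requiredSlop(int[] docPositions, int[] queryOffsets) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < docPositions.length; i++) {
            int shift = docPositions[i] - queryOffsets[i];
            min = Math.min(min, shift);
            max = Math.max(max, shift);
        }
        return max - min;
    }

    static boolean matches(int[] docPositions, int[] queryOffsets, int slop) {
        return requiredSlop(docPositions, queryOffsets) <= slop;
    }

    public static void main(String[] args) {
        // "black and white" with "and" removed: "black" and "white" are two positions apart.
        int[] blackWhite = {0, 2};

        // Naive query "black white" (offsets 0 and 1): fails with slop 0, matches with slop 1.
        System.out.println(matches(blackWhite, new int[] {0, 1}, 0)); // false
        System.out.println(matches(blackWhite, new int[] {0, 1}, 1)); // true

        // Gap-aware query "black ? white" (offsets 0 and 2): matches exactly.
        System.out.println(matches(blackWhite, new int[] {0, 2}, 0)); // true

        // "benefit of an individual": the two surviving terms are three positions apart.
        System.out.println(requiredSlop(new int[] {0, 3}, new int[] {0, 1})); // 2
    }
}
```

The last line is consistent with the setSlop(2) call in the answer's snippet: two stop words sit between "benefit" and "individual", so consecutive query positions need a slop of 2.
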
To build a correct query, you can use the following term attribute during query processing:

PositionIncrementAttribute positionIncrementAttribute = tokenStream.getAttribute(PositionIncrementAttribute.class);

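I used setSlop(2) to keep the snippet simple; you can instead set the slop based on the query length, or pass the correct term positions to the phrase builder. My recommendation is not to use stop words at all; you can read about stop words here.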
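
As a rough sketch of the "correct term positions" option: a position increment says how far a token is from the previous one, so a running sum turns increments into absolute positions. The plain-Java example below only does that arithmetic; the class and method names are hypothetical. In Lucene the increments would come from PositionIncrementAttribute.getPositionIncrement() for each token, and each resulting position would be passed to PhraseQuery.Builder.add(Term, int).

```java
import java.util.Arrays;

// Plain-Java arithmetic sketch; the names here are hypothetical, not Lucene API.
class PositionIncrementDemo {

    // Converts per-token position increments into zero-based absolute positions.
    static int[] toPositions(int[] increments) {
        int[] positions = new int[increments.length];
        int position = -1; // a first increment of 1 yields position 0
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            positions[i] = position;
        }
        return positions;
    }

    public static void main(String[] args) {
        // "benefit of an individual" after stop-word removal:
        // "benefit" has increment 1, "individual" has increment 3 (it skips "of" and "an").
        int[] positions = toPositions(new int[] {1, 3});
        System.out.println(Arrays.toString(positions)); // [0, 3]
    }
}
```

With the terms added at positions 0 and 3, the query keeps the same gap as the indexed text, so under this approach no slop should be needed for this example.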

Regarding "java - PhraseQuery not working in Apache Lucene 7.2.1", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48515198/
