Lucene scoring seems completely incomprehensible to me.
I have a set of documents with the following titles:
Senior Education Recruitment Consultant
Senior IT Recruitment Consultant
Senior Recruitment Consultant
These have been analyzed with EnglishAnalyzer. The search query is built with QueryParser, also using EnglishAnalyzer.
When I search for Senior Recruitment Consultant, all of the documents above are returned with identical scores, whereas the desired (and expected) outcome is for Senior Recruitment Consultant to be the top result.
Is there a straightforward way to achieve this behavior that I have missed?
Here is my debug output:
4.6491017 = (MATCH) sum of:
1.1064172 = (MATCH) weight(Title:senior in 22157) [DefaultSimilarity], result of:
1.1064172 = score(doc=22157,freq=1.0 = termFreq=1.0
), product of:
0.4878372 = queryWeight, product of:
4.53601 = idf(docFreq=818, maxDocs=28116)
0.10754765 = queryNorm
2.268005 = fieldWeight in 22157, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
4.53601 = idf(docFreq=818, maxDocs=28116)
0.5 = fieldNorm(doc=22157)
2.3421772 = (MATCH) weight(Title:recruit in 22157) [DefaultSimilarity], result of:
2.3421772 = score(doc=22157,freq=1.0 = termFreq=1.0
), product of:
0.70978254 = queryWeight, product of:
6.5997033 = idf(docFreq=103, maxDocs=28116)
0.10754765 = queryNorm
3.2998517 = fieldWeight in 22157, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
6.5997033 = idf(docFreq=103, maxDocs=28116)
0.5 = fieldNorm(doc=22157)
1.2005073 = (MATCH) weight(Title:consult in 22157) [DefaultSimilarity], result of:
1.2005073 = score(doc=22157,freq=1.0 = termFreq=1.0
), product of:
0.50815696 = queryWeight, product of:
4.724947 = idf(docFreq=677, maxDocs=28116)
0.10754765 = queryNorm
2.3624735 = fieldWeight in 22157, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
4.724947 = idf(docFreq=677, maxDocs=28116)
0.5 = fieldNorm(doc=22157)
4.6491017 = (MATCH) sum of:
1.1064172 = (MATCH) weight(Title:senior in 22292) [DefaultSimilarity], result of:
1.1064172 = score(doc=22292,freq=1.0 = termFreq=1.0
), product of:
0.4878372 = queryWeight, product of:
4.53601 = idf(docFreq=818, maxDocs=28116)
0.10754765 = queryNorm
2.268005 = fieldWeight in 22292, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
4.53601 = idf(docFreq=818, maxDocs=28116)
0.5 = fieldNorm(doc=22292)
2.3421772 = (MATCH) weight(Title:recruit in 22292) [DefaultSimilarity], result of:
2.3421772 = score(doc=22292,freq=1.0 = termFreq=1.0
), product of:
0.70978254 = queryWeight, product of:
6.5997033 = idf(docFreq=103, maxDocs=28116)
0.10754765 = queryNorm
3.2998517 = fieldWeight in 22292, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
6.5997033 = idf(docFreq=103, maxDocs=28116)
0.5 = fieldNorm(doc=22292)
1.2005073 = (MATCH) weight(Title:consult in 22292) [DefaultSimilarity], result of:
1.2005073 = score(doc=22292,freq=1.0 = termFreq=1.0
), product of:
0.50815696 = queryWeight, product of:
4.724947 = idf(docFreq=677, maxDocs=28116)
0.10754765 = queryNorm
2.3624735 = fieldWeight in 22292, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
4.724947 = idf(docFreq=677, maxDocs=28116)
0.5 = fieldNorm(doc=22292)
4.6491017 = (MATCH) sum of:
1.1064172 = (MATCH) weight(Title:senior in 22494) [DefaultSimilarity], result of:
1.1064172 = score(doc=22494,freq=1.0 = termFreq=1.0
), product of:
0.4878372 = queryWeight, product of:
4.53601 = idf(docFreq=818, maxDocs=28116)
0.10754765 = queryNorm
2.268005 = fieldWeight in 22494, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
4.53601 = idf(docFreq=818, maxDocs=28116)
0.5 = fieldNorm(doc=22494)
2.3421772 = (MATCH) weight(Title:recruit in 22494) [DefaultSimilarity], result of:
2.3421772 = score(doc=22494,freq=1.0 = termFreq=1.0
), product of:
0.70978254 = queryWeight, product of:
6.5997033 = idf(docFreq=103, maxDocs=28116)
0.10754765 = queryNorm
3.2998517 = fieldWeight in 22494, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
6.5997033 = idf(docFreq=103, maxDocs=28116)
0.5 = fieldNorm(doc=22494)
1.2005073 = (MATCH) weight(Title:consult in 22494) [DefaultSimilarity], result of:
1.2005073 = score(doc=22494,freq=1.0 = termFreq=1.0
), product of:
0.50815696 = queryWeight, product of:
4.724947 = idf(docFreq=677, maxDocs=28116)
0.10754765 = queryNorm
2.3624735 = fieldWeight in 22494, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
4.724947 = idf(docFreq=677, maxDocs=28116)
0.5 = fieldNorm(doc=22494)
Senior Education Recruitment Consultant 4.6491017
Senior IT Recruitment Consultant 4.6491017
Senior Recruitment Consultant 4.6491017
Best Answer
The only scoring element you can rely on here is the length norm. The length norm is stored with the document at index time, combined with the field boost, and serves to give shorter fields higher scores.
Why isn't it working? You have two problems:
First: norms are stored in an extremely lossy, compressed form. They take up only a single byte each, with roughly one significant decimal digit of precision. So, basically, the differences here just aren't big enough to affect the score.
On the rationale for this lossiness, from the DefaultSimilarity documentation:
...given the difficulty (and inaccuracy) of users to express their true information need by a query, only big differences matter.
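The precision loss described above can be demonstrated with a small, self-contained re-implementation of the 3-mantissa-bit byte encoding that DefaultSimilarity uses for norms (modeled on SmallFloat.floatToByte315 / byte315ToFloat in the Lucene 4.x source; shown here purely for illustration, no Lucene dependency):

```java
public class NormPrecisionDemo {

    // Encode a float into one byte: 3 mantissa bits, zero-exponent offset 15,
    // as in Lucene's SmallFloat.floatToByte315.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Decode the byte back to a float, as in SmallFloat.byte315ToFloat.
    static float decode(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // lengthNorm = 1/sqrt(numTerms): a 3-term title vs a 4-term title.
        float norm3 = (float) (1.0 / Math.sqrt(3)); // ~0.577
        float norm4 = (float) (1.0 / Math.sqrt(4)); //  0.5
        System.out.println(decode(encode(norm3))); // prints 0.5
        System.out.println(decode(encode(norm4))); // prints 0.5
    }
}
```

Both 1/√3 and 1/√4 round-trip through the one-byte encoding to exactly 0.5, which is why every fieldNorm in the explain output above is 0.5 and the three titles tie.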
Second: "IT" is a stopword in English. You mean "Information Technology", but all the analyzer sees is the common English pronoun. No matter how many stopwords you put into the field, they will never affect the length norm.
Here is a test showing some results I came up with:
Senior Education Recruitment Consultant ::: 0.732527
Senior IT Recruitment Consultant ::: 0.732527
Senior Recruitment Consultant ::: 0.732527
if and but Senior IT IT IT IT IT Recruitment this and that Consultant ::: 0.732527
Senior Education Recruitment Consultant Of Justice ::: 0.64096117
Senior Recruitment Consultant and some other nonsense we don't want to know about ::: 0.3662635
As you can see, for "Senior Education Recruitment Consultant Of Justice" we added just one more indexed term and the length norm kicks in. But "if and but Senior IT IT IT IT IT Recruitment this and that Consultant" still shows no difference, because every added term is a common English stopword.
Solutions: you can work around the norm-precision problem with a custom Similarity implementation, which wouldn't be that hard to code (copy DefaultSimilarity and implement a lossless encodeNormValue and decodeNormValue). You could also set up the analyzer with a custom or empty stopword list (via the EnglishAnalyzer constructor).
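A minimal sketch of both workarounds, written against the Lucene 4.x API (the class name is hypothetical; note that a custom similarity must be set on both the IndexWriterConfig and the IndexSearcher, and the index rebuilt, before it takes effect):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.util.Version;

// Hypothetical similarity that stores the norm as full float bits
// instead of DefaultSimilarity's single lossy byte.
public class LosslessNormSimilarity extends DefaultSimilarity {
    @Override
    public long encodeNormValue(float f) {
        return Float.floatToIntBits(f); // no precision lost
    }

    @Override
    public float decodeNormValue(long norm) {
        return Float.intBitsToFloat((int) norm);
    }
}

// And an analyzer with an empty stopword set, so terms like "IT"
// are indexed and count toward the length norm:
//   Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_47, CharArraySet.EMPTY_SET);
```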
However, that might throw the baby out with the bathwater. If it is really important that exact matches score higher, it may be better to express that in the query itself, like this:
\"Senior Recruitment Consultant\" Senior Recruitment Consultant
Results:
Senior Recruitment Consultant ::: 1.465054
Senior Recruitment Consultant and some other nonsense we don't want to know about ::: 0.732527
Senior Education Recruitment Consultant ::: 0.27469763
Senior IT Recruitment Consultant ::: 0.27469763
if and but Senior IT IT IT IT IT Recruitment this and that Consultant ::: 0.27469763
Senior Education Recruitment Consultant Of Justice ::: 0.24036042
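For reference, that combined phrase-plus-terms query can be built with the same QueryParser setup described in the question (a sketch against the Lucene 4.x API; the field name Title is taken from the explain output above):

```java
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class ExactMatchBoostQuery {
    public static Query build() throws ParseException {
        QueryParser parser = new QueryParser(Version.LUCENE_47, "Title",
                new EnglishAnalyzer(Version.LUCENE_47));
        // The phrase clause only matches the exact (stemmed) sequence, so an
        // exact title scores from both the phrase and the individual terms,
        // while partial matches score from the individual terms alone.
        return parser.parse("\"Senior Recruitment Consultant\" Senior Recruitment Consultant");
    }
}
```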
A similar question about Lucene's best match not being the exact match can be found on Stack Overflow: https://stackoverflow.com/questions/29541678/