
java - Solr WordDelimiterFilter + Lucene Highlighter


I am trying to get the Highlighter class from Lucene to work properly with tokens coming from Solr's WordDelimiterFilter. It works 90% of the time, but if the matching text contains a ',' (such as "1,500") the output is incorrect:

Expected: 'test <b>1,500</b> this'

Observed: 'test 1<b>1,500</b> this'

At this point I am not sure whether it is the Highlighter messing up the recombination or the WordDelimiterFilter messing up the tokenizing, but something is unhappy. Here are the relevant dependencies from my pom:

<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>2.9.3</version>
  <type>jar</type>
  <scope>compile</scope>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-highlighter</artifactId>
  <version>2.9.3</version>
  <type>jar</type>
  <scope>compile</scope>
</dependency>
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-core</artifactId>
  <version>1.4.0</version>
  <type>jar</type>
  <scope>compile</scope>
</dependency>

And here is a simple JUnit test class demonstrating the problem:

package test.lucene;


import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;


import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;


import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.util.Version;
import org.apache.solr.analysis.StandardTokenizerFactory;
import org.apache.solr.analysis.WordDelimiterFilterFactory;
import org.junit.Test;


public class HighlighterTester {
    private static final String PRE_TAG = "<b>";
    private static final String POST_TAG = "</b>";

    private static String[] highlightField( Query query, String fieldName, String text )
            throws IOException, InvalidTokenOffsetsException {
        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter( PRE_TAG, POST_TAG );
        Highlighter highlighter = new Highlighter( formatter, new QueryScorer( query, fieldName ) );
        highlighter.setTextFragmenter( new SimpleFragmenter( Integer.MAX_VALUE ) );
        return highlighter.getBestFragments( getAnalyzer(), fieldName, text, 10 );
    }

    private static Analyzer getAnalyzer() {
        return new Analyzer() {
            @Override
            public TokenStream tokenStream( String fieldName, Reader reader ) {
                // Start with a StandardTokenizer
                TokenStream stream = new StandardTokenizerFactory().create( reader );

                // Chain on a WordDelimiterFilter
                WordDelimiterFilterFactory wordDelimiterFilterFactory = new WordDelimiterFilterFactory();
                HashMap<String, String> arguments = new HashMap<String, String>();
                arguments.put( "generateWordParts", "1" );
                arguments.put( "generateNumberParts", "1" );
                arguments.put( "catenateWords", "1" );
                arguments.put( "catenateNumbers", "1" );
                arguments.put( "catenateAll", "0" );
                wordDelimiterFilterFactory.init( arguments );

                return wordDelimiterFilterFactory.create( stream );
            }
        };
    }

    @Test
    public void testHighlighter() throws ParseException, IOException, InvalidTokenOffsetsException {
        String fieldName = "text";
        String text = "test 1,500 this";
        String queryString = "1500";
        String expected = "test " + PRE_TAG + "1,500" + POST_TAG + " this";

        QueryParser parser = new QueryParser( Version.LUCENE_29, fieldName, getAnalyzer() );
        Query q = parser.parse( queryString );
        String[] observed = highlightField( q, fieldName, text );
        for ( int i = 0; i < observed.length; i++ ) {
            System.out.println( "\t" + i + ": '" + observed[i] + "'" );
        }
        if ( observed.length > 0 ) {
            System.out.println( "Expected: '" + expected + "'\n" + "Observed: '" + observed[0] + "'" );
            assertEquals( expected, observed[0] );
        }
        else {
            assertTrue( "No matches found", false );
        }
    }
}

Anyone have any thoughts or suggestions?

Best Answer

After further investigation, this appears to be a bug in the Lucene Highlighter code. As you can see here:

public class TokenGroup {

    ...

    protected boolean isDistinct() {
        return offsetAtt.startOffset() >= endOffset;
    }

    ...
The code attempts to determine whether a group of tokens is distinct by checking whether the start offset is at or beyond the previous token's end offset. The problem with this approach is illustrated by this issue. If you were to step through the tokens, you would see that they are as follows (a small sketch to reproduce this dump follows the list):

0-4: 'test', 'test'
5-6: '1', '1'
7-10: '500', '500'
5-10: '1500', '1,500'
11-15: 'this', 'this'
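
The following sketch (not from the original post) reproduces this dump, reusing the getAnalyzer() from the test class above with Lucene 2.9's attribute API:

TokenStream stream = getAnalyzer().tokenStream( "text", new StringReader( "test 1,500 this" ) );
OffsetAttribute offsetAtt = (OffsetAttribute) stream.addAttribute( OffsetAttribute.class );
TermAttribute termAtt = (TermAttribute) stream.addAttribute( TermAttribute.class );
while ( stream.incrementToken() ) {
    // Print "start-end: 'term', 'original text at those offsets'"
    System.out.println( offsetAtt.startOffset() + "-" + offsetAtt.endOffset()
            + ": '" + termAtt.term() + "', '"
            + "test 1,500 this".substring( offsetAtt.startOffset(), offsetAtt.endOffset() ) + "'" );
}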

From this you can see that the third token starts after the end of the second, but the fourth token starts at the same position as the second. The expected outcome would be to group tokens 2, 3, and 4, but per this implementation token 3 is seen as separate from 2, so 2 shows up by itself, then 3 and 4 get grouped, leaving this result:

Expected: 'test <b>1,500</b> this'
Observed: 'test 1<b>1,500</b> this'

I am not sure this can be done without two passes, one to get all the indices and a second to combine them. Also, I am not sure what the implications would be outside of this specific case. Does anyone have any ideas here?

EDIT

Here is the final source I came up with. It groups things correctly. It also appears to be MUCH simpler than the Lucene Highlighter implementation, admittedly because it does not handle different levels of scoring, since my application only needs a yes/no as to whether a fragment of text gets highlighted. It is also worth noting that I am using their QueryScorer to score the text fragments, which does have the weakness of being term-oriented rather than phrase-oriented, meaning the search string "grammatical or spelling" would end up highlighted something like "<b>grammatical</b> or <b>spelling</b>", since the "or" would most likely get dropped by the analyzer. Anyway, here is my source, with a brief usage sketch after it:

public TextFragments<E> getTextFragments( TokenStream tokenStream,
                                          String text,
                                          Scorer scorer )
        throws IOException, InvalidTokenOffsetsException {
    OffsetAttribute offsetAtt = (OffsetAttribute) tokenStream.addAttribute( OffsetAttribute.class );
    TermAttribute termAtt = (TermAttribute) tokenStream.addAttribute( TermAttribute.class );
    // Give the scorer a chance to wrap the stream (QueryScorer returns a replacement).
    TokenStream newStream = scorer.init( tokenStream );
    if ( newStream != null ) {
        tokenStream = newStream;
    }

    TokenGroups tgs = new TokenGroups();
    scorer.startFragment( null );
    while ( tokenStream.incrementToken() ) {
        tgs.add( offsetAtt.startOffset(), offsetAtt.endOffset(), scorer.getTokenScore() );
        if ( log.isTraceEnabled() ) {
            log.trace( new StringBuilder()
                    .append( scorer.getTokenScore() )
                    .append( " " )
                    .append( offsetAtt.startOffset() )
                    .append( "-" )
                    .append( offsetAtt.endOffset() )
                    .append( ": '" )
                    .append( termAtt.term() )
                    .append( "', '" )
                    .append( text.substring( offsetAtt.startOffset(), offsetAtt.endOffset() ) )
                    .append( "'" )
                    .toString() );
        }
    }

    return tgs.fragment( text );
}

private class TokenGroup {
    private int startIndex;
    private int endIndex;
    private float score;

    public TokenGroup( int startIndex, int endIndex, float score ) {
        this.startIndex = startIndex;
        this.endIndex = endIndex;
        this.score = score;
    }
}

private class TokenGroups implements Iterable<TokenGroup> {
    private List<TokenGroup> tgs;

    public TokenGroups() {
        tgs = new ArrayList<TokenGroup>();
    }

    public void add( int startIndex, int endIndex, float score ) {
        add( new TokenGroup( startIndex, endIndex, score ) );
    }

    public void add( TokenGroup tg ) {
        // Walk backwards, merging with every earlier group this token overlaps;
        // this is what pulls '1', '500', and the catenated '1500' into one group.
        for ( int i = tgs.size() - 1; i >= 0; i-- ) {
            if ( tg.startIndex < tgs.get( i ).endIndex ) {
                tg = merge( tg, tgs.remove( i ) );
            }
            else {
                break;
            }
        }
        tgs.add( tg );
    }

    private TokenGroup merge( TokenGroup tg1, TokenGroup tg2 ) {
        return new TokenGroup( Math.min( tg1.startIndex, tg2.startIndex ),
                               Math.max( tg1.endIndex, tg2.endIndex ),
                               Math.max( tg1.score, tg2.score ) );
    }

    private TextFragments<E> fragment( String text ) {
        TextFragments<E> fragments = new TextFragments<E>();

        int lastEndIndex = 0;
        for ( TokenGroup tg : this ) {
            if ( tg.startIndex > lastEndIndex ) {
                fragments.add( text.substring( lastEndIndex, tg.startIndex ), textModeNormal );
            }
            fragments.add(
                    text.substring( tg.startIndex, tg.endIndex ),
                    tg.score > 0 ? textModeHighlighted : textModeNormal );
            lastEndIndex = tg.endIndex;
        }

        // Append any trailing text after the last token group.
        if ( lastEndIndex < text.length() ) {
            fragments.add( text.substring( lastEndIndex ), textModeNormal );
        }

        return fragments;
    }

    @Override
    public Iterator<TokenGroup> iterator() {
        return tgs.iterator();
    }
}
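
For reference, a minimal usage sketch (not part of the original answer). TextFragments, textModeNormal / textModeHighlighted, and log are application-specific pieces of the answerer's codebase that are not shown, so this only illustrates the intended call pattern under that assumption:

// Hypothetical usage; getAnalyzer() is the analyzer from the question,
// and 'query' is parsed the same way as in the test above.
String text = "test 1,500 this";
TokenStream stream = getAnalyzer().tokenStream( "text", new StringReader( text ) );
TextFragments<E> fragments = getTextFragments( stream, text, new QueryScorer( query, "text" ) );
// Each fragment is a substring of the original text flagged as highlighted
// or normal, so '1,500' comes back as a single highlighted run.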

Regarding "java - Solr WordDelimiterFilter + Lucene Highlighter", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/4566532/
