
Java code to test Solr token filters?


I'm trying to write some Java code to see how Solr token filters work.

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.TokenizerFactory;

public class TestFilter {

    public static void main(String[] args) throws IOException {
        StringReader inputText = new StringReader("This is a TEST string");
        Map<String, String> param = new HashMap<>();
        param.put("luceneMatchVersion", "LUCENE_44");

        TokenizerFactory stdTokenFact = new StandardTokenizerFactory(param);
        Tokenizer tokenizer = stdTokenFact.create(inputText);

        param.put("luceneMatchVersion", "LUCENE_44");
        LowerCaseFilterFactory lowerCaseFactory = new LowerCaseFilterFactory(param);
        TokenStream tokenStream = lowerCaseFactory.create(tokenizer);

        CharTermAttribute termAttrib = (CharTermAttribute) tokenStream.getAttribute(CharTermAttribute.class);
        System.out.println("CharTermAttribute Length = " + termAttrib.length());
        while (tokenStream.incrementToken()) {
            String term = termAttrib.toString();
            System.out.println(term);
        }
    }
}

I get the following output and error message.

CharTermAttribute Length = 0
Exception in thread "main" java.lang.NullPointerException
at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:923)
at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:1133)
at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:171)
at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
at com.utsav.solr.TestFilter.main(TestFilter.java:31)

Why does termAttrib.length() return zero?

What am I missing?

Best Answer

Following the JavaDoc of TokenStream:

The workflow of the new TokenStream API is as follows:

  1. Instantiation of TokenStream/TokenFilters which add/get attributes to/from the AttributeSource.
  2. The consumer calls TokenStream.reset().
  3. The consumer retrieves attributes from the stream and stores local references to all attributes it wants to access.
  4. The consumer calls incrementToken() until it returns false consuming the attributes after each call.
  5. The consumer calls end() so that any end-of-stream operations can be performed.
  6. The consumer calls close() to release any resource when finished using the TokenStream.

You need to rewrite your method as follows:

public static void main(String[] args) throws IOException {
    StringReader inputText = new StringReader("This is a TEST string");
    Map<String, String> param = new HashMap<>();
    param.put("luceneMatchVersion", "LUCENE_44");

    TokenizerFactory stdTokenFact = new StandardTokenizerFactory(param);
    Tokenizer tokenizer = stdTokenFact.create(inputText);

    param.put("luceneMatchVersion", "LUCENE_44");
    LowerCaseFilterFactory lowerCaseFactory = new LowerCaseFilterFactory(param);
    TokenStream tokenStream = lowerCaseFactory.create(tokenizer);

    CharTermAttribute termAttrib = (CharTermAttribute) tokenStream.getAttribute(CharTermAttribute.class);

    // The stream must be reset before the first call to incrementToken().
    tokenStream.reset();

    while (tokenStream.incrementToken()) {
        System.out.println("CharTermAttribute Length = " + termAttrib.length());
        System.out.println(termAttrib.toString());
    }

    // Perform any end-of-stream operations, then release resources.
    tokenStream.end();
    tokenStream.close();
}

This produces the following output:

CharTermAttribute Length = 4
this
CharTermAttribute Length = 2
is
CharTermAttribute Length = 1
a
CharTermAttribute Length = 4
test
CharTermAttribute Length = 6
string
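
Since TokenStream implements Closeable, the close() in step 6 can also be handled with try-with-resources. Here is a minimal sketch of the same consumption loop in that style (assuming the same tokenizer and lowerCaseFactory setup as above):

// Same pipeline as above; close() is invoked automatically,
// even if incrementToken() throws.
try (TokenStream stream = lowerCaseFactory.create(tokenizer)) {
    CharTermAttribute term = stream.getAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        System.out.println(term.toString());
    }
    stream.end(); // end-of-stream operations still happen inside the try block
}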

Edit: As mentioned in the comments, there is no need to call tokenStream.getAttribute, as the JavaDoc points out:

Note that only one instance per AttributeImpl is created and reused for every token. This approach reduces object creation and allows local caching of references to the AttributeImpls.
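
For illustration, the usual alternative is addAttribute, which returns the shared attribute instance and registers it first if the stream does not already have one, so neither the cast nor a presence check is required. A minimal sketch with the same setup as above:

// addAttribute returns the single reused CharTermAttribute instance,
// creating and registering it only if it is not already present.
CharTermAttribute termAttrib = tokenStream.addAttribute(CharTermAttribute.class);

tokenStream.reset();
while (tokenStream.incrementToken()) {
    System.out.println(termAttrib.toString());
}
tokenStream.end();
tokenStream.close();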

Regarding "Java code to test Solr token filters?", there is a similar question on Stack Overflow: https://stackoverflow.com/questions/25381564/
