
java - KeywordAnalyzer that handles different spellings of words with diacritics

Reposted. Author: 行者123. Updated: 2023-12-02 08:54:21

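How can I get KeywordAnalyzer to recognize a name like Müller regardless of how it is spelled?

KeywordAnalyzer requires an exact match. I want it to match Müller, but also Mueller (the ue digraph) and Muller.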

Best answer

The following custom analyzer solves the problem:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.de.GermanNormalizationFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;

public final class KeywordAnalyzerDE extends Analyzer {
    public KeywordAnalyzerDE() {
    }

    @Override
    protected TokenStreamComponents createComponents(final String fieldName) {
        // Emit the whole field value as a single token, as KeywordAnalyzer does.
        final Tokenizer source = new KeywordTokenizer();

        TokenStream result;
        // Normalize German umlauts, the ae/oe/ue digraphs, and ß.
        result = new GermanNormalizationFilter(source);
        // Fold any remaining non-ASCII characters to ASCII equivalents.
        result = new ASCIIFoldingFilter(result);

        return new TokenStreamComponents(source, result);
    }
}

The key piece is GermanNormalizationFilter. Its documentation says:

It allows for the fact that ä, ö and ü are sometimes written as ae, oe and ue.

  • 'ß' is replaced by 'ss'
  • 'ä', 'ö', 'ü' are replaced by 'a', 'o', 'u', respectively.
  • 'ae' and 'oe' are replaced by 'a', and 'o', respectively.
  • 'ue' is replaced by 'u', when not following a vowel or q.
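
To make the rules quoted above concrete, here is a minimal plain-Java sketch that mimics them without any Lucene dependency. The class name GermanFoldSketch and the method normalize are hypothetical and exist only for illustration. The sketch handles lowercase letters only and approximates the documented behavior; it is not Lucene's actual implementation.

```java
final class GermanFoldSketch {
    private GermanFoldSketch() {
    }

    /** Applies the four documented rules to a single term (lowercase letters only). */
    static String normalize(String term) {
        StringBuilder out = new StringBuilder(term.length() + 2);
        // state remembers what the previous character permits:
        //   'U' = an 'e' directly after it is dropped (ae, oe, ue digraphs)
        //   'V' = previous char was a vowel or q, so a following "ue" is kept
        //   'N' = anything else
        char state = 'N';
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            switch (c) {
                case 'a':
                case 'o':
                    out.append(c);
                    state = 'U';
                    break;
                case 'u':
                    out.append(c);
                    // 'u' starts a digraph only when it does not follow a vowel or q
                    state = (state == 'N') ? 'U' : 'V';
                    break;
                case 'e':
                    if (state != 'U') {
                        out.append(c); // keep 'e' unless it completes ae/oe/ue
                    }
                    state = 'V';
                    break;
                case 'i':
                case 'q':
                case 'y':
                    out.append(c);
                    state = 'V';
                    break;
                case '\u00e4': // a-umlaut
                    out.append('a');
                    state = 'V';
                    break;
                case '\u00f6': // o-umlaut
                    out.append('o');
                    state = 'V';
                    break;
                case '\u00fc': // u-umlaut
                    out.append('u');
                    state = 'V';
                    break;
                case '\u00df': // sharp s
                    out.append("ss");
                    state = 'N';
                    break;
                default:
                    out.append(c);
                    state = 'N';
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] samples = {"M\u00fcller", "Mueller", "Muller", "Stra\u00dfe", "quelle"};
        for (String s : samples) {
            System.out.println(s + " -> " + normalize(s));
        }
    }
}
```

With this sketch, Müller, Mueller and Muller all normalize to the same term, Muller, which is why an exact keyword match on the normalized form succeeds for every spelling.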

I added ASCIIFoldingFilter in case the processed text contains other diacritics.
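
As a rough illustration of what folding other diacritics means, the sketch below strips accents using only the JDK. It is an approximation, not ASCIIFoldingFilter itself: the class DiacriticFoldSketch and the method fold are hypothetical names, and this approach only handles characters that decompose into a base letter plus combining marks, whereas the Lucene filter covers many more characters.

```java
import java.text.Normalizer;

final class DiacriticFoldSketch {
    private DiacriticFoldSketch() {
    }

    /** Removes combining marks after canonical decomposition (NFD). */
    static String fold(String text) {
        // NFD splits an accented letter into its base letter plus combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches the combining marks, which are then dropped.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        String[] samples = {"Ren\u00e9", "Mu\u00f1oz", "Muller"};
        for (String s : samples) {
            System.out.println(s + " -> " + fold(s));
        }
    }
}
```

Characters without a canonical decomposition, such as ß, pass through this sketch unchanged, which is one reason the analyzer runs GermanNormalizationFilter first.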

Reading the filter's source code was also very helpful.

For java - KeywordAnalyzer that handles different spellings of words with diacritics, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60579871/
