
python - Writing custom analyzers in PyLucene / inheritance using JCC?


I want to write a custom analyzer in PyLucene. Normally in Java Lucene, when you write an analyzer class, your class inherits from Lucene's Analyzer class.

But PyLucene uses JCC, a Java-to-C++/Python compiler.

So how do you make a Python class inherit from a Java class with JCC, and in particular, how do you write a custom PyLucene analyzer?

Thanks.

Best Answer

Here is an example of an analyzer that wraps the EdgeNGram filter.

import lucene

class EdgeNGramAnalyzer(lucene.PythonAnalyzer):
    '''
    This is an example of a custom Analyzer (in this case an edge-n-gram
    analyzer). EdgeNGram analyzers are good for type-ahead.
    '''

    def __init__(self, side, minlength, maxlength):
        '''
        Args:
            side [enum]: lucene.EdgeNGramTokenFilter.Side.FRONT or
                         lucene.EdgeNGramTokenFilter.Side.BACK
            minlength [int]: minimum n-gram length
            maxlength [int]: maximum n-gram length
        '''
        lucene.PythonAnalyzer.__init__(self)
        self.side = side
        self.minlength = minlength
        self.maxlength = maxlength

    def tokenStream(self, fieldName, reader):
        # Tokenize on non-letters and lowercase, then normalize, drop stop
        # words, fold accents to ASCII, and finally emit edge n-grams.
        result = lucene.LowerCaseTokenizer(lucene.Version.LUCENE_CURRENT, reader)
        result = lucene.StandardFilter(result)
        result = lucene.StopFilter(True, result,
                                   lucene.StopAnalyzer.ENGLISH_STOP_WORDS_SET)
        result = lucene.ASCIIFoldingFilter(result)
        result = lucene.EdgeNGramTokenFilter(result, self.side,
                                             self.minlength, self.maxlength)
        return result
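
Here is a minimal sketch of indexing with that analyzer, assuming the flat-namespace PyLucene 3.x API used above; the RAMDirectory and the "title" field are only for illustration.

import lucene
lucene.initVM()

analyzer = EdgeNGramAnalyzer(lucene.EdgeNGramTokenFilter.Side.FRONT, 1, 20)
directory = lucene.RAMDirectory()
writer = lucene.IndexWriter(directory, analyzer, True,
                            lucene.IndexWriter.MaxFieldLength.UNLIMITED)

doc = lucene.Document()
# Each token that survives the filter chain is expanded into its front
# edge n-grams, e.g. "action" -> a, ac, act, ... (good for type-ahead).
doc.add(lucene.Field("title", "lucene in action",
                     lucene.Field.Store.YES, lucene.Field.Index.ANALYZED))
writer.addDocument(doc)
writer.close()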

Here is another example, implementing a Porter-stemming analyzer.

# This sample illustrates how to write an Analyzer 'extension' in Python.
#
# What is happening behind the scenes?
#
# The PorterStemmerAnalyzer python class does not in fact extend Analyzer;
# it merely provides an implementation for Analyzer's abstract tokenStream()
# method. When an instance of PorterStemmerAnalyzer is passed to PyLucene,
# with a call to IndexWriter(store, PorterStemmerAnalyzer(), True) for
# example, the PyLucene (JCC-generated) glue code wraps it into an instance
# of PythonAnalyzer, a proper Java extension of Analyzer which implements a
# native tokenStream() method whose job is to call the tokenStream() method
# on the python instance it wraps. The PythonAnalyzer instance is the
# Analyzer extension bridge to PorterStemmerAnalyzer.

'''
More explanation...

Analyzers split a chunk of text up into tokens.
An Analyzer is applied to an index globally (unless you use a
PerFieldAnalyzerWrapper).
Analyzers are built from Tokenizers and TokenFilters: a Tokenizer breaks a
string up into tokens, and TokenFilters transform tokens into other tokens
or filter tokens out.
'''

import sys, os
from datetime import datetime
from lucene import *
from IndexFiles import IndexFiles


class PorterStemmerAnalyzer(PythonAnalyzer):

    def tokenStream(self, fieldName, reader):
        # There can only be one Tokenizer in each Analyzer; everything
        # after it in the chain is a TokenFilter.
        result = StandardTokenizer(Version.LUCENE_CURRENT, reader)
        result = StandardFilter(result)
        result = LowerCaseFilter(result)
        result = PorterStemFilter(result)
        result = StopFilter(True, result, StopAnalyzer.ENGLISH_STOP_WORDS_SET)
        return result


if __name__ == '__main__':
    if len(sys.argv) < 2:
        sys.exit("requires at least one argument: lucene-index-path")
    initVM()
    start = datetime.now()
    try:
        IndexFiles(sys.argv[1], "index", PorterStemmerAnalyzer())
        end = datetime.now()
        print end - start
    except Exception, e:
        print "Failed: ", e

Check out PerFieldAnalyzerWrapper.java and also KeywordAnalyzerTest.py:

analyzer = PerFieldAnalyzerWrapper(SimpleAnalyzer())
analyzer.addAnalyzer("partnum", KeywordAnalyzer())

query = QueryParser(Version.LUCENE_CURRENT, "description",
                    analyzer).parse("partnum:Q36 AND SPACE")
scoreDocs = self.searcher.search(query, 50).scoreDocs
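
The wrapper matters because SimpleAnalyzer would lowercase and split "Q36", so the query could never match, while KeywordAnalyzer keeps the whole field value as a single token. Here is a sketch of a document that query would find, assuming an open IndexWriter named writer as in the earlier snippet; the description text is only illustrative.

doc = Document()
# NOT_ANALYZED stores "Q36" as one indexed token, matching what
# KeywordAnalyzer does to the query side.
doc.add(Field("partnum", "Q36", Field.Store.YES, Field.Index.NOT_ANALYZED))
doc.add(Field("description", "Illidium Space Modulator",
              Field.Store.YES, Field.Index.ANALYZED))
writer.addDocument(doc)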

Regarding "python - Writing custom analyzers in PyLucene / inheritance using JCC?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/2012843/
