python - Deep NLP pipeline with Whoosh

Reposted. Author: 太空宇宙. Updated: 2023-11-04 04:54:12

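I am new to NLP and IR programming. I am trying to implement a deep NLP pipeline, i.e. adding lemmatization and dependency parsing features to the indexing of sentences. Below are my schema and searcher.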

my_analyzer = RegexTokenizer()| StopFilter()| LowercaseFilter() | StemFilter() | Lemmatizer()
pos_analyser = RegexTokenizer() | StopFilter()| LowercaseFilter() | PosTagger()
schema = Schema(id=ID(stored=True, unique=True), stem_text=TEXT(stored= True, analyzer=my_analyzer), pos_tag= pos_analyser)

for sentence in sent_tokenize_list1:
    writer.add_document(stem_text = sentence, pos_tag = sentence)
for sentence in sent_tokenize_list2:
    writer.add_document(stem_text = sentence, pos_tag = sentence)
writer.commit()
with ix.searcher() as searcher:
    og = qparser.OrGroup.factory(0.9)
    query_text = MultifieldParser(["stem_text","pos_tag"], schema = ix.schema, group= og).parse(
        "who is controlling the threat of locusts?")
    results = searcher.search(query_text, sortedby= scores, limit = 10 )

Here is the custom analyzer.

class PosTagger(Filter):
    def __eq__(self, other):
        return (other
                and self.__class__ is other.__class__
                and self.__dict__ == other.__dict__)

    def __ne__(self, other):
        return not self == other

    def __init__(self):
        self.cache = {}

    def __call__(self, tokens):
        assert hasattr(tokens, "__iter__")
        words = []
        tokens1, tokens2 = itertools.tee(tokens)
        for t in tokens1:
            words.append(t.text)
        tags = pos_tag(words)
        i=0
        for t in tokens2:
            t.text = tags[i][0] + " "+ tags[i][1]
            i += 1
            yield t
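
A side note on the filter above, separate from the error reported below: Whoosh tokenizers generally reuse a single token object and mutate it for each word, so buffering the stream with itertools.tee can leave the second pass holding several references to the same object. The following is a minimal sketch of that pitfall and one possible workaround (buffering copies), written without Whoosh. FakeToken, reusing_tokenizer and toy_pos_tag are hypothetical stand-ins for a Whoosh token, a Whoosh tokenizer and NLTK's pos_tag; they are assumptions made for illustration, not part of the original code.

```python
import copy
import itertools


class FakeToken:
    """Hypothetical stand-in for a Whoosh token: text plus position."""

    def __init__(self, text="", pos=0):
        self.text = text
        self.pos = pos


def reusing_tokenizer(sentence):
    """Yield the SAME token object for every word, mutated in place."""
    token = FakeToken()
    for pos, word in enumerate(sentence.split()):
        token.text = word
        token.pos = pos
        yield token


def toy_pos_tag(words):
    """Hypothetical tagger standing in for nltk.pos_tag in this sketch."""
    return [(w, "NNS" if w.endswith("s") else "NN") for w in words]


def tag_tokens(tokens):
    """Buffer copies of the tokens, tag all words at once, then re-emit."""
    buffered = [copy.copy(t) for t in tokens]  # copies keep per-token state
    tags = toy_pos_tag([t.text for t in buffered])
    for token, (word, tag) in zip(buffered, tags):
        token.text = word + " " + tag
        yield token


# Pitfall: after the first tee branch is exhausted, the second branch
# replays references to one object that holds only the last token's state.
first, second = itertools.tee(reusing_tokenizer("threat of locusts"))
list(first)
naive_positions = [t.pos for t in second]

# Workaround: each buffered copy keeps its own position.
tagged = [(t.pos, t.text) for t in tag_tokens(reusing_tokenizer("threat of locusts"))]

print(naive_positions)
print(tagged)
```

In a real Whoosh filter the same idea would apply to whatever per-token attributes the tokenizer sets; whether it matters in practice depends on which of those attributes the field actually indexes.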

I get the following error.

whoosh.fields.FieldConfigurationError: CompositeAnalyzer(RegexTokenizer(expression=re.compile('\w+(\.?\w+)*'), gaps=False), StopFilter(stops=frozenset({'for', 'will', 'tbd', 'with', 'and', 'the', 'if', 'it', 'by', 'is', 'are', 'this', 'as', 'when', 'us', 'or', 'from', 'yet', 'you', 'have', 'can', 'be', 'we', 'of', 'to', 'on', 'a', 'an', 'your', 'at', 'in', 'may', 'not', 'that'}), min=2, max=None, renumber=True), LowercaseFilter(), PosTagger(cache={})) is not a FieldType object

Am I doing something wrong? Is this the right way to add an NLP pipeline to a search engine?

Best answer

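pos_tag should be assigned a field, TEXT(stored=True, analyzer=pos_analyser), rather than pos_analyser itself. Schema only accepts FieldType objects as field values, which is why it rejects the bare CompositeAnalyzer.

So in the schema you should have:

schema = Schema(id=ID(stored=True, unique=True), stem_text=TEXT(stored=True, analyzer=my_analyzer), pos_tag=TEXT(stored=True, analyzer=pos_analyser))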

Regarding python - Deep NLP pipeline with Whoosh, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47458616/
