
regex - Named Entity Recognition with Regular Expression: NLTK

Repost. Author: 行者123. Updated: 2023-12-04 02:58:45

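I have been working with the NLTK toolkit. I run into this problem often and have searched online for a solution, but found no satisfying answer, so I am posting my question here.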

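Often the NER chunker does not tag consecutive NNP tokens as a single named entity (NE). I think modifying the NER step to also use RegexpTagger could improve it.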

Example:

Input:

Barack Obama is a great person.



Output:

Tree('S', [Tree('PERSON', [('Barack', 'NNP')]), Tree('ORGANIZATION', [('Obama', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'), ('.', '.')])



However:

Input:

Former Vice President Dick Cheney told conservative radio host Laura Ingraham that he "was honored" to be compared to Darth Vader while in office.



Output:

Tree('S', [('Former', 'JJ'), ('Vice', 'NNP'), ('President', 'NNP'), Tree('NE', [('Dick', 'NNP'), ('Cheney', 'NNP')]), ('told', 'VBD'), ('conservative', 'JJ'), ('radio', 'NN'), ('host', 'NN'), Tree('NE', [('Laura', 'NNP'), ('Ingraham', 'NNP')]), ('that', 'IN'), ('he', 'PRP'), ('``', '``'), ('was', 'VBD'), ('honored', 'VBN'), ("''", "''"), ('to', 'TO'), ('be', 'VB'), ('compared', 'VBN'), ('to', 'TO'), Tree('NE', [('Darth', 'NNP'), ('Vader', 'NNP')]), ('while', 'IN'), ('in', 'IN'), ('office', 'NN'), ('.', '.')])



Here Vice/NNP, President/NNP, (Dick/NNP, Cheney/NNP) are extracted correctly.

So I think that if nltk.ne_chunk is run first, and two consecutive trees then consist of NNP tokens, there is a good chance that both refer to one entity.
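As a rough illustration of that idea, and of one way it can go wrong, here is a minimal sketch that joins every run of adjacent NNP tokens. It is not part of the original question: the function name merge_consecutive_nnp is hypothetical, and the (token, tag) pairs are copied by hand from the second output above rather than produced by NLTK, so the snippet runs without any NLTK data.

```python
from itertools import groupby

def merge_consecutive_nnp(tagged_tokens):
    """Join each run of adjacent NNP-tagged tokens into one candidate entity."""
    entities = []
    # groupby splits the sequence into runs that share the same "is NNP" flag.
    for is_nnp, run in groupby(tagged_tokens, key=lambda pair: pair[1] == "NNP"):
        if is_nnp:
            entities.append(" ".join(token for token, _tag in run))
    return entities

# Hand-written (token, tag) pairs taken from the second output above;
# in practice they would come from pos_tag(word_tokenize(text)).
tagged = [
    ("Former", "JJ"), ("Vice", "NNP"), ("President", "NNP"),
    ("Dick", "NNP"), ("Cheney", "NNP"), ("told", "VBD"),
    ("conservative", "JJ"), ("radio", "NN"), ("host", "NN"),
    ("Laura", "NNP"), ("Ingraham", "NNP"), ("that", "IN"),
]
print(merge_consecutive_nnp(tagged))
# ['Vice President Dick Cheney', 'Laura Ingraham']
```

Merging on the tag alone folds the title "Vice President" into the name, so a purely tag-based rule may be too greedy for some uses.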

Any suggestions would be greatly appreciated. I am looking for flaws in my approach.

Thanks.

Best answer

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    # Tokenize, POS-tag, then run the NE chunker.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []  # merged entities, in order of first appearance
    current_chunk = []     # text of the adjacent NE subtrees seen so far

    for i in chunked:
        if isinstance(i, Tree):
            # An NE subtree: add its tokens to the current run.
            current_chunk.append(" ".join(token for token, pos in i.leaves()))
        elif current_chunk:
            # An ordinary token ends the run: store the merged entity once.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []

    # Flush a run that reaches the end of the text.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk

txt = "Barack Obama is a great person."
print(get_continuous_chunks(txt))

[out]:
['Barack Obama']

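Note, however, that if adjacent chunks are not supposed to form a single NE, this combines multiple NEs into one. I can't think of such an example off the top of my head, but I'm sure it happens. If the entities are not adjacent, the script above works fine: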
>>> txt = "Barack Obama is the husband of Michelle Obama."  
>>> get_continuous_chunks(txt)
['Barack Obama', 'Michelle Obama']
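One possible way to keep an eye on the merging caveat is to return the labels of the merged chunks next to the merged text, so that a caller can inspect merges whose labels disagree. The sketch below is an assumption rather than part of the original answer: merge_with_labels is a hypothetical name, and each NE subtree is modeled as a plain (label, leaves) pair standing in for an nltk Tree, so the snippet runs without NLTK.

```python
def merge_with_labels(chunks):
    """Merge adjacent entity chunks, keeping every label seen in each merge."""
    merged = []
    words, labels = [], []
    for item in chunks:
        if isinstance(item[1], list):   # (label, leaves): stand-in for an NE subtree
            label, leaves = item
            words.extend(token for token, _tag in leaves)
            labels.append(label)
        elif words:                     # an ordinary (token, tag) pair ends the run
            merged.append((" ".join(words), labels))
            words, labels = [], []
    if words:                           # flush a run that reaches the end of input
        merged.append((" ".join(words), labels))
    return merged

# Hand-written stand-in for the first ne_chunk output shown in the question.
chunks = [
    ("PERSON", [("Barack", "NNP")]),
    ("ORGANIZATION", [("Obama", "NNP")]),
    ("is", "VBZ"), ("a", "DT"), ("great", "JJ"), ("person", "NN"), (".", "."),
]
print(merge_with_labels(chunks))
# [('Barack Obama', ['PERSON', 'ORGANIZATION'])]
```

A merge that carries mixed labels, as here, is a hint that either the chunker mislabeled one part or two distinct entities were joined, and the caller can decide which case applies.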

Regarding regex - Named Entity Recognition with Regular Expression: NLTK, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/24398536/
