
python - nltk StanfordNERTagger : How to get proper nouns without capitalization

Reposted. Author: 太空狗. Updated: 2023-10-30 01:06:55

I am trying to extract keywords from a piece of text using StanfordNERTagger and nltk.

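import re
from nltk.corpus import stopwords
from nltk.tag.stanford import StanfordNERTagger, StanfordPOSTagger

docText = "John Donk works for POI. Brian Jones wants to meet with Xyz Corp. for measuring POI's Short Term performance Metrics."

# split on runs of non-word characters (this also drops punctuation)
words = re.split(r"\W+", docText)

stops = set(stopwords.words("english"))

# remove stop words and tokens of one or two characters from the list
words = [w for w in words if w not in stops and len(w) > 2]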

str = " ".join(words)
print str
stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')
stanfordPosTagList=[word for word,pos in stp.tag(str.split()) if pos == 'NNP']

print "Stanford POS Tagged"
print stanfordPosTagList
tagged = stn.tag(stanfordPosTagList)
print tagged

This gives me

John Donk works POI Brian Jones wants meet Xyz Corp measuring POI Short Term performance Metrics
Stanford POS Tagged
[u'John', u'Donk', u'POI', u'Brian', u'Jones', u'Xyz', u'Corp', u'POI', u'Short', u'Term']
[(u'John', u'PERSON'), (u'Donk', u'PERSON'), (u'POI', u'ORGANIZATION'), (u'Brian', u'ORGANIZATION'), (u'Jones', u'ORGANIZATION'), (u'Xyz', u'ORGANIZATION'), (u'Corp', u'ORGANIZATION'), (u'POI', u'O'), (u'Short', u'O'), (u'Term', u'O')]

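Clearly, things like Short and Term were tagged as NNP. The data I have contains many such instances where non-NNP words are capitalized. This might be due to typos, or they may be headings. I don't have much control over that.

How can I parse or clean the data so that I can detect a non-NNP term even though it may be capitalized? I don't want terms like Short and Term to be categorized as NNP.

Also, I'm not sure why John Donk was captured as a person but Brian Jones was not. Could it be because of the other capitalized non-NNP words in my data? Could they affect how StanfordNERTagger treats everything else?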

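Update, one possible solution

Here is what I plan to do

  1. Take each word and convert it to lowercase
  2. Tag the lowercased word
  3. If the tag is NNP, then we know the original word must also be an NNP
  4. If not, then the original word was mis-capitalized

Here is what I tried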

str = " ".join(words)
print str
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')
for word in str.split():
    wl = word.lower()
    print wl
    w,pos = stp.tag(wl)
    print pos
    if pos=="NNP":
        print "Got NNP"
        print w

But this gives me an error

John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics
john
Traceback (most recent call last):
  File "X:\crp.py", line 37, in <module>
    w,pos = stp.tag(wl)
ValueError: too many values to unpack

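I tried several approaches, but some error always shows up. How do I tag a single word?

I don't want to convert the whole string to lowercase and then tag it. If I do that, StanfordPOSTagger returns an empty string.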
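
A note on the traceback above: NLTK's `tag()` takes a list of tokens and returns a list of `(word, tag)` pairs. Passing the bare string `wl` makes the tagger treat every character as a token, so several pairs come back and cannot be unpacked into `w, pos`. A minimal sketch of the fix follows. `EchoTagger` is a hypothetical stand-in with the same `tag()` shape so that the snippet runs without the Stanford jars, and the tags it produces are made up.

```python
class EchoTagger(object):
    """Hypothetical stand-in for StanfordPOSTagger: same tag() shape, fake tags."""

    def tag(self, tokens):
        # Like NLTK taggers, iterate over `tokens` and return (word, tag) pairs.
        return [(tok, "NNP" if tok[:1].isupper() else "NN") for tok in tokens]


def tag_single_word(tagger, word):
    """Tag one word by wrapping it in a one-element list."""
    pairs = tagger.tag([word])  # a list holding exactly one (word, tag) pair
    w, pos = pairs[0]           # unpack that single pair
    return w, pos


tagger = EchoTagger()

# A bare string is iterated character by character: one pair per letter.
per_char = tagger.tag("john")

# Wrapping the word in a list yields exactly one pair.
single = tag_single_word(tagger, "john")
```

With the real tagger, the equivalent call would be `stp.tag([wl])[0]`. Tagging isolated words still removes the sentence context the tagger relies on, so this only addresses the exception, not the tagging quality.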

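Best answer

First, see your other question for how to set up Stanford CoreNLP to be called from the command line or from Python: nltk : How to prevent stemming of proper nouns.

For the properly cased sentence, we see that NER works as expected: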

>>> from pycorenlp import StanfordCoreNLP
>>> nlp = StanfordCoreNLP('http://localhost:9000')
>>> text = ('John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics. '
... 'john donk works poi jones wants meet xyz corp measuring poi short term performance metrics')
>>> output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,ner', 'outputFormat': 'json'})
>>> annotated_sent0 = output['sentences'][0]
>>> annotated_sent1 = output['sentences'][1]
>>> for token in annotated_sent0['tokens']:
... print token['word'], token['lemma'], token['pos'], token['ner']
...
John John NNP PERSON
Donk Donk NNP PERSON
works work VBZ O
POI POI NNP ORGANIZATION
Jones Jones NNP ORGANIZATION
wants want VBZ O
meet meet VB O
Xyz Xyz NNP ORGANIZATION
Corp Corp NNP ORGANIZATION
measuring measure VBG O
POI poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
. . . O

For the lowercased sentence, you get neither NNP as a POS tag nor any NER tag:

>>> for token in annotated_sent1['tokens']:
... print token['word'], token['lemma'], token['pos'], token['ner']
...
john john NN O
donk donk JJ O
works work NNS O
poi poi VBP O
jones jone NNS O
wants want VBZ O
meet meet VB O
xyz xyz NN O
corp corp NN O
measuring measure VBG O
poi poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
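
The token lists above can be folded into entity spans by merging consecutive tokens that share a non-`O` tag. The following is a minimal sketch, assuming each token dictionary carries the `word` and `ner` keys printed above. `group_entities` is a hypothetical helper written for this example, not part of any CoreNLP wrapper, and the sample list is copied by hand from the first output.

```python
def group_entities(tokens):
    """Merge runs of adjacent tokens that share the same non-'O' NER tag."""
    entities = []
    current_words, current_tag = [], "O"
    for token in tokens:
        tag = token["ner"]
        if tag == current_tag and tag != "O":
            # Same entity type as the previous token: extend the current span.
            current_words.append(token["word"])
            continue
        if current_tag != "O":
            # The tag changed, so close the span collected so far.
            entities.append((" ".join(current_words), current_tag))
        current_words, current_tag = [token["word"]], tag
    if current_tag != "O":
        entities.append((" ".join(current_words), current_tag))
    return entities


# First seven tokens of the properly cased sentence, copied from the output above.
sample = [
    {"word": "John", "ner": "PERSON"},
    {"word": "Donk", "ner": "PERSON"},
    {"word": "works", "ner": "O"},
    {"word": "POI", "ner": "ORGANIZATION"},
    {"word": "Jones", "ner": "ORGANIZATION"},
    {"word": "wants", "ner": "O"},
    {"word": "meet", "ner": "O"},
]
entities = group_entities(sample)
```

Here `POI` and `Jones` end up in a single ORGANIZATION span because nothing separates them in the preprocessed text, which is one way that removing words and punctuation before tagging can distort the result.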

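So the questions to ask yourself are:

  • What is the ultimate goal of your NLP application?
  • Why is your input lowercased? Was it your doing, or is that how the data was provided?

After answering those questions, you can move on to decide what you really want to do with the NER tags, i.e.

  • If the input is lowercased because of how you structured your NLP tool chain, then

    • DO NOT do that!!! Perform NER on the normal text, without the distortions you have created. NER was trained on normal text, so it won't really work outside the context of normal text.
    • Also try not to mix NLP tools from different suites. They usually don't play nicely together, especially at the end of your NLP tool chain.
  • If the input is lowercased because that is how the original data was, then:

  • If the input has erroneous casing, e.g. `Some big Some Small but not all are Proper Noun`, then

    • Try the truecasing solution too.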
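
As a rough illustration of what truecasing means here, the sketch below learns the most frequent surface form of each word from properly cased reference text and applies it to wrongly cased input. It is a naive unigram baseline written for this example, not the truecaser the answer has in mind. `train_truecaser`, `truecase` and the tiny reference corpus are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Count surface forms per lowercased word, skipping sentence-initial tokens."""
    forms = defaultdict(Counter)
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence)
        # The first token is capitalized by convention, so it carries no signal.
        for token in tokens[1:]:
            forms[token.lower()][token] += 1
    return {lower: counts.most_common(1)[0][0] for lower, counts in forms.items()}


def truecase(text, model):
    """Replace each known word with its most frequent form; keep unknown words."""
    def fix(match):
        word = match.group(0)
        return model.get(word.lower(), word)
    return re.sub(r"\w+", fix, text)


# Toy reference text with normal casing (an assumption for this example).
reference = [
    "We track short term performance metrics every quarter.",
    "Yesterday Brian Jones reviewed the short term plan.",
    "The performance metrics were shared with Brian Jones.",
]
model = train_truecaser(reference)
fixed = truecase("Brian Jones wants Short Term performance Metrics", model)
```

The truecased string can then be tagged as a normal sentence. A realistic truecaser would need far more reference text and some use of context, since a unigram table cannot tell a name from a common word with the same spelling.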

Regarding python - nltk StanfordNERTagger : How to get proper nouns without capitalization, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34439208/
