
python - Stanford NER in NLTK is not tagging multiple sentences correctly - Python

Reposted. Author: 太空宇宙. Updated: 2023-11-03 16:50:18

I have a function that uses Stanford NER to return the named entities in a given body of text.

def get_named_entities(text):
    load_ner_files()

    print text[:100]  # to show that the text is fine
    text_split = text.split()
    print text_split  # to show the split is working fine
    result = "named entities = ", st.tag(text_split)
    return result

I am loading the text from a URL using the newspaper Python package.

def get_page_text():
    url = "https://aeon.co/essays/elon-musk-puts-his-case-for-a-multi-planet-civilisation"
    page = Article(url)
    page.download()
    page.parse()
    return unicodedata.normalize('NFKD', page.text).encode('ascii', 'ignore')

However, when I run the function, I get the following output:

['Fuck', 'Earth!', 'Elon', 'Musk', 'said', 'to', 'me,', 'laughing.', 'Who', 'cares', 'about', 'Earth?'......... (continued)
named entities = [('Fuck', 'O'), ('Earth', 'O'), ('!', 'O')]

So my question is: why are only the first three tokens tagged?

Best Answer

This assumes NLTK v3.2 is set up correctly; older versions of the NLTK Stanford wrapper returned tags for only the first sentence, which is exactly the behaviour in the question.

TL;DR:

pip install -U nltk

# or, with conda:
conda update nltk
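To confirm the upgrade took effect, a small helper can check the installed version string (a minimal sketch; `nltk_version_at_least` is a hypothetical helper, not part of NLTK):

```python
def nltk_version_at_least(version, minimum=(3, 2)):
    """True if a dotted version string like '3.2.5' is at least `minimum`."""
    parts = tuple(int(x) for x in version.split('.')[:2])
    return parts >= minimum

# After upgrading, nltk_version_at_least(nltk.__version__) should be True;
# anything older than 3.2 will reproduce the single-sentence truncation above.
print(nltk_version_at_least('3.2.5'))   # → True
print(nltk_version_at_least('3.0.4'))   # → False
```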
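NLTK's `StanfordNERTagger` locates the Stanford jar and model files through the `CLASSPATH` and `STANFORD_MODELS` environment variables. A sketch of that setup (the directory names are illustrative; point them at wherever you unpacked the Stanford NER download):

```shell
# Illustrative paths -- adjust to your Stanford NER installation.
export STANFORDTOOLSDIR=$HOME/stanford-ner-2015-12-09
export CLASSPATH=$STANFORDTOOLSDIR/stanford-ner.jar
export STANFORD_MODELS=$STANFORDTOOLSDIR/classifiers
```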

Once NLTK and the Stanford tools are set up (remember to set the environment variables):

import time
import urllib.request
from itertools import chain

from bs4 import BeautifulSoup
from nltk import word_tokenize, sent_tokenize
from nltk.tag import StanfordNERTagger

class Article:
    def __init__(self, url, encoding='utf8'):
        self.url = url
        self.encoding = encoding
        self.text = self.fetch_url_text()
        self.process_text()

    def fetch_url_text(self):
        response = urllib.request.urlopen(self.url)
        self.data = response.read().decode(self.encoding)
        self.bsoup = BeautifulSoup(self.data, 'html.parser')
        return '\n'.join([paragraph.text for paragraph
                          in self.bsoup.find_all('p')])

    def process_text(self):
        self.paragraphs = [sent_tokenize(p.strip())
                           for p in self.text.split('\n') if p]
        _sents = list(chain(*self.paragraphs))
        self.sents = [word_tokenize(sent) for sent in _sents]
        self.words = list(chain(*self.sents))


url = 'https://aeon.co/essays/elon-musk-puts-his-case-for-a-multi-planet-civilisation'

a1 = Article(url)
three_sentences = a1.sents[20:23]

st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')


# Tag multiple sentences in one go.
start = time.time()
tagged_sents = st.tag_sents(three_sentences)
print("Tagging took:", time.time() - start)
print(tagged_sents, end="\n\n")

for sent in tagged_sents:
    print(sent)
print()

# (Much slower) Tagging sentences one at a time:
# the Stanford NER process is re-launched for every call.
start = time.time()
tagged_sents = [st.tag(sent) for sent in three_sentences]
print("Tagging took:", time.time() - start)
for sent in tagged_sents:
    print(sent)
print()

[Output]:

Tagging took: 2.537247657775879
[[('Musk', 'PERSON'), ('was', 'O'), ('laughing', 'O'), ('because', 'O'), ('he', 'O'), ('was', 'O'), ('joking', 'O'), (':', 'O'), ('he', 'O'), ('cares', 'O'), ('a', 'O'), ('great', 'O'), ('deal', 'O'), ('about', 'O'), ('Earth', 'LOCATION'), ('.', 'O')], [('When', 'O'), ('he', 'O'), ('is', 'O'), ('not', 'O'), ('here', 'O'), ('at', 'O'), ('SpaceX', 'ORGANIZATION'), (',', 'O'), ('he', 'O'), ('is', 'O'), ('running', 'O'), ('an', 'O'), ('electric', 'O'), ('car', 'O'), ('company', 'O'), ('.', 'O')], [('But', 'O'), ('this', 'O'), ('is', 'O'), ('his', 'O'), ('manner', 'O'), ('.', 'O')]]

[('Musk', 'PERSON'), ('was', 'O'), ('laughing', 'O'), ('because', 'O'), ('he', 'O'), ('was', 'O'), ('joking', 'O'), (':', 'O'), ('he', 'O'), ('cares', 'O'), ('a', 'O'), ('great', 'O'), ('deal', 'O'), ('about', 'O'), ('Earth', 'LOCATION'), ('.', 'O')]
[('When', 'O'), ('he', 'O'), ('is', 'O'), ('not', 'O'), ('here', 'O'), ('at', 'O'), ('SpaceX', 'ORGANIZATION'), (',', 'O'), ('he', 'O'), ('is', 'O'), ('running', 'O'), ('an', 'O'), ('electric', 'O'), ('car', 'O'), ('company', 'O'), ('.', 'O')]
[('But', 'O'), ('this', 'O'), ('is', 'O'), ('his', 'O'), ('manner', 'O'), ('.', 'O')]

Tagging took: 7.375355243682861
[('Musk', 'PERSON'), ('was', 'O'), ('laughing', 'O'), ('because', 'O'), ('he', 'O'), ('was', 'O'), ('joking', 'O'), (':', 'O'), ('he', 'O'), ('cares', 'O'), ('a', 'O'), ('great', 'O'), ('deal', 'O'), ('about', 'O'), ('Earth', 'LOCATION'), ('.', 'O')]
[('When', 'O'), ('he', 'O'), ('is', 'O'), ('not', 'O'), ('here', 'O'), ('at', 'O'), ('SpaceX', 'ORGANIZATION'), (',', 'O'), ('he', 'O'), ('is', 'O'), ('running', 'O'), ('an', 'O'), ('electric', 'O'), ('car', 'O'), ('company', 'O'), ('.', 'O')]
[('But', 'O'), ('this', 'O'), ('is', 'O'), ('his', 'O'), ('manner', 'O'), ('.', 'O')]
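With the per-token output above in hand, contiguous tokens that share the same non-`O` tag can be merged into whole entities. A minimal sketch (`extract_entities` is a hypothetical helper, not part of NLTK; the `tagged_sents` literal is trimmed from the output above):

```python
from itertools import groupby

def extract_entities(tagged_sent):
    """Merge runs of identically tagged tokens, dropping the 'O' filler."""
    entities = []
    for tag, group in groupby(tagged_sent, key=lambda pair: pair[1]):
        if tag != 'O':
            entities.append((' '.join(tok for tok, _ in group), tag))
    return entities

tagged_sents = [
    [('Musk', 'PERSON'), ('cares', 'O'), ('about', 'O'),
     ('Earth', 'LOCATION'), ('.', 'O')],
    [('he', 'O'), ('is', 'O'), ('at', 'O'),
     ('SpaceX', 'ORGANIZATION'), (',', 'O')],
]
for sent in tagged_sents:
    print(extract_entities(sent))
# → [('Musk', 'PERSON'), ('Earth', 'LOCATION')]
# → [('SpaceX', 'ORGANIZATION')]
```

Grouping by tag also merges multi-token entities such as `('Elon', 'PERSON'), ('Musk', 'PERSON')` into a single `('Elon Musk', 'PERSON')`.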

Regarding "python - Stanford NER in NLTK is not tagging multiple sentences correctly - Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35899656/
