
python-3.x - Extracting names from a text file using spaCy

Reposted. Author: 行者123. Updated: 2023-12-03 16:58:08

I have a text file containing lines like the following:

Electronically signed : Wes Scott, M.D.; Jun 26 2010 11:10AM CST

The patient was referred by Dr. Jacob Austin.

Electronically signed by Robert Clowson, M.D.; Janury 15 2015 11:13AM CST

Electronically signed by Dr. John Douglas, M.D.; Jun 16 2017 11:13AM CST

The patient was referred by
Dr. Jayden Green Olivia.

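I want to extract all the names using spaCy. I have tried spaCy's part-of-speech tagging and entity recognition, but without success.
How can this be done? Any help would be appreciated.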

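This is the code I am using:
import spacy

# 'en' is a spaCy v2 shortcut link; on spaCy v3 load 'en_core_web_sm' instead
nlp = spacy.load('en')
document_string = (" Electronically signed by stupid: Dr. John Douglas, M.D.; "
                   "Jun 13 2018 11:13AM CST")
doc = nlp(document_string)
# Print each recognized entity together with its label
for ent in doc.ents:
    print(ent, ent.label_)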

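Best answer

The problem of model accuracy

The problem with all models is that they are not 100% accurate, and even using a larger model does not help with recognizing dates. The published accuracy values for the NER models (F-score, precision, recall) are all around 86%.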

document_string = """ 
Electronically signed : Wes Scott, M.D.; Jun 26 2010 11:10AM CST
The patient was referred by Dr. Jacob Austin.
Electronically signed by Robert Clowson, M.D.; Janury 15 2015 11:13AM CST
Electronically signed by Dr. John Douglas, M.D.; Jun 16 2017 11:13AM CST
The patient was referred by
Dr. Jayden Green Olivia.
"""

With the small model, two date items are labelled as PERSON:
import spacy

# 'en' is the spaCy v2 shortcut link, here pointing at the small English model
nlp = spacy.load('en')
sents = nlp(document_string)
[ee for ee in sents.ents if ee.label_ == 'PERSON']
# Out:
# [Wes Scott,
# Jun 26,
# Jacob Austin,
# Robert Clowson,
# John Douglas,
# Jun 16 2017,
# Jayden Green Olivia]

With the larger model en_core_web_md the results are even worse in terms of precision, since there are three misclassified entities.
nlp = spacy.load('en_core_web_md')
sents = nlp(document_string)
[ee for ee in sents.ents if ee.label_ == 'PERSON']
# Out:
# [Wes Scott,
# Jun 26,
# Jacob Austin,
# Robert Clowson,
# Janury,
# John Douglas,
# Jun 16 2017,
# Jayden Green Olivia]

I also tried other models (xx_ent_wiki_sm, en_core_web_md), but they did not bring any improvement either.
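To make "worse in terms of precision" concrete, here is a minimal sketch that scores the two PERSON lists shown above. It assumes the gold standard is the five names in the sample text without their titles, and it compares entities by exact string match, which is a simplification of how NER models are normally evaluated. The helper precision_recall_f1 is a hypothetical name used for illustration and is not part of spaCy.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of entity strings."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Assumed gold standard: the five names in the sample text, without titles.
gold = ["Wes Scott", "Jacob Austin", "Robert Clowson",
        "John Douglas", "Jayden Green Olivia"]

# PERSON entities listed in the two outputs above.
small_model = gold + ["Jun 26", "Jun 16 2017"]
medium_model = gold + ["Jun 26", "Janury", "Jun 16 2017"]

for name, predicted in [("small", small_model), ("en_core_web_md", medium_model)]:
    p, r, f = precision_recall_f1(predicted, gold)
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Under these assumptions both models find every name (recall is 1.0), and the difference lies only in precision: 5 of 7 predictions are correct for the small model versus 5 of 8 for en_core_web_md.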

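What about using rules to improve accuracy?

In this small example, not only does the document seem to have a clear structure, but the misclassified entities are all dates. So why not combine the initial model with a rule-based component?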
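As a rough illustration of how regular the sample text is, the sketch below pulls the names out with a plain regular expression from Python's standard library, with no statistical model involved. This is not spaCy and not the approach taken in this answer; it only assumes that every name follows "signed", "signed by" or "referred by", may carry a Dr/Mr/Ms title, and consists of two or more capitalized words. NAME_RE and extract_names are hypothetical names chosen for the example.

```python
import re

# Assumed pattern: trigger phrase, optional colon, optional title,
# then two or more capitalized words (captured as the name).
NAME_RE = re.compile(
    r"(?:signed(?:\s+by)?|referred\s+by)\s*:?\s*"
    r"(?:(?:Dr|Mr|Ms)\.?\s+)?"
    r"([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)"
)


def extract_names(text):
    """Return the names found after the assumed trigger phrases."""
    return NAME_RE.findall(text)


sample_text = """
Electronically signed : Wes Scott, M.D.; Jun 26 2010 11:10AM CST
The patient was referred by Dr. Jacob Austin.
Electronically signed by Robert Clowson, M.D.; Janury 15 2015 11:13AM CST
Electronically signed by Dr. John Douglas, M.D.; Jun 16 2017 11:13AM CST
The patient was referred by
Dr. Jayden Green Olivia.
"""

print(extract_names(sample_text))
```

A pattern like this breaks as soon as the wording changes (for example "signed by stupid: Dr. John Douglas" from the question would not match), which is why combining the statistical model with a narrow rule is the more robust option.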

The good news is that in spaCy:

it's possible to combine statistical and rule-based components in a variety of ways. Rule-based components can be used to improve the accuracy of statistical models

(from https://spacy.io/usage/rule-based-matching#models-rules)

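So, following the example and using the dateparser library (a parser for human-readable dates), I put together a rule-based component that works well on this example: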
from spacy.tokens import Span
import dateparser

def expand_person_entities(doc):
    new_ents = []
    for ent in doc.ents:
        # Only check for title if it's a person and not the first token
        if ent.label_ == "PERSON":
            if ent.start != 0:
                # if person preceded by title, include title in entity
                prev_token = doc[ent.start - 1]
                if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
                    new_ent = Span(doc, ent.start - 1, ent.end, label=ent.label)
                    new_ents.append(new_ent)
                else:
                    # if entity can be parsed as a date, it's not a person
                    if dateparser.parse(ent.text) is None:
                        new_ents.append(ent)
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

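# Add the component after the named entity recognizer
# (spaCy v2 API; in v3, register the function with @Language.component
# and pass its registered name to nlp.add_pipe instead)
# nlp.remove_pipe('expand_person_entities')
nlp.add_pipe(expand_person_entities, after='ner')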

doc = nlp(document_string)
[(ent.text, ent.label_) for ent in doc.ents if ent.label_=='PERSON']
# Out:
# [('Wes Scott', 'PERSON'),
# ('Dr. Jacob Austin', 'PERSON'),
# ('Robert Clowson', 'PERSON'),
# ('Dr. John Douglas', 'PERSON'),
# ('Dr. Jayden Green Olivia', 'PERSON')]
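
The date-filtering rule can also be tried in isolation, without spaCy or dateparser installed. The sketch below applies the same idea to the plain strings the small model returned as PERSON. It uses datetime.strptime from the standard library as a stand-in for dateparser.parse, with a short list of formats assumed to cover this sample only; looks_like_date and drop_date_like are hypothetical helper names.

```python
from datetime import datetime

# Assumed formats, chosen to cover the sample text only.
# dateparser recognizes far more variants than this.
DATE_FORMATS = ("%b %d %Y", "%B %d %Y")


def looks_like_date(text):
    """Return True if text parses as a date, with or without a year."""
    # Try the text as given and with a dummy year appended,
    # so that a fragment such as "Jun 26" is also recognized.
    for candidate in (text, f"{text} 2000"):
        for fmt in DATE_FORMATS:
            try:
                datetime.strptime(candidate, fmt)
                return True
            except ValueError:
                continue
    return False


def drop_date_like(entities):
    """Keep only the entity strings that do not parse as a date."""
    return [entity for entity in entities if not looks_like_date(entity)]


# PERSON entities returned by the small model in the output further up.
person_candidates = ["Wes Scott", "Jun 26", "Jacob Austin", "Robert Clowson",
                     "John Douglas", "Jun 16 2017", "Jayden Green Olivia"]

print(drop_date_like(person_candidates))
```

A strict check like this does not catch a misspelt month such as "Janury", so it is only a rough approximation of what a dedicated date parser can do.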

Regarding python-3.x - Extracting names from a text file using spaCy, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51490620/
