
python - SpaCy tags the newline character (\n) as a GPE named entity

Reposted. Author: 行者123. Updated: 2023-12-01 01:07:54

I am using SpaCy to extract named entities. However, it always mislabels the newline character as a named entity.

Below is the input text.

mytxt = """<?xml version="1.0"?>

<nitf>

<head>
<title>KNOW YOUR ROLE ON SUPER BOWL LIII.</title>
</head>

<body>

<body.head>

<hedline>
<hl1>KNOW YOUR ROLE ON SUPER BOWL LIII.</hl1>
</hedline>

<distributor>Gale Group</distributor>

</body.head>

<body.content>
<p>Montpelier: <org>Department of Motor Vehicles</org>, has issued the following
news release:</p>

<p>Be a designated sober driver, help save lives. Remember these tips
on game night:</p>

<p>Know your State&apos;s laws: refusing to take a breath test in many
jurisdictions could result in arrest, loss of your driver&apos;s
license, and impoundment of your vehicle. Not to mention the
embarrassment in explaining your situation to family, friends, and
employers.</p>

<p>In case of any query regarding this article or other content needs
please contact: <a href="mailto:editorial@plusmediasolutions.com">editorial@plusmediasolutions.com</a></p>
</body.content>

</body>
</nitf>


"""

Below is my code:

from collections import Counter

from bs4 import BeautifulSoup
import spacy

CONTENT_XML_TAG = ('p', 'ul', 'h3', 'h1', 'h2', 'ol')
soup = BeautifulSoup(mytxt, 'xml')
spacy_model = spacy.load('en_core_web_sm')
# Join the text of every content tag, one tag per line
content = "\n".join([p.get_text() for p in soup.find('body.content').findAll(CONTENT_XML_TAG)])
print(content)

section_spacy = spacy_model(content)
tokenized_sentences = []
for sent in section_spacy.sents:
    tokenized_sentences.append(sent)
# Count the (text, label) pairs of the entities found in each sentence
for s in tokenized_sentences:
    labels = [(ent.text, ent.label_) for ent in s.ents]
    print(Counter(labels))

Printed output:

Counter({('\n', 'GPE'): 2, ('Department of Motor Vehicles', 'ORG'): 1})
Counter({('\n', 'GPE'): 1})
Counter({('\n', 'GPE'): 2, ('State', 'ORG'): 1})
Counter({('\n', 'GPE'): 3})
Counter({('\n', 'GPE'): 1})

I can't believe SpaCy would misclassify like this. Am I missing something?

Best answer

from bs4 import BeautifulSoup
import spacy

CONTENT_XML_TAG = ('p', 'ul', 'h3', 'h1', 'h2', 'ol')
soup = BeautifulSoup(mytxt, 'xml')
spacy_model = spacy.load('en_core_web_sm')
content = "\n".join([p.get_text() for p in soup.find('body.content').findAll(CONTENT_XML_TAG)])

def remove_whitespace_entities(doc):
    # Keep only entities whose text is not purely whitespace (such as "\n")
    doc.ents = [e for e in doc.ents if not e.text.isspace()]
    return doc

# Run the filter right after the NER component.
# This is the spaCy v2 call style; spaCy v3 expects the function to be
# registered with @Language.component and added by its string name.
spacy_model.add_pipe(remove_whitespace_entities, after='ner')
doc = spacy_model(content)
print(doc.ents)
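
The filter in the answer comes down to a single predicate, str.isspace(), applied to each entity's text. That predicate can be tried without loading a model, using (text, label) pairs copied from the question's printed output. This is only a minimal sketch: the helper name drop_whitespace_labels is hypothetical, and it works on plain tuples rather than spaCy entity objects.

```python
from collections import Counter


def drop_whitespace_labels(labels):
    """Keep only (text, label) pairs whose text is not purely whitespace."""
    # Hypothetical helper: mirrors the `not e.text.isspace()` test from the answer.
    return [(text, label) for text, label in labels if not text.isspace()]


# Pairs copied from the first and third Counter lines of the question's output.
first = [("\n", "GPE"), ("\n", "GPE"), ("Department of Motor Vehicles", "ORG")]
third = [("\n", "GPE"), ("\n", "GPE"), ("State", "ORG")]

for labels in (first, third):
    print(Counter(drop_whitespace_labels(labels)))
```

After filtering, only the non-whitespace pairs remain in each counter. Note that this does not touch the questionable ('State', 'ORG') label; it only removes the whitespace entities.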

Regarding "python - SpaCy tags the newline character (\n) as a GPE named entity", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/55154045/
