
python - Filter data by NER entity labels in spaCy training data


I have NER training data for spaCy in the following format.

[('Christmas Perot 2021 TSO\nSkip to Main Content HOME CONCERTS EVENTS ABOUT STAFF EDUCATION SUPPORT US More Use tab to navigate through the menu items. BUY TICKETS SUNDAY, DECEMBER 12, 2021 I PEROT THEATRE I 4:00 PM\nPOPS I Christmas at The Perot\nCLICK HERE to purchase tickets, or contact the Texarkana Symphony Orchestra at 870.773.3401\nA Texarkana Tradition Join the TSO, the Texarkana Jazz Orchestra, and the TSO Chamber Singers, for this holiday concert for the whole family.\nDon’t miss seeing the winner of TSO’s 11th Annual Celebrity Conductor Competition\nBack to Events 2019 Texarkana Symphony Orchestra',
{'entities': [(375, 399, 'organization'),
(290, 318, 'organization'),
(220, 242, 'production_name'),
(169, 186, 'performance_date'),
(189, 202, 'auditorium'),
(205, 212, 'performance_starttime'),
(409, 428, 'organization')]})]

The text is the first element of each tuple. In 'entities', the numbers are the character positions (start and end) of each entity within the text. Some lines do not contain any entity; for example, the first line, "Christmas Perot 2021 TSO", has none. I need to remove the sentences that carry no entity. The sentences can be split on the "." and "\n" characters. I have the entity annotations as character offsets, but I have not managed to remove the unlabeled sentences.

Code

from tqdm import tqdm
import spacy
from spacy.tokens import DocBin  # DocBin is used below but was never imported

nlp = spacy.blank("en")  # load a new blank spaCy model
db = DocBin()  # create a DocBin object
for text, annot in tqdm(train_data):  # data in the format shown above
    doc = nlp.make_doc(text)  # create a Doc object from the text
    ents = []
    for start, end, label in annot["entities"]:  # character-offset annotations
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        print(start, end, span, label)
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents  # label the text with the ents
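For context, the usual next step after this loop is to add each labeled doc to the `DocBin` and write it to disk for `spacy train`. A minimal self-contained sketch (the training example and its offsets below are made up for illustration):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin()

# One made-up training example with a single character-offset annotation.
train_data = [("Tickets at Perot Theatre at 4:00 PM",
               {"entities": [(11, 24, "auditorium")]})]

for text, annot in train_data:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is not None:
            ents.append(span)
    doc.ents = ents
    db.add(doc)  # collect the labeled doc

db.to_disk("./train.spacy")  # binary corpus consumed by `spacy train`
```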

Best Answer

How about this:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import numpy as np

foo = \
[('''Christmas Perot 2021 TSO
Skip to Main Content HOME CONCERTS EVENTS ABOUT STAFF EDUCATION SUPPORT US More Use tab to navigate through the menu items. BUY TICKETS SUNDAY, DECEMBER 12, 2021 I PEROT THEATRE I 4:00 PM
POPS I Christmas at The Perot
CLICK HERE to purchase tickets, or contact the Texarkana Symphony Orchestra at 870.773.3401
A Texarkana Tradition Join the TSO, the Texarkana Jazz Orchestra, and the TSO Chamber Singers, for this holiday concert for the whole family.
Don\xe2\x80\x99t miss seeing the winner of TSO\xe2\x80\x99s 11th Annual Celebrity Conductor Competition
Back to Events 2019 Texarkana Symphony Orchestra''',
{'entities': [
(375, 399, 'organization'),
(290, 318, 'organization'),
(220, 242, 'production_name'),
(169, 186, 'performance_date'),
(189, 202, 'auditorium'),
(205, 212, 'performance_starttime'),
(409, 428, 'organization'),
]})]

print(foo[0][0])
sentences = re.split(r'\.|\n', foo[0][0])
sentence_lengths = list(map(len, sentences))

cumulative_sentence_length = np.cumsum(sentence_lengths) - 1

pick_indices = set()

entities = foo[0][1]['entities']

for e in entities:
    # only pick the first matching index (hence the second [0])
    idx = np.where(e[0] < cumulative_sentence_length)[0][0]
    print('\n\nIndex:', idx, 'Entity:', e, 'Range:', [
        [0, *cumulative_sentence_length][idx],
        [0, *cumulative_sentence_length][idx + 1]
    ], '\nSentence:', sentences[idx])
    pick_indices.add(idx)

print(pick_indices)
print('\n'.join([sentences[i] for i in pick_indices]))

This outputs the sentences with indices {2, 3, 4, 7}. The idea is to:

  1. split the text into sentences
  2. accumulate the sentence lengths
  3. check whether an entity's start index falls inside a range (and pick only the first matching index)
  4. (optionally) you can additionally check the entity's end index yourself
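The steps above can be sketched on a toy example (the text and offsets here are made up; note the +1 per sentence, which accounts for the separator character that the split removes so the cumulative boundaries stay aligned with the original character offsets):

```python
import numpy as np

text = "no entity here\nPerot Theatre\nnothing"
entities = [(15, 28, "auditorium")]   # offsets measured on `text`

sentences = text.split("\n")
# boundary i = index one past the end of sentence i (separator included)
boundaries = np.cumsum([len(s) + 1 for s in sentences])

pick = set()
for start, end, label in entities:
    # first sentence whose upper boundary lies beyond the entity start
    idx = int(np.where(start < boundaries)[0][0])
    pick.add(idx)

kept = [sentences[i] for i in sorted(pick)]
print(kept)  # → ['Perot Theatre']
```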

Look at the cumulative_sentence_length variable: it holds the values [ 23 145 209 238 320 323 327 467 467 552 600], which are the upper bounds of the sentence intervals.

Since you are working on a data-science topic, I assume using numpy is not a hurdle for you here.

About python - filtering data by NER entity labels in spaCy training data, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/69627517/
