
python - How do I update a trained spaCy NER model with a new training dataset?

Reposted. Author: 行者123. Updated: 2023-12-05 04:39:50

I'm new to NLP and have started learning how to train a custom NER model in spaCy.

TRAIN_DATA = [
('what is the price of polo?', {'entities': [(21, 25, 'Product')]}),
('what is the price of ball?', {'entities': [(21, 25, 'Product')]}),
('what is the price of jegging?', {'entities': [(21, 28, 'Product')]}),
('what is the price of t-shirt?', {'entities': [(21, 28, 'Product')]}),
('what is the price of jeans?', {'entities': [(21, 26, 'Product')]}),
('what is the price of bat?', {'entities': [(21, 24, 'Product')]}),
('what is the price of shirt?', {'entities': [(21, 26, 'Product')]}),
('what is the price of bag?', {'entities': [(21, 24, 'Product')]}),
('what is the price of cup?', {'entities': [(21, 24, 'Product')]}),
('what is the price of jug?', {'entities': [(21, 24, 'Product')]}),
('what is the price of plate?', {'entities': [(21, 26, 'Product')]}),
('what is the price of glass?', {'entities': [(21, 26, 'Product')]}),
('what is the price of moniter?', {'entities': [(21, 28, 'Product')]}),
('what is the price of desktop?', {'entities': [(21, 28, 'Product')]}),
('what is the price of bottle?', {'entities': [(21, 27, 'Product')]}),
('what is the price of mouse?', {'entities': [(21, 26, 'Product')]}),
('what is the price of keyboad?', {'entities': [(21, 28, 'Product')]}),
('what is the price of chair?', {'entities': [(21, 26, 'Product')]}),
('what is the price of table?', {'entities': [(21, 26, 'Product')]}),
('what is the price of watch?', {'entities': [(21, 26, 'Product')]})
]
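Since every annotation above is a hand-written (start, end, label) character offset, it can help to check that each span slices to the intended word before training. The following is a minimal sketch, not part of the original question; `check_offsets` and the `sample` list are hypothetical names introduced only for illustration.

```python
def check_offsets(train_data):
    """Return (text, start, end, label, problem) for every suspicious span."""
    problems = []
    for text, annotations in train_data:
        for start, end, label in annotations.get("entities", []):
            if not (0 <= start < end <= len(text)):
                problems.append((text, start, end, label, "out of range"))
            elif text[start:end] != text[start:end].strip():
                problems.append((text, start, end, label, "leading/trailing whitespace"))
    return problems


# Small illustrative sample; the second entry is deliberately wrong
# (its span starts one character early and includes a space).
sample = [
    ("what is the price of polo?", {"entities": [(21, 25, "Product")]}),
    ("what is the price of polo?", {"entities": [(20, 25, "Product")]}),
    ("Who is Chaka Khan?", {"entities": [(7, 17, "PERSON")]}),
]

# Show what each annotated span actually covers.
for text, annotations in sample:
    for start, end, label in annotations["entities"]:
        print(repr(text[start:end]), label)

# List the spans that look wrong.
print(check_offsets(sample))
```

Running the same check on TRAIN_DATA should return an empty list if all offsets line up with the product names.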

Training a blank spaCy model for the first time:

import random

import spacy  # this code uses the spaCy v2 API (create_pipe, begin_training)


def train_spacy(data, iterations):
    TRAIN_DATA = data
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)
    return nlp


start_training = train_spacy(TRAIN_DATA, 20)

Saving my trained spaCy model:

# Saving the trained model
start_training.to_disk("spacy_start_model")

My question is: how do I update the saved model with new training data? The new training data:

TRAIN_DATA_2 = [('Who is Chaka Khan?', {"entities": [(7, 17, 'PERSON')]}),
('I like London and Berlin.', {"entities": [(7, 13, 'LOC')]})]

Can anyone help me with this? Thanks in advance!

Best answer

As far as I know, you can retrain your model on the new examples, but start from your existing model instead of a blank one.

To do that, first remove the following line from your train_spacy function and have the function receive the model as a parameter instead:

nlp = spacy.blank('en')  # create blank Language class

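Then, to retrain, do not create a blank spaCy model to pass to your training function. Load your existing model with spacy.load and call the training function with it (see the spaCy documentation on saving and loading for more).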

start_training = spacy.load("spacy_start_model") 

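One last suggestion: in my experience, I got better results by retraining the NER component of an existing spaCy model (for example en_core_web_md or en_core_web_lg) and adding my custom entities than by training from scratch on a blank model.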

Putting it all together:

  1. Update the function
import random

import spacy  # this code uses the spaCy v2 API (create_pipe, begin_training)


def train_spacy(data, iterations, nlp):  # <-- Add model as nlp parameter
    TRAIN_DATA = data
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    is_new_ner = 'ner' not in nlp.pipe_names
    if is_new_ner:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        # a loaded model already has an NER pipe; reuse it so `ner` is defined
        ner = nlp.get_pipe('ner')

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        # begin_training() re-initializes the weights, so call it only for a
        # new NER pipe; resume_training() keeps the weights of a loaded model
        optimizer = nlp.begin_training() if is_new_ner else nlp.resume_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)
    return nlp


nlp = spacy.blank('en')  # create blank Language class
start_training = train_spacy(TRAIN_DATA, 20, nlp)
  2. Retrain your model
TRAIN_DATA_2 = [('Who is Chaka Khan?', {"entities": [(7, 17, 'PERSON')]}),
('I like London and Berlin.', {"entities": [(7, 13, 'LOC')]})]

nlp = spacy.load("spacy_start_model") # <-- Now your base model is your custom model
start_training = train_spacy(TRAIN_DATA_2, 20, nlp)
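
One caveat that goes beyond the original answer: updating only on TRAIN_DATA_2 may make the model less accurate on the earlier Product examples, an effect often called catastrophic forgetting. A common mitigation is to mix some of the old examples back into the update set. The sketch below shows one way to build such a set in plain Python; `build_update_set`, `labels_in`, the `old_per_new` ratio, and the shortened `old_data` list are all hypothetical and only illustrate the idea.

```python
import random


def labels_in(data):
    """Collect the set of entity labels used in a spaCy-style training list."""
    return {label for _, ann in data for _, _, label in ann.get("entities", [])}


def build_update_set(old_data, new_data, old_per_new=1, seed=0):
    """Mix the new examples with a random sample of the old ones.

    old_per_new is an assumed ratio: how many old examples to keep
    for every new example. It is a knob to tune, not a recommended value.
    """
    rng = random.Random(seed)
    n_old = min(len(old_data), old_per_new * len(new_data))
    mixed = list(new_data) + rng.sample(list(old_data), n_old)
    rng.shuffle(mixed)
    return mixed


# Shortened stand-in for TRAIN_DATA, plus the new examples from the question.
old_data = [
    ("what is the price of polo?", {"entities": [(21, 25, "Product")]}),
    ("what is the price of ball?", {"entities": [(21, 25, "Product")]}),
    ("what is the price of jeans?", {"entities": [(21, 26, "Product")]}),
]
new_data = [
    ("Who is Chaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC")]}),
]

mixed = build_update_set(old_data, new_data)
print("labels new to the model:", sorted(labels_in(new_data) - labels_in(old_data)))
print("examples in the update set:", len(mixed))
```

The resulting list has the same shape as TRAIN_DATA, so it could be passed in place of TRAIN_DATA_2, for example as train_spacy(mixed, 20, nlp).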

I hope this works for you!

Regarding "python - How do I update a trained spaCy NER model with a new training dataset?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/70391922/
