
nlp - Convert a dataset to CoNLL format. Label the remaining tokens with O

Reposted. Author: 行者123. Updated: 2023-12-05 04:19:53

I have a manually annotated dataset whose records have the following format:

{
    "id": 1,
    "text": "At the end of each fiscal quarter, for the four consecutive fiscal quarters ending as of such fiscal quarter end, from the date of the Third Amendment and until December 30, 1996, the Company shall maintain a fixed charge coverage ratio of not less than 1.25 to 1.0.",
    "label": [
        [
            209,
            230,
            "COV_3"
        ],
        [
            379,
            390,
            "VAL_3"
        ]
    ]
}

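In the example above, "label" holds the custom entities I have in my dataset. Here, the phrase fixed charge coverage is at position [309, 336] and is given the label COV_3. Likewise, the phrase 1.25 to 1.0 is at [379, 390] and is given the label VAL_3.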

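Now I want to fine-tune some transformer models, such as BERT, on this dataset. However, I realized that the dataset has to be in CoNLL format, or at least every token of every data point has to be labeled. Is there an easy way to label the remaining tokens with the tag "O", or to convert this dataset to CoNLL format?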

Best answer

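You can use spaCy to tokenize the text and convert the character-offset annotations to IOB tags with its built-in utility methods. Note that this skips any span that does not align with token boundaries, so you may need to customize the tokenizer or provide tokenization from another source when creating the Doc.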
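To illustrate the alignment issue: by default `Doc.char_span` returns `None` when the offsets fall inside a token. Its `alignment_mode` argument can snap such a span to token boundaries instead of skipping it. This is a minimal sketch, assuming spaCy v3 is installed; the short text and the offsets are made up for illustration and are not taken from the dataset.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("a fixed charge coverage ratio")

# Hypothetical annotation that ends in the middle of the token "coverage":
# doc.text[2:20] is "fixed charge cover"
start, end = 2, 20

# Default mode is "strict": a misaligned span yields None
strict = doc.char_span(start, end, label="COV_3")

# "expand" grows the span to cover every token it touches
expanded = doc.char_span(start, end, label="COV_3", alignment_mode="expand")

# "contract" shrinks the span to the tokens that lie fully inside it
contracted = doc.char_span(start, end, label="COV_3", alignment_mode="contract")

print(strict)
print(expanded.text)
print(contracted.text)
```

Whether expanding or contracting is appropriate depends on the annotation scheme, so it is worth logging every span that gets adjusted and reviewing a sample by hand.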

The character offsets in the question did not match the text, so they have been corrected in the code below.
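The mismatch can be checked without spaCy by slicing the text directly. The sketch below is plain Python; `find_offsets` is a hypothetical helper (not part of any library) that returns the offsets of the first occurrence of a phrase, which is only safe when the phrase occurs once in the text, as it does here.

```python
text = "At the end of each fiscal quarter, for the four consecutive fiscal quarters ending as of such fiscal quarter end, from the date of the Third Amendment and until December 30, 1996, the Company shall maintain a fixed charge coverage ratio of not less than 1.25 to 1.0."


def find_offsets(text, phrase):
    """Return (start, end) of the first occurrence of phrase, or None."""
    start = text.find(phrase)
    if start == -1:
        return None
    return start, start + len(phrase)


# The text is shorter than 379 characters, so this slice is empty
print(len(text), repr(text[379:390]))

# Recover the offsets from the phrases themselves
for phrase in ["fixed charge coverage", "1.25 to 1.0"]:
    start, end = find_offsets(text, phrase)
    print(start, end, repr(text[start:end]))
```

Running a check like this over the whole dataset before conversion makes it easier to tell annotation errors apart from tokenization problems.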

# tested with spacy v3.4.3, should work with spacy v3.x
import spacy
from spacy.training.iob_utils import biluo_to_iob, doc_to_biluo_tags

data = {
    "id": 1,
    "text": "At the end of each fiscal quarter, for the four consecutive fiscal quarters ending as of such fiscal quarter end, from the date of the Third Amendment and until December 30, 1996, the Company shall maintain a fixed charge coverage ratio of not less than 1.25 to 1.0.",
    "label": [[209, 230, "COV_3"], [254, 265, "VAL_3"]],
}

nlp = spacy.blank("en")

# tokenize the text to create a doc
doc = nlp(data["text"])

# convert annotation to entity spans and add them to the doc
ents = []
for start, end, label in data["label"]:
    span = doc.char_span(start, end, label=label)
    if span is not None:
        ents.append(span)
    else:
        print(
            "Skipping span (does not align to tokens):",
            start,
            end,
            label,
            doc.text[start:end],
        )
doc.ents = ents

# convert doc annotation to IOB tags
for token, iob_tag in zip(doc, biluo_to_iob(doc_to_biluo_tags(doc))):
    print(token.text + " " + iob_tag)

Output:

At O
the O
end O
of O
each O
fiscal O
quarter O
, O
for O
the O
four O
consecutive O
fiscal O
quarters O
ending O
as O
of O
such O
fiscal O
quarter O
end O
, O
from O
the O
date O
of O
the O
Third O
Amendment O
and O
until O
December O
30 O
, O
1996 O
, O
the O
Company O
shall O
maintain O
a O
fixed B-COV_3
charge I-COV_3
coverage I-COV_3
ratio O
of O
not O
less O
than O
1.25 B-VAL_3
to I-VAL_3
1.0 I-VAL_3
. O

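These are the 1st and 4th columns of the 4-column CoNLL 2003 format. You may want to insert blank lines for sentence boundaries or add the special document-boundary lines, and you may need real or placeholder values for the 2nd and 3rd columns (the tag and chunk columns) in order to use the output with other tools.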
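The remaining formatting steps can be sketched in plain Python. `to_conll2003` is a hypothetical helper, not part of spaCy, and the use of `-X-` as the placeholder for the 2nd and 3rd columns is an assumption borrowed from the `-DOCSTART- -X- -X- O` document-boundary line; check what your downstream tool expects. The two "sentences" are illustrative fragments of the example above, split by hand only to show the blank-line separator.

```python
def to_conll2003(sentences, placeholder="-X-", docstart=True):
    """Render sentences of (token, ner_tag) pairs as 4-column CoNLL 2003 text."""
    lines = []
    if docstart:
        # Special document-boundary line, followed by a blank line
        lines.append("-DOCSTART- -X- -X- O")
        lines.append("")
    for sentence in sentences:
        for token, ner_tag in sentence:
            # Columns: token, placeholder tag, placeholder chunk, NER tag
            lines.append(f"{token} {placeholder} {placeholder} {ner_tag}")
        lines.append("")  # a blank line marks the sentence boundary
    return "\n".join(lines)


sentences = [
    [("a", "O"), ("fixed", "B-COV_3"), ("charge", "I-COV_3"),
     ("coverage", "I-COV_3"), ("ratio", "O")],
    [("than", "O"), ("1.25", "B-VAL_3"), ("to", "I-VAL_3"),
     ("1.0", "I-VAL_3"), (".", "O")],
]

conll_text = to_conll2003(sentences)
print(conll_text)
```

In practice the `(token, tag)` pairs would come from zipping the `Doc` with the IOB tags, grouped per sentence if sentence boundaries are available.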

Regarding "nlp - Convert a dataset to CoNLL format. Label the remaining tokens with O", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/74664286/
