
python - spaCy (v3.0): what is the difference between `nlp.make_doc(text)` and `nlp(text)`? Why use `nlp.make_doc(text)` when training?


I know we are supposed to create Example objects and pass them to the nlp.update() method. Following the example in the docs, we have

for raw_text, entity_offsets in train_data:
    doc = nlp.make_doc(raw_text)
    example = Example.from_dict(doc, {"entities": entity_offsets})
    nlp.update([example], sgd=optimizer)

And looking at the source code, the make_doc() method appears to do little more than tokenize the input text.
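A minimal sketch of that difference (a blank English pipeline is assumed here; with a trained pipeline, nlp(text) would additionally carry the components' predictions):

```python
import spacy

nlp = spacy.blank("en")
text = "I work at Berlin ."

doc_tok = nlp.make_doc(text)  # tokenization only; no pipeline components run
doc_full = nlp(text)          # tokenization plus every component in nlp.pipe_names

# Both Docs hold the same tokens...
assert [t.text for t in doc_tok] == [t.text for t in doc_full]
# ...but make_doc() never ran the pipeline, so no predicted annotations
# (entities, tags, parses) are set on doc_tok.
assert len(doc_tok.ents) == 0
```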

But an Example object is supposed to hold both the reference ("gold-standard") values and the predicted values. How does that information end up in the Doc when we call nlp.make_doc()?

Furthermore, when I try to get the predicted entity labels out of the Example object (using a trained nlp pipeline), I get no entities at all (although I do if I create the Doc with nlp(text)). And if I try to use nlp(text) instead of nlp.make_doc(text), training crashes with:

...
>>> spacy.pipeline._parser_internals.ner.BiluoPushDown.set_costs()
ValueError()

Best Answer

Feel free to ask this kind of question on the GitHub discussion board as well. And thank you for taking the time to think about the problem and read some of the code before asking; I wish every question were like this.

Anyway. I think the Example.from_dict() constructor may be getting in the way of understanding how the class works. Does this make it clearer?

from spacy.tokens import Doc, Span
from spacy.training import Example
import spacy
nlp = spacy.blank("en")

# Build a reference Doc object, representing the gold standard.
y = Doc(
    nlp.vocab,
    words=["I", "work", "at", "Berlin!", ".", "It", "'s", "a", "hipster", "bar", "."],
)
# There are other ways we could set up the Doc object, including just passing
# stuff into the constructor. I wanted to show modifying the Doc to set annotations.
ent_start = y.text.find("Berlin!")
assert ent_start != -1
ent_end = ent_start + len("Berlin!")
y.ents = [y.char_span(ent_start, ent_end, label="ORG")]
# Okay, so we have our gold-standard, aka reference aka y, Doc object.
# Now, at runtime we won't necessarily be tokenizing that input text that way.
# It's a weird entity. If we only learn from the gold tokens, we can never learn
# to tag this correctly, no matter how many examples we see, if the predicted tokens
# don't match this tokenization. Because we'll always be learning from "Berlin!" but
# seeing "Berlin", "!" at runtime. We'll have train/test skew. Since spaCy cares how
# it does on actual text, not just on the benchmark (which is usually run with
# gold tokens), we want to train from samples that have the runtime tokenization. So
# the Example object holds a pair (x, y), where the x is the input.
x = nlp.make_doc(y.text)
example = Example(x, y)
# Show the aligned gold-standard NER tags. These should have the entity as B-ORG L-ORG.
print(example.get_aligned_ner())
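For comparison, Example.from_dict() from the question's snippet packs both steps together: it builds the reference Doc for you from an annotations dict, aligned against the predicted tokenization. A sketch with illustrative text and offsets:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
raw_text = "I work at Berlin!"
# Character offsets of the entity "Berlin!" in raw_text (illustrative).
entity_offsets = [(10, 17, "ORG")]

# The Doc passed in becomes the predicted side (x); from_dict() constructs
# the gold-standard side (y) from the annotations dict.
x = nlp.make_doc(raw_text)
example = Example.from_dict(x, {"entities": entity_offsets})
tags = example.get_aligned_ner()
print(tags)
```

Since the tokenizer splits "Berlin!" into "Berlin" and "!", the character span is aligned across both predicted tokens, giving the B-ORG/L-ORG pair the answer mentions.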

Another piece of information that might help explain this: the pipeline components try to handle partial annotations, so that you can have rules that preset some of the entities. That is what happens when you have a fully annotated Doc as x: those annotations are treated as part of the input, and when the model is left with no valid actions it tries to construct the best action sequence it can learn from. The usability of this situation could be improved.
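That point can be seen directly: any entities already set on the x Doc travel into the Example as preset/predicted annotations, which is what happens when x comes from nlp(text) with a trained NER (the offsets and labels below are illustrative):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
text = "I work at Berlin ."

# Gold standard: "Berlin" is an ORG.
y = nlp.make_doc(text)
y.ents = [y.char_span(10, 16, label="ORG")]

# Input with a preset, conflicting annotation, as a trained nlp(text) might produce.
x = nlp.make_doc(text)
x.ents = [x.char_span(10, 16, label="GPE")]

example = Example(x, y)
# x's annotations live on example.predicted, y's on example.reference.
print([(e.text, e.label_) for e in example.predicted.ents])
print([(e.text, e.label_) for e in example.reference.ents])
```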

Regarding python - spaCy (v3.0): what is the difference between `nlp.make_doc(text)` and `nlp(text)`? Why use `nlp.make_doc(text)` when training? — we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66093326/
