
python - spaCy documentation for [orth, pos, tag, lemma and text]

Reposted. Author: 太空狗. Updated: 2023-10-29 18:01:20

I am new to spaCy. I am adding this post as documentation and keeping it simple for newcomers like me.

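import spacy

# 'en' is a legacy shortcut name; newer spaCy releases expect a full package name such as 'en_core_web_sm'.
nlp = spacy.load('en')
doc = nlp(u'KEEP CALM because TOGETHER We Rock !')
for word in doc:
    # Attributes without a trailing underscore are integer IDs; those with one are strings.
    print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
    print(word.orth_)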

I would like to understand what orth, lemma, tag and pos mean. This code also prints the values, so what is the difference between print(word) and print(word.orth_)?

Best answer

What is the meaning of orth, lemma, tag and pos?

See https://spacy.io/docs/usage/pos-tagging#pos-schemes

What is the difference between print(word) and print(word.orth_)?

In short:

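word.orth_ and word.text are the same. The trailing underscore marks the string form of an attribute: word.orth is an integer ID, and word.orth_ is the string that ID stands for.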

In brief:

When you access the word.orth_ property at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L537 , it looks the word up in the index that holds the strings of the whole vocabulary:

property orth_:
    def __get__(self):
        return self.vocab.strings[self.c.lex.orth]

(For details, see the explanation of self.c.lex.orth in the "In long" part below.)

word.text returns the string representation of the word and is only a wrapper around the orth_ property, see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L128

property text:
    def __get__(self):
        return self.orth_
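
The relationship between orth, orth_ and text can be mimicked in plain Python. This is only a toy sketch, not spaCy's code: ToyStringStore and ToyToken are hypothetical names, and the sequential integer IDs are a simplifying assumption (spaCy's own ID scheme may differ).

```python
class ToyStringStore:
    """Hypothetical stand-in for a two-way string <-> integer ID table."""

    def __init__(self):
        self._ids = {}       # string -> integer ID
        self._strings = []   # integer ID -> string

    def add(self, string):
        # Reuse the existing ID if the string has been seen before.
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def __getitem__(self, key):
        # An integer key returns the string; a string key returns its ID.
        if isinstance(key, int):
            return self._strings[key]
        return self._ids[key]


class ToyToken:
    """Hypothetical token exposing orth, orth_ and text like the properties above."""

    def __init__(self, strings, word):
        self.strings = strings
        self.orth = strings.add(word)      # integer ID

    @property
    def orth_(self):
        return self.strings[self.orth]     # string looked up from the ID

    @property
    def text(self):
        return self.orth_                  # thin wrapper around orth_


strings = ToyStringStore()
token = ToyToken(strings, "KEEP")
print(token.orth, token.orth_, token.text)
```

In this sketch the only stored value is the integer; both string attributes are computed from it on access, which is why orth_ and text always agree.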

When you call print(word), Python calls the __str__ dunder method (which __repr__ also delegates to). It returns word.__unicode__ or word.__bytes__, and both point back to word.text, see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L55

cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset

    def __hash__(self):
        return hash((self.doc, self.i))

    def __len__(self):
        """
        Number of unicode characters in token.text.
        """
        return self.c.lex.length

    def __unicode__(self):
        return self.text

    def __bytes__(self):
        return self.text.encode('utf8')

    def __str__(self):
        if is_config(python3=True):
            return self.__unicode__()
        return self.__bytes__()

    def __repr__(self):
        return self.__str__()
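
The delegation chain in the class above can be tried out without spaCy. The following is a minimal Python 3 sketch; PrintableToken is a hypothetical class that keeps only the text attribute and the two dunder methods, and it leaves out the Python 2 branch (__unicode__ / __bytes__).

```python
class PrintableToken:
    """Hypothetical class mirroring the __str__/__repr__ delegation above."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text

    def __repr__(self):
        # Same delegation as in the quoted Token class: repr falls back to str.
        return self.__str__()


word = PrintableToken("Rock")
print(word)      # print() uses str(word), i.e. __str__
print([word])    # containers use repr() on their items, i.e. __repr__
```

Because __repr__ delegates to __str__, the object shows up as its plain text in both cases, which matches what print(word) gives for a spaCy token.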

In long:

Let's walk through this step by step:

>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp(u'This is a foo bar sentence.')
>>> type(doc)
<type 'spacy.tokens.doc.Doc'>

After the sentence is passed into the nlp() function, it produces a spacy.tokens.doc.Doc object. From the docs:

cdef class Doc:
    """
    A sequence of `Token` objects. Access sentences and named entities,
    export annotations to numpy arrays, losslessly serialize to compressed
    binary strings.

    Aside: Internals
        The `Doc` object holds an array of `TokenC` structs.
        The Python-level `Token` and `Span` objects are views of this
        array, i.e. they don't own the data themselves.

    Code: Construction 1
        doc = nlp.tokenizer(u'Some text')

    Code: Construction 2
        doc = Doc(nlp.vocab, orths_and_spaces=[(u'Some', True), (u'text', True)])
    """

So the spacy.tokens.doc.Doc object is a sequence of spacy.tokens.token.Token objects. Inside the Token object we see a wave of cython property declarations, e.g. at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162

property orth:
    def __get__(self):
        return self.c.lex.orth

Tracing back, we see that self.c = &self.doc.c[offset]:

cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset

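Without thorough documentation, we don't really know what self.c means, but from the looks of it, it is accessing one of the tokens within &self.doc, where self.doc is the Doc doc that was passed into the __cinit__ function. So most probably it is a shortcut for accessing the tokens.

Looking at Doc.c: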

cdef class Doc:
    def __init__(self, Vocab vocab, words=None, spaces=None, orths_and_spaces=None):
        self.vocab = vocab
        size = 20
        self.mem = Pool()
        # Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds
        # However, we need to remember the true starting places, so that we can
        # realloc.
        data_start = <TokenC*>self.mem.alloc(size + (PADDING*2), sizeof(TokenC))
        cdef int i
        for i in range(size + (PADDING*2)):
            data_start[i].lex = &EMPTY_LEXEME
            data_start[i].l_edge = i
            data_start[i].r_edge = i
        self.c = data_start + PADDING

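Now we see that Doc.c refers to the cython pointer array data_start, which is allocated to hold the tokens of the spacy.tokens.doc.Doc object (please correct me if my reading of <TokenC*> is wrong).

So going back to self.c = &self.doc.c[offset], it is basically accessing the memory location where that array is stored and, more specifically, the item at position offset in the array.

And that is what a spacy.tokens.token.Token is.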
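
The "token as a view into the doc's array" idea can be imitated in plain Python. This is a rough sketch under stated assumptions, not spaCy's implementation: ToyTokenC, ToyDoc and ToyTokenView are hypothetical names, the PADDING value is chosen for illustration, and a Python object reference stands in for the C pointer.

```python
from dataclasses import dataclass

PADDING = 5  # assumed value, for illustration only


@dataclass
class ToyTokenC:
    """Hypothetical stand-in for one TokenC struct."""
    lex: str = ""


class ToyDoc:
    def __init__(self, words):
        # Allocate padding slots on both sides of the real tokens.
        self._data = [ToyTokenC() for _ in range(len(words) + PADDING * 2)]
        for i, word in enumerate(words):
            self._data[PADDING + i].lex = word
        self.length = len(words)

    def c(self, offset):
        # Similar in spirit to `data_start + PADDING` followed by [offset].
        return self._data[PADDING + offset]


class ToyTokenView:
    """A view: keeps a reference into the doc's array and owns no data."""

    def __init__(self, doc, offset):
        self.doc = doc
        self.i = offset
        self.c = doc.c(offset)   # plays the role of self.c = &self.doc.c[offset]

    @property
    def text(self):
        return self.c.lex


doc = ToyDoc(["This", "is", "a", "foo", "bar", "sentence", "."])
token = ToyTokenView(doc, 3)
print(token.text)

# Changing the struct held by the doc is visible through the view.
doc.c(3).lex = "baz"
print(token.text)
```

The second print reflects the change made through the doc, which is the practical meaning of "they don't own the data themselves" in the Doc docstring. The padding also shows why a slightly out-of-range offset such as -1 still lands on an allocated (empty) slot.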


Going back to the property:

property orth:
    def __get__(self):
        return self.c.lex.orth

We see that self.c.lex is accessing data_start[i].lex from the spacy.tokens.doc.Doc, and self.c.lex.orth is simply an integer that identifies the word in the internal vocabulary kept by the spacy.tokens.doc.Doc.

Thus, we see that property orth_ looks up self.vocab.strings with the index taken from self.c.lex.orth, see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162

property orth_:
    def __get__(self):
        return self.vocab.strings[self.c.lex.orth]
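
One consequence of this lookup is that the integer belongs to the word itself, not to its position in the text. The sketch below illustrates that with a toy vocabulary; ToyLexeme and ToyVocab are hypothetical names, and the sequential IDs are an assumption made for readability rather than spaCy's actual scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyLexeme:
    """Hypothetical vocabulary entry shared by every token of the same word."""
    orth: int      # integer ID of the word form
    length: int    # number of characters


class ToyVocab:
    def __init__(self):
        self.strings = []     # integer ID -> string
        self._lexemes = {}    # string -> ToyLexeme

    def get(self, word):
        # Create the entry on first sight, then reuse it.
        if word not in self._lexemes:
            self.strings.append(word)
            self._lexemes[word] = ToyLexeme(len(self.strings) - 1, len(word))
        return self._lexemes[word]


vocab = ToyVocab()
words = "a foo is a foo".split()
lexemes = [vocab.get(w) for w in words]   # one lexeme reference per token

for position, lex in enumerate(lexemes):
    # orth depends on the word, not on where it occurs in the text.
    print(position, lex.orth, vocab.strings[lex.orth])
```

Both occurrences of "foo" point at the same entry and therefore report the same orth, and the string is recovered from the vocabulary only when it is asked for.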

Regarding "python - spaCy documentation for [orth, pos, tag, lemma and text]", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43990617/
