
pytorch - Token to word mapping in the tokenizer decode step (Hugging Face)?

Reposted. Author: 行者123. Updated: 2023-12-04 11:50:32

Is there a way to know the mapping from each token back to its original word in the tokenizer.decode() function? For example:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)

text = "This is a tokenization example"
tokenized = tokenizer.tokenize(text)
## ['this', 'Ġis', 'Ġa', 'Ġtoken', 'ization', 'Ġexample']

encoded = tokenizer.encode_plus(text)
## encoded['input_ids'] = [0, 42, 16, 10, 19233, 1938, 1246, 2]

decoded = tokenizer.decode(encoded['input_ids'])
## '<s> this is a tokenization example</s>'

The goal is to have a function that maps each token produced during decode back to the correct input word. Here the desired output would be desired_output = [[1], [2], [3], [4, 5], [6]], since 'this' corresponds to id 42, while 'tokenization' corresponds to ids [19233, 1938] at indices 4 and 5 of the input_ids array.
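To make the goal concrete: with such a mapping you could slice input_ids back into per-word id groups. A minimal pure-Python sketch, with the ids hard-coded from the example above (no tokenizer required):

```python
input_ids = [0, 42, 16, 10, 19233, 1938, 1246, 2]
desired_output = [[1], [2], [3], [4, 5], [6]]

# Group the ids belonging to each original word.
per_word_ids = [[input_ids[i] for i in idxs] for idxs in desired_output]
print(per_word_ids)  # [[42], [16], [10], [19233, 1938], [1246]]
```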

Best Answer

As far as I know, there is no built-in method for this, but you can create one yourself:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)

example = "This is a tokenization example"

print({x: tokenizer.encode(x, add_special_tokens=False, add_prefix_space=True) for x in example.split()})

Output:
{'This': [42], 'is': [16], 'a': [10], 'tokenization': [19233, 1938], 'example': [1246]}
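As a quick sanity check, flattening these per-word id lists and wrapping them with RoBERTa's special tokens (0 for `<s>`, 2 for `</s>`) reproduces encoded['input_ids'] from the question. This is a plain-Python illustration with the ids hard-coded from the output above:

```python
per_word = {'This': [42], 'is': [16], 'a': [10],
            'tokenization': [19233, 1938], 'example': [1246]}

# Flatten the per-word id lists in order (dicts preserve insertion order).
flat = [i for ids in per_word.values() for i in ids]
input_ids = [0] + flat + [2]  # 0 = <s>, 2 = </s> for RoBERTa
print(input_ids)  # [0, 42, 16, 10, 19233, 1938, 1246, 2]
```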

To get exactly the desired output, you have to keep a running index over the token ids:

# Start the index at 1 because the number of leading special tokens is fixed
# for each model (but be aware of single-sentence vs. sentence-pair input).
idx = 1

enc = [tokenizer.encode(x, add_special_tokens=False, add_prefix_space=True) for x in example.split()]

desired_output = []

for token in enc:
    tokenoutput = []
    for ids in token:
        tokenoutput.append(idx)
        idx += 1
    desired_output.append(tokenoutput)

print(desired_output)

Output:
[[1], [2], [3], [4, 5], [6]]
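The index bookkeeping above can be packaged into a small standalone helper. This is a pure-Python sketch (no transformers needed); `map_tokens_to_words` is a hypothetical name, and its input is a list of per-word token-id lists like the `enc` variable above:

```python
def map_tokens_to_words(word_token_ids, start_idx=1):
    """For each word, return the positions its tokens occupy in input_ids.

    word_token_ids: list of per-word token-id lists (e.g. `enc` above).
    start_idx: offset for the leading special tokens (1 for a single <s>).
    """
    mapping = []
    idx = start_idx
    for ids in word_token_ids:
        mapping.append(list(range(idx, idx + len(ids))))
        idx += len(ids)
    return mapping

# Token ids from the example: "This is a tokenization example"
enc = [[42], [16], [10], [19233, 1938], [1246]]
print(map_tokens_to_words(enc))  # [[1], [2], [3], [4, 5], [6]]
```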

Regarding "pytorch - Token to word mapping in the tokenizer decode step (Hugging Face)?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62317723/
