
pytorch - Token to word mapping in the tokenizer decode step (Hugging Face)?

Reposted. Author: 行者123. Updated: 2023-12-04 11:50:32

Is there a way to know the mapping from each token back to its original word in the tokenizer.decode() function? For example:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)

text = "This is a tokenization example"
tokenized = tokenizer.tokenize(text)
## ['this', 'Ġis', 'Ġa', 'Ġtoken', 'ization', 'Ġexample']

encoded = tokenizer.encode_plus(text)
## encoded['input_ids'] = [0, 42, 16, 10, 19233, 1938, 1246, 2]

decoded = tokenizer.decode(encoded['input_ids'])
## '<s> this is a tokenization example</s>'

The goal is to have a function that maps each token produced during decode back to the correct input word. Here the desired output would be desired_output = [[1], [2], [3], [4, 5], [6]], since 'this' corresponds to id 42, while 'tokenization' corresponds to ids [19233, 1938] at indices 4 and 5 of the input_ids array.
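To make the goal concrete: with such a mapping you could slice input_ids back into per-word id groups. A minimal pure-Python sketch, with the ids hard-coded from the example above (no tokenizer required):

```python
input_ids = [0, 42, 16, 10, 19233, 1938, 1246, 2]
desired_output = [[1], [2], [3], [4, 5], [6]]

# Group the ids belonging to each original word.
per_word_ids = [[input_ids[i] for i in idxs] for idxs in desired_output]
print(per_word_ids)  # [[42], [16], [10], [19233, 1938], [1246]]
```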

Best Answer

As far as I know, there is no built-in method for this, but you can create one yourself:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)

example = "This is a tokenization example"

print({x: tokenizer.encode(x, add_special_tokens=False, add_prefix_space=True) for x in example.split()})

Output:
{'This': [42], 'is': [16], 'a': [10], 'tokenization': [19233, 1938], 'example': [1246]}
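As a quick sanity check, flattening these per-word id lists and wrapping them with RoBERTa's special tokens (0 for `<s>`, 2 for `</s>`) reproduces encoded['input_ids'] from the question. This is a plain-Python illustration with the ids hard-coded from the output above:

```python
per_word = {'This': [42], 'is': [16], 'a': [10],
            'tokenization': [19233, 1938], 'example': [1246]}

# Flatten the per-word id lists in order (dicts preserve insertion order).
flat = [i for ids in per_word.values() for i in ids]
input_ids = [0] + flat + [2]  # 0 = <s>, 2 = </s> for RoBERTa
print(input_ids)  # [0, 42, 16, 10, 19233, 1938, 1246, 2]
```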

To get exactly the desired output, you have to keep a running index over the token ids:

# Start the index at 1 because the number of leading special tokens is fixed
# for each model (but be aware of single-sentence vs. sentence-pair input).
idx = 1

enc = [tokenizer.encode(x, add_special_tokens=False, add_prefix_space=True) for x in example.split()]

desired_output = []

for token in enc:
    tokenoutput = []
    for ids in token:
        tokenoutput.append(idx)
        idx += 1
    desired_output.append(tokenoutput)

print(desired_output)

Output:
[[1], [2], [3], [4, 5], [6]]
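The index bookkeeping above can be packaged into a small standalone helper. This is a pure-Python sketch (no transformers needed); `map_tokens_to_words` is a hypothetical name, and its input is a list of per-word token-id lists like the `enc` variable above:

```python
def map_tokens_to_words(word_token_ids, start_idx=1):
    """For each word, return the positions its tokens occupy in input_ids.

    word_token_ids: list of per-word token-id lists (e.g. `enc` above).
    start_idx: offset for the leading special tokens (1 for a single <s>).
    """
    mapping = []
    idx = start_idx
    for ids in word_token_ids:
        mapping.append(list(range(idx, idx + len(ids))))
        idx += len(ids)
    return mapping

# Token ids from the example: "This is a tokenization example"
enc = [[42], [16], [10], [19233, 1938], [1246]]
print(map_tokens_to_words(enc))  # [[1], [2], [3], [4, 5], [6]]
```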

Regarding "pytorch - Token to word mapping in the tokenizer decode step (Hugging Face)?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62317723/
