
python - Merge multiple BatchEncodings, or create a tensorflow dataset from a list of BatchEncoding objects


In a token classification task, I am using a transformers tokenizer, which outputs objects of the BatchEncoding class. I am tokenizing each text separately because I need to extract the labels from the text and rearrange them after tokenization (due to subtokens). However, I cannot find a way either to create a tensorflow dataset from a list of BatchEncoding objects, or to merge all the BatchEncoding objects into one in order to create the dataset.

Here are the main parts of the code:

tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-uncased')

def extract_labels(raw_text):
    # split text into words and extract the label of each word
    (...)
    return clean_words, labels


def tokenize_text(words, labels):

    # tokenize text
    tokens = tokenizer(words, is_split_into_words=True, padding='max_length', truncation=True, max_length=MAX_LENGTH)

    # since words might be split into subwords, labels need to be re-arranged;
    # only the first subword keeps the label
    (...)
    tokens['labels'] = label_ids

    return tokens


tokens = []
for raw_text in data:
    clean_text, l = extract_labels(raw_text)
    t = tokenize_text(clean_text, l)
    tokens.append(t)


type(tokens[0])
# transformers.tokenization_utils_base.BatchEncoding
tokens[0]
# {'input_ids': [101, 69887, 10112, ..., 0, 0, 0], 'attention_mask': [1, 1, 1, ..., 0, 0, 0], 'labels': [-100, 0, -100, ..., -100, -100, -100]}
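
For reference, the elided realignment step follows the usual Hugging Face pattern for token classification: walk the encoding's word_ids(), keep each word's label only on its first subword, and mask special tokens and remaining subwords with -100 (the index ignored by the loss, matching the output above). A minimal sketch of that pattern, not the original code (labels is the per-word list returned by extract_labels):

word_ids = tokens.word_ids()
label_ids = []
previous_word_id = None
for word_id in word_ids:
    if word_id is None:                  # special tokens and padding
        label_ids.append(-100)
    elif word_id != previous_word_id:    # first subword of a word keeps the label
        label_ids.append(labels[word_id])
    else:                                # remaining subwords are masked
        label_ids.append(-100)
    previous_word_id = word_id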

Update: as requested, here is a basic example to reproduce the issue:

from transformers import BertTokenizerFast
import tensorflow as tf

tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-uncased')
tokens = []
for text in ["Hello there", "Good morning"]:
    t = tokenizer(text.split(), is_split_into_words=True, padding='max_length', truncation=True, max_length=10)
    t['labels'] = list(map(lambda x: 1, t.word_ids()))  # fake labels to simplify the example
    tokens.append(t)

print(type(tokens[0]))  # now tokens is a list of BatchEncodings
print(tokens)

If I tokenize the whole dataset in one call, I get a single BatchEncoding comprising everything, but then I cannot process the labels:

data = ["Hello there", "Good morning"]
tokens = tokenizer(data, padding='max_length', truncation=True, max_length=10)
# now tokens is a single batch encoding comprising the whole dataset
print(type(tokens))
print(tokens)
# This way I can get a tf dataset like this:
tf.data.Dataset.from_tensor_slices(tokens)

Note that I need to iterate over the texts first in order to get the labels, and I need each text's word_ids() to rearrange the labels.

Best Answer

You have a few options. You can use a defaultdict:

from collections import defaultdict
import tensorflow as tf

result = defaultdict(list)
for d in tokens:
    for k, v in d.items():
        result[k].append(v)

dataset = tf.data.Dataset.from_tensor_slices(dict(result))
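
If the goal is to train with Keras's model.fit, a common follow-up (a sketch, not part of the original answer; the key names match the examples above) is to split the labels from the features and batch the dataset:

train_ds = dataset.map(
    lambda x: ({'input_ids': x['input_ids'],
                'attention_mask': x['attention_mask']},
               x['labels'])
).batch(8)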

Or you can use pandas, as shown here:

import pandas as pd
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(pd.DataFrame.from_dict(tokens).to_dict(orient="list"))
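
This works because each BatchEncoding behaves like a dict, so the DataFrame gets one column per key and one row per text, and to_dict(orient="list") turns that back into the column-oriented dict that from_tensor_slices expects. A quick sanity check (assuming the tokens list from the reproduction example above):

merged = pd.DataFrame.from_dict(tokens).to_dict(orient="list")
print(list(merged.keys()))       # ['input_ids', 'token_type_ids', 'attention_mask', 'labels']
print(len(merged['input_ids']))  # 2, one entry per text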

Or build the right structure while preprocessing the data:

from transformers import BertTokenizerFast
from collections import defaultdict
import tensorflow as tf

tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-uncased')
tokens = defaultdict(list)
for text in ["Hello there", "Good morning"]:
    t = tokenizer(text.split(), is_split_into_words=True, padding='max_length', truncation=True, max_length=10)
    tokens['input_ids'].append(t['input_ids'])
    tokens['token_type_ids'].append(t['token_type_ids'])
    tokens['attention_mask'].append(t['attention_mask'])
    t['labels'] = list(map(lambda x: 1, t.word_ids()))  # fake labels to simplify the example
    tokens['labels'].append(t['labels'])

dataset = tf.data.Dataset.from_tensor_slices(dict(tokens))
for x in dataset:
    print(x)
{'input_ids': <tf.Tensor: shape=(10,), dtype=int32, numpy=
array([ 101, 29155, 10768, 102, 0, 0, 0, 0, 0,
0], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(10,), dtype=int32, numpy=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(10,), dtype=int32, numpy=array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=int32)>, 'labels': <tf.Tensor: shape=(10,), dtype=int32, numpy=array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)>}
{'input_ids': <tf.Tensor: shape=(10,), dtype=int32, numpy=
array([ 101, 12050, 17577, 102, 0, 0, 0, 0, 0,
0], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(10,), dtype=int32, numpy=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(10,), dtype=int32, numpy=array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=int32)>, 'labels': <tf.Tensor: shape=(10,), dtype=int32, numpy=array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)>}
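
As a side note beyond the original answer: with a fast tokenizer you can also tokenize all texts in a single call and still recover per-text word IDs via BatchEncoding.word_ids(batch_index), so the label realignment does not strictly require tokenizing one text at a time. A sketch under that assumption, reusing the fake labels from the example:

texts = ["Hello there", "Good morning"]
enc = tokenizer([t.split() for t in texts], is_split_into_words=True,
                padding='max_length', truncation=True, max_length=10)
enc['labels'] = [
    [1 for _ in enc.word_ids(batch_index=i)]  # fake labels, as above
    for i in range(len(texts))
]
dataset = tf.data.Dataset.from_tensor_slices(dict(enc))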

Regarding python - merging multiple BatchEncodings or creating a tensorflow dataset from a list of BatchEncoding objects, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/73024608/
