
python - How to use the Hugging Face Transformers library in TensorFlow for text classification on custom data?

Reposted. Author: 行者123. Updated: 2023-12-03 14:39:10

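I am trying to do binary text classification on custom data (in CSV format) using the different transformer architectures provided by the Hugging Face 'Transformers' library. I am using this TensorFlow blog post as a reference.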

I load the custom dataset into 'tf.data.Dataset' format with the following code:

import tensorflow as tf

def get_dataset(file_path, **kwargs):
    # Build a batched tf.data.Dataset directly from a CSV file.
    dataset = tf.data.experimental.make_csv_dataset(
        file_path,
        batch_size=5,  # Artificially small to make examples easier to show.
        na_value="",
        num_epochs=1,
        ignore_errors=True,
        **kwargs)
    return dataset

After this, when I try to tokenize with the 'glue_convert_examples_to_features' method as follows:

train_dataset = glue_convert_examples_to_features(
    examples=train_data,
    tokenizer=tokenizer,
    task=None,
    label_list=['0', '1'],
    max_length=128
)

it raises the error "UnboundLocalError: local variable 'processor' referenced before assignment" at:

if is_tf_dataset:
    example = processor.get_example_from_tensor_dict(example)
    example = processor.tfds_map(example)

In all the examples I have seen, a task such as 'mrpc' is used, which is predefined and has a glue_processor to handle it. The error is raised at 'line 85' of the source code.
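
The failure pattern can be reproduced without the library. The sketch below is a simplified stand-in and not the Transformers source: `convert`, `try_convert` and `PROCESSORS` are hypothetical names. It assumes, consistent with the traceback above, that `processor` is only assigned when a task is given, so `task=None` combined with a dataset input reaches an unbound local variable.

```python
# Hypothetical registry standing in for the library's predefined GLUE processors.
PROCESSORS = {"mrpc": lambda example: example}

def convert(examples, task=None, is_tf_dataset=True):
    if task is not None:
        processor = PROCESSORS[task]  # only bound when a task is given
    converted = []
    for example in examples:
        if is_tf_dataset:
            # With task=None, `processor` was never assigned -> UnboundLocalError
            example = processor(example)
        converted.append(example)
    return converted

def try_convert(task):
    """Return the converted examples, or the exception class name on failure."""
    try:
        return convert(["some text"], task=task)
    except UnboundLocalError as err:
        return type(err).__name__
```

Calling `try_convert(None)` reports the same exception class as in the question, while `try_convert("mrpc")` succeeds because the predefined task binds a processor.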

Can anyone help solve this problem for 'custom data'?

Best answer

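I had the same problem when getting started.
This Kaggle submission helped me a lot. There you can see how to tokenize the data according to the chosen pre-trained model: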

import numpy as np
from tqdm import tqdm
from transformers import BertTokenizer
from keras.preprocessing.sequence import pad_sequences

bert_model_name = 'bert-base-uncased'

tokenizer = BertTokenizer.from_pretrained(bert_model_name, do_lower_case=True)
MAX_LEN = 128

def tokenize_sentences(sentences, tokenizer, max_seq_len=128):
    tokenized_sentences = []

    for sentence in tqdm(sentences):
        # Recent transformers releases also ask for an explicit truncation=True here.
        tokenized_sentence = tokenizer.encode(
            sentence,                 # Sentence to encode.
            add_special_tokens=True,  # Add '[CLS]' and '[SEP]'.
            max_length=max_seq_len,   # Truncate all sentences.
        )

        tokenized_sentences.append(tokenized_sentence)

    return tokenized_sentences

def create_attention_masks(tokenized_and_padded_sentences):
    attention_masks = []

    for sentence in tokenized_and_padded_sentences:
        # Padding uses id 0, so any id > 0 is a real token.
        att_mask = [int(token_id > 0) for token_id in sentence]
        attention_masks.append(att_mask)

    return np.asarray(attention_masks)

# df_train is a pandas DataFrame loaded elsewhere, with the text in 'comment_text'.
input_ids = tokenize_sentences(df_train['comment_text'], tokenizer, MAX_LEN)
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", value=0, truncating="post", padding="post")
attention_masks = create_attention_masks(input_ids)

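
To make the padding and masking step concrete, here is a minimal plain-Python sketch of what the `pad_sequences` call and `create_attention_masks` do to a single sequence. `pad_post` and `attention_mask` are hypothetical helpers written for illustration, and the token ids are illustrative values rather than real tokenizer output. The mask relies on the padding id being 0, which matches `value=0` above.

```python
def pad_post(ids, max_len, pad_id=0):
    """Truncate to max_len from the end, then right-pad with pad_id."""
    ids = list(ids)[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

def attention_mask(padded_ids, pad_id=0):
    """1 for real tokens, 0 for padding positions."""
    return [int(token_id != pad_id) for token_id in padded_ids]

# Illustrative token ids, not produced by a real tokenizer.
example = [101, 7592, 2088, 102]
padded = pad_post(example, max_len=6)
mask = attention_mask(padded)
```

Here `padded` ends in two zeros and `mask` marks exactly those two positions with 0, so the model ignores them.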
After that, you should split the IDs and masks:

from sklearn.model_selection import train_test_split

# label_cols: the label column names of df_train (defined elsewhere).
labels = df_train[label_cols].values

# Both calls use the same random_state and test_size, so ids and masks are split identically.
train_ids, validation_ids, train_labels, validation_labels = train_test_split(input_ids, labels, random_state=0, test_size=0.1)
train_masks, validation_masks, _, _ = train_test_split(attention_masks, labels, random_state=0, test_size=0.1)

train_size = len(train_ids)
validation_size = len(validation_ids)

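
The two `train_test_split` calls stay aligned only because they share `random_state` and `test_size`. The sketch below shows the same idea with the alignment made explicit: shuffle one list of indices and apply it to every array. `split_indices` and `take` are hypothetical helpers, not scikit-learn functions, and the rounding of the validation size is an assumption that may differ from scikit-learn's. With scikit-learn itself, passing all arrays to a single `train_test_split` call gives the same guarantee.

```python
import random

def split_indices(n, test_size=0.1, seed=0):
    """Shuffle range(n) once and cut it into train and validation indices."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(n * test_size))
    return indices[n_val:], indices[:n_val]

def take(rows, indices):
    """Select rows by position, preserving the order of `indices`."""
    return [rows[i] for i in indices]

# Toy data: row i of every list describes the same example.
ids = [[101, i, 102] for i in range(10)]
masks = [[1, 1, 1] for _ in range(10)]
labels = [i % 2 for i in range(10)]

train_idx, val_idx = split_indices(len(ids))
train_ids, val_ids = take(ids, train_idx), take(ids, val_idx)
train_masks, val_masks = take(masks, train_idx), take(masks, val_idx)
train_labels, val_labels = take(labels, train_idx), take(labels, val_idx)
```

Because every array is indexed by the same `train_idx` and `val_idx`, a row's ids, mask and label cannot drift apart.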
I also looked at the source of glue_convert_examples_to_features. There you can see how a tf.data.Dataset compatible with the BERT model can be created. I wrote a function for this:

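import tensorflow as tf

def create_dataset(ids, masks, labels):
    def gen():
        # Iterate over the arguments, not the global train_ids.
        for i in range(len(ids)):
            yield (
                {
                    "input_ids": ids[i],
                    "attention_mask": masks[i],
                },
                labels[i],
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
            },
            tf.TensorShape([None]),
        ),
    )

# These datasets are unbatched; call .batch(batch_size) on them before model.fit.
train_dataset = create_dataset(train_ids, train_masks, train_labels)
val_dataset = create_dataset(validation_ids, validation_masks, validation_labels)
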
Then I use the dataset like this:

import tensorflow as tf
from transformers import TFBertForSequenceClassification, BertConfig

model = TFBertForSequenceClassification.from_pretrained(
    bert_model_name,
    # num_labels=20 fits this answer's data; set it to your own number of classes.
    config=BertConfig.from_pretrained(bert_model_name, num_labels=20)
)

# Prepare training: compile the tf.keras model with optimizer, loss and metric.
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
# CategoricalCrossentropy expects one-hot label vectors of length num_labels.
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.CategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

# Train and evaluate using tf.keras.Model.fit()
history = model.fit(train_dataset, epochs=1, steps_per_epoch=115, validation_data=val_dataset, validation_steps=7)
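
For the binary case in the question, the labels are the integers 0 and 1 rather than one-hot vectors. One common setup, assumed here and not shown in the answer, is `num_labels=2` together with `tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)`. The plain-Python sketch below computes the per-example cross-entropy from raw logits to show what `from_logits=True` means; `sparse_ce_from_logits` is a hypothetical helper, not a Keras function.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Two-class logits for a single example, with integer labels 0 or 1.
uncertain = sparse_ce_from_logits([0.0, 0.0], label=1)  # equals log(2)
confident = sparse_ce_from_logits([2.0, 0.0], label=0)  # smaller loss
```

The model outputs unnormalized scores, and the softmax is applied inside the loss, which is why the model itself needs no final activation.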

Regarding "python - How to use the Hugging Face Transformers library in TensorFlow for text classification on custom data?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/59978959/
