
python-3.x - BERT document embeddings


I am trying to use BERT for document embedding. The code I use combines two sources: the BERT Document Classification Tutorial with Code and the BERT Word Embeddings Tutorial. In the code below, I feed the first 510 tokens of each document to the BERT model. Finally, I apply K-means clustering to these embeddings (a sketch of that step appears after the code below), but the members of each cluster are completely unrelated. I am wondering how this is possible; maybe something is wrong with my code. I would appreciate it if you could take a look at my code and tell me if there is something wrong with it. I use Google Colab to run this code.

# text_to_embedding function
import torch
from keras.preprocessing.sequence import pad_sequences

def text_to_embedding(tokenizer, model, in_text):
    '''
    Uses the provided BERT 'model' and 'tokenizer' to generate a vector
    representation of the input string, 'in_text'.

    Returns the vector stored as a numpy ndarray.
    '''

    # ===========================
    # STEP 1: Tokenization
    # ===========================

    MAX_LEN = 510

    # 'encode' will:
    # (1) Tokenize the sentence.
    # (2) Prepend the '[CLS]' token to the start.
    # (3) Append the '[SEP]' token to the end.
    # (4) Map tokens to their IDs.
    input_ids = tokenizer.encode(
        in_text,                  # Sentence to encode.
        add_special_tokens=True,  # Add '[CLS]' and '[SEP]'.
        max_length=MAX_LEN,       # Truncate all sentences.
        truncation=True,          # Needed in recent transformers versions
                                  # for max_length to actually truncate.
        #return_tensors='pt'      # Return pytorch tensors.
    )

    # Pad our input tokens. Truncation was handled above by the 'encode'
    # function, which also makes sure that the '[SEP]' token is placed at the
    # end *after* truncating.
    # Note: 'pad_sequences' expects a list of lists, but we only have one
    # piece of text, so we surround 'input_ids' with an extra set of brackets.
    results = pad_sequences([input_ids], maxlen=MAX_LEN, dtype="long",
                            value=0, truncating="post", padding="post")

    # Remove the outer list.
    input_ids = results[0]

    # Create attention masks: 1 for real tokens, 0 for padding.
    attn_mask = [int(i > 0) for i in input_ids]

    # Cast to tensors.
    input_ids = torch.tensor(input_ids)
    attn_mask = torch.tensor(attn_mask)

    # Add an extra dimension for the "batch" (even though there is only one
    # input in this batch).
    input_ids = input_ids.unsqueeze(0)
    attn_mask = attn_mask.unsqueeze(0)

    # ===========================
    # STEP 2: Run the model
    # ===========================

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Copy the inputs to the GPU.
    input_ids = input_ids.to(device)
    attn_mask = attn_mask.to(device)

    # Telling the model not to build the backward graph will make this
    # a little quicker.
    with torch.no_grad():

        # Forward pass, returns hidden states and predictions.
        # This will return the logits rather than the loss because we have
        # not provided labels.
        outputs = model(
            input_ids=input_ids,
            token_type_ids=None,
            attention_mask=attn_mask)

        # Because the model was loaded with output_hidden_states=True, the
        # hidden states of all layers are the third element of the output.
        hidden_states = outputs[2]

    # Sentence Vectors
    # To get a single vector for our entire sentence we have multiple
    # application-dependent strategies, but a simple approach is to
    # average the second-to-last hidden layer of each token, producing
    # a single 768-length vector.
    # `hidden_states` has shape [13 x 1 x ? x 768].

    # `token_vecs` is a tensor with shape [? x 768].
    token_vecs = hidden_states[-2][0]

    # Calculate the average of all ? token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)

    # Move to the CPU and convert to numpy ndarray.
    sentence_embedding = sentence_embedding.detach().cpu().numpy()

    return sentence_embedding


from transformers import BertTokenizer, BertModel

# Run on the GPU if one is available (the question was run on Google Colab).
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load pre-trained model (weights).
model = BertModel.from_pretrained(
    'bert-base-uncased',
    output_hidden_states=True,  # Whether the model returns all hidden states.
)
model.to(device)

# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
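
The clustering step described in the question is not shown. A minimal sketch of what it might look like, assuming `documents` holds the document strings (the variable names, the example texts, and the choice of scikit-learn's KMeans are illustrative, not from the original post):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical list of documents to cluster.
documents = [
    "First example document.",
    "Second example document.",
    "Third example document.",
    "Fourth example document.",
]

# Embed each document with the function defined above.
embeddings = np.vstack(
    [text_to_embedding(tokenizer, model, doc) for doc in documents])

# Cluster the 768-dimensional embeddings; the cluster count is a guess.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(embeddings)
print(kmeans.labels_)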

Best Answer

I don't know if this will solve your problem, but here are my 2 cents:

  • You don't have to compute the attention mask and do the padding manually. Take a look at the documentation. Just call the tokenizer itself:

  • results = tokenizer(in_text, max_length=MAX_LEN, truncation=True,
                        padding='max_length')  # pads to MAX_LEN for you
    input_ids = results.input_ids
    attn_mask = results.attention_mask
    # Cast to tensors
    ...
  • Instead of averaging the second-to-last hidden layer, you can try the same with the last hidden layer. Or you can use the vector that represents [CLS] from the last layer (see the sketch after this list).
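
A minimal sketch of those two alternatives, reusing `outputs` from the forward pass in the question's code (the variable names are illustrative):

# All layers' hidden states; [-1] selects the last hidden layer, and [0]
# selects the single input in the batch: shape [? x 768].
last_layer = outputs[2][-1][0]

# Alternative 1: average the *last* hidden layer instead of the
# second-to-last one.
sentence_embedding = torch.mean(last_layer, dim=0)

# Alternative 2: use the vector of the [CLS] token (position 0) in the
# last layer.
sentence_embedding = last_layer[0]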

A similar question about python-3.x - BERT document embeddings can be found on Stack Overflow: https://stackoverflow.com/questions/63209960/
