
Efficiently computing a batch of results given a batch assignment vector and series of corresponding matrices

Translated | Author: bug小助手 | Updated: 2023-10-26 21:40:01



I have a 1D tensor of tokens that belong to different batches. The batch sizes are uneven. Each batch needs to be multiplied by a corresponding weight matrix. My current approach uses a batch pointer vector, a series of distinct weight matrices corresponding to the unique pointers, and a for loop. I want to efficiently compute a result of shape [num_tokens, output_dim], where each weight matrix has shape [input_dim, output_dim]. I also pad the inputs to a multiple of 8 to harness NVIDIA Tensor Cores.
Here's an example:



import torch

# ptr assigns each token to a batch; shape [num_tokens,]
input_dim, output_dim = 4, 8
ptr = torch.tensor([0, 1, 1, 2, 2, 2, 3, 3, 3, -1, -1, -1])  # -1 means padding
features = torch.randn(ptr.shape[0], input_dim)
weights = [torch.randn(input_dim, output_dim) for _ in range(4)]

unique = torch.unique(ptr, sorted=False, return_inverse=False, return_counts=False)
unique = unique[unique != -1]  # ignore padding

results = []

for i in unique:
    split = features[ptr == i, :]
    # pad each split to a multiple of 8 for NVIDIA A100 Tensor Cores;
    # the pad rows hold random values and are masked out afterwards
    pad = (
        torch.empty((-split.size(0)) % 8, split.size(-1))
        .uniform_()
        .to(split.device)
    )
    padded_split = torch.cat((split, pad), dim=0)
    attn_mask = torch.cat(
        (torch.ones(split.size(0)), torch.zeros(pad.size(0)))
    ).to(torch.bool)

    # forward pass: multiply this batch's padded tokens by its weight matrix
    result = padded_split @ weights[i]
    # strip padding so I can create a 2D result tensor of correct dimension again
    results.append(result[attn_mask, :])

results = torch.cat(results, dim=0)  # shape [num_valid_tokens, output_dim]

The above approach causes a significant slowdown in my code, and it runs in the forward pass of model inference. I suspect the padding is to blame. I was looking into scatter operations as a solution, using ptr as an index vector, but the available implementations only support basic reductions like sum, mean, and max.
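One way to avoid both the Python loop and the padding entirely is to gather a weight matrix per token and issue a single batched matmul. This is only a sketch of that gather idea, not code from the question; it trades the loop for the memory cost of materializing one [input_dim, output_dim] matrix per token:

```python
import torch

input_dim, output_dim = 4, 8
ptr = torch.tensor([0, 1, 1, 2, 2, 2, 3, 3, 3, -1, -1, -1])  # -1 means padding
features = torch.randn(ptr.shape[0], input_dim)
weights = [torch.randn(input_dim, output_dim) for _ in range(4)]

W = torch.stack(weights)        # [num_batches, input_dim, output_dim]
valid = ptr != -1               # drop padding tokens up front
per_token_w = W[ptr[valid]]     # [num_valid, input_dim, output_dim]
# one batched matmul over all tokens: [num_valid, 1, input_dim] @ [num_valid, input_dim, output_dim]
out = torch.bmm(features[valid].unsqueeze(1), per_token_w).squeeze(1)
# out has shape [num_valid, output_dim]
```

An equivalent formulation is `torch.einsum('ni,nio->no', features[valid], per_token_w)`; whether either beats the loop depends on how the cost of the weight gather compares to the per-batch matmuls for your batch sizes.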



How can I optimize my approach?



More comments

Welcome to SO; code optimization questions are off-topic here, and they should be posted to Code Review SE instead. Please see Where should code optimization questions be asked?


@desertnaut what are you on about? meta.stackoverflow.com/q/412875


