
python - How to make sure PyTorch has deallocated GPU memory?


Suppose we have a function like this:

def trn_l(self, totall_lc, totall_lw, totall_li, totall_lr):
    # move the model to the GPU and prepare for gradient accumulation
    self.model_large.cuda()
    self.model_large.train()
    self.optimizer_large.zero_grad()

    # accumulate gradients over `fake_batch` mini-batches
    for fb in range(self.fake_batch):
        val_x, val_y = next(self.valid_loader)
        val_x, val_y = val_x.cuda(), val_y.cuda()

        logits_main, emsemble_logits_main = self.model_large(val_x)
        cel = self.criterion(logits_main, val_y)
        loss_weight = cel / self.fake_batch
        loss_weight.backward(retain_graph=False)

        # detach intermediate tensors and move them back to the CPU
        cel = cel.cpu().detach()
        emsemble_logits_main = emsemble_logits_main.cpu().detach()
        totall_lw += float(loss_weight.item())
        val_x = val_x.cpu().detach()
        val_y = val_y.cpu().detach()
        loss_weight = loss_weight.cpu().detach()

    # apply the accumulated gradients, then move the model off the GPU
    self._clip_grad_norm(self.model_large)
    self.optimizer_large.step()
    self.model_large.train(mode=False)
    self.model_large = self.model_large.cpu()
    return totall_lc, totall_lw, totall_li, totall_lr
On the first call it allocates 8 GB of GPU memory. On subsequent calls no new memory is allocated, yet the 8 GB remain occupied. I would like the function to hold 0 GB of GPU memory (or as little as possible) after it has been called and produced its first result.
What I have tried: using retain_graph=False and calling .cpu().detach() everywhere; neither had any positive effect.
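The snapshots below are in the format produced by torch.cuda.memory_summary(); a sketch of how they can be captured around the call might look like this (the trainer object is just a placeholder for whatever holds model_large, optimizer_large, etc.):

import torch

print(torch.cuda.memory_summary(device=0))   # snapshot before the call

lc, lw, li, lr = trainer.trn_l(0.0, 0.0, 0.0, 0.0)   # placeholder call site

torch.cuda.empty_cache()
torch.cuda.synchronize()
print(torch.cuda.memory_summary(device=0))   # snapshot after the call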
Memory snapshot before the call:
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 33100 KB | 33219 KB | 40555 KB | 7455 KB |
| from large pool | 3072 KB | 3072 KB | 3072 KB | 0 KB |
| from small pool | 30028 KB | 30147 KB | 37483 KB | 7455 KB |
|---------------------------------------------------------------------------|
| Active memory | 33100 KB | 33219 KB | 40555 KB | 7455 KB |
| from large pool | 3072 KB | 3072 KB | 3072 KB | 0 KB |
| from small pool | 30028 KB | 30147 KB | 37483 KB | 7455 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 51200 KB | 51200 KB | 51200 KB | 0 B |
| from large pool | 20480 KB | 20480 KB | 20480 KB | 0 B |
| from small pool | 30720 KB | 30720 KB | 30720 KB | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 18100 KB | 20926 KB | 56892 KB | 38792 KB |
| from large pool | 17408 KB | 18944 KB | 18944 KB | 1536 KB |
| from small pool | 692 KB | 2047 KB | 37948 KB | 37256 KB |
|---------------------------------------------------------------------------|
| Allocations | 12281 | 12414 | 12912 | 631 |
| from large pool | 2 | 2 | 2 | 0 |
| from small pool | 12279 | 12412 | 12910 | 631 |
|---------------------------------------------------------------------------|
| Active allocs | 12281 | 12414 | 12912 | 631 |
| from large pool | 2 | 2 | 2 | 0 |
| from small pool | 12279 | 12412 | 12910 | 631 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 16 | 16 | 16 | 0 |
| from large pool | 1 | 1 | 1 | 0 |
| from small pool | 15 | 15 | 15 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 3 | 30 | 262 | 259 |
| from large pool | 1 | 1 | 1 | 0 |
| from small pool | 2 | 29 | 261 | 259 |
|===========================================================================|
After calling the function and then running
torch.cuda.empty_cache()
torch.cuda.synchronize()
we get:
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 10957 KB | 8626 MB | 272815 MB | 272804 MB |
| from large pool | 0 KB | 8596 MB | 272477 MB | 272477 MB |
| from small pool | 10957 KB | 33 MB | 337 MB | 327 MB |
|---------------------------------------------------------------------------|
| Active memory | 10957 KB | 8626 MB | 272815 MB | 272804 MB |
| from large pool | 0 KB | 8596 MB | 272477 MB | 272477 MB |
| from small pool | 10957 KB | 33 MB | 337 MB | 327 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 8818 MB | 9906 MB | 19618 MB | 10800 MB |
| from large pool | 8784 MB | 9874 MB | 19584 MB | 10800 MB |
| from small pool | 34 MB | 34 MB | 34 MB | 0 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory | 5427 KB | 3850 MB | 207855 MB | 207850 MB |
| from large pool | 0 KB | 3850 MB | 207494 MB | 207494 MB |
| from small pool | 5427 KB | 5 MB | 360 MB | 355 MB |
|---------------------------------------------------------------------------|
| Allocations | 3853 | 13391 | 34339 | 30486 |
| from large pool | 0 | 557 | 12392 | 12392 |
| from small pool | 3853 | 12838 | 21947 | 18094 |
|---------------------------------------------------------------------------|
| Active allocs | 3853 | 13391 | 34339 | 30486 |
| from large pool | 0 | 557 | 12392 | 12392 |
| from small pool | 3853 | 12838 | 21947 | 18094 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 226 | 226 | 410 | 184 |
| from large pool | 209 | 209 | 393 | 184 |
| from small pool | 17 | 17 | 17 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 46 | 358 | 12284 | 12238 |
| from large pool | 0 | 212 | 7845 | 7845 |
| from small pool | 46 | 279 | 4439 | 4393 |
|===========================================================================|

Best answer

I don't think the other answer is correct. Allocation and deallocation definitely do happen at runtime; the thing to note is that the CPU code runs asynchronously with respect to the GPU code, so if you want the memory to actually be available afterwards, you need to wait for any deallocations to take place. Take a look at this:

import torch

# allocate a tensor on the GPU
a = torch.zeros(100, 100, 100).cuda()

print(torch.cuda.memory_allocated())

# drop the only reference and wait for the asynchronous free to finish
del a
torch.cuda.synchronize()
print(torch.cuda.memory_allocated())
Output:
4000256
0
So you should del the tensors you no longer need and call torch.cuda.synchronize() to make sure the deallocation has finished before your CPU code continues.
In your specific case, after your function trn_l returns, its local variables, which are not referenced anywhere else, will be released together with the corresponding GPU tensors. All you need to do is wait for that to happen by calling torch.cuda.synchronize() after the function call.
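A minimal sketch of that pattern at the call site (the trainer object and the argument values are placeholders, not part of your code):

totals = trainer.trn_l(0.0, 0.0, 0.0, 0.0)   # placeholder call

# wait for all queued CUDA work, including the frees triggered when the
# function's locals went out of scope
torch.cuda.synchronize()
print(torch.cuda.memory_allocated())         # should now be close to zero

# optionally hand cached-but-unused blocks back to the driver so other
# processes can use them
torch.cuda.empty_cache()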

Regarding "python - How to make sure PyTorch has deallocated GPU memory?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63145729/
