
machine-learning - System hangs after the first epoch of training in PyTorch

Reposted. Author: 行者123. Updated: 2023-12-04 11:46:31

I am trying to train a ResNet model in PyTorch using the ImageNet example from the GitHub repository.

This is what my train method looks like (it is almost identical to the one in the example):

import time

import torch


def train(train_loader, model, criterion, optimizer, epoch):
    batch_time = AverageMeter()
    data_time = AverageMeter()
    losses = AverageMeter()
    top1 = AverageMeter()
    top5 = AverageMeter()

    args = get_args()

    # switch to train mode
    model.train()

    end = time.time()

    for i, (input, target) in enumerate(train_loader):
        print(i)
        # data loading time
        data_time.update(time.time() - end)

        if cuda:
            # `async` became a reserved word in Python 3.7; PyTorch 0.4.0
            # accepts `non_blocking` for the same purpose.
            target = target.cuda(non_blocking=True)
            input_var = torch.autograd.Variable(input).cuda()
        else:
            input_var = torch.autograd.Variable(input)

        target_var = torch.autograd.Variable(target)

        # compute output
        output = model(input_var)
        loss = criterion(output, target_var)

        # measure accuracy and record loss
        prec1, prec5 = accuracy(output.data, target, topk=(1, 5))
        losses.update(loss.item(), input.size(0))
        top1.update(prec1.item(), input.size(0))
        # top5.update(prec5.item(), input.size(0))

        # compute gradient and do optimizer step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # measure elapsed time
        batch_time.update(time.time() - end)
        end = time.time()

        # print to console and write logs to tensorboard
        if i % args.print_freq == 0:
            print('Epoch: [{0}][{1}/{2}]\t'
                  'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
                  'Data {data_time.val:.3f} ({data_time.avg:.3f})\t'
                  'Loss {loss.val:.4f} ({loss.avg:.4f})\t'
                  'Prec@1 {top1.val:.3f} ({top1.avg:.3f})\t'.format(
                      epoch, i, len(train_loader), batch_time=batch_time,
                      data_time=data_time, loss=losses, top1=top1, top5=top5))
            niter = epoch * len(train_loader) + i
            # writer.add_scalar('Train/Loss', losses.val, niter)
            # writer.add_scalar('Train/Prec@1', top1.val, niter)
            # writer.add_scalar('Train/Prec@5', top5.val, niter)
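
The train method relies on an AverageMeter helper that the question does not show. A minimal sketch of such a helper follows, assuming it behaves like the one in the PyTorch ImageNet example (a latest value `val` plus a running weighted average `avg`); the asker's actual class may differ.

```python
class AverageMeter:
    """Tracks the most recent value and a running weighted average.

    Assumption: modelled on the helper in the PyTorch ImageNet example;
    the class used in the question is not shown.
    """

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0.0    # most recent value
        self.sum = 0.0    # weighted sum of all values seen so far
        self.count = 0    # total weight (e.g. number of samples)
        self.avg = 0.0    # running average = sum / count

    def update(self, val, n=1):
        # `n` is the weight of this value, typically the batch size.
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


# Two batches of 32 samples with mean losses 2.0 and 4.0.
meter = AverageMeter()
meter.update(2.0, n=32)
meter.update(4.0, n=32)
print('Loss {loss.val:.4f} ({loss.avg:.4f})'.format(loss=meter))
```

This is why the log line prints two numbers per field: the value for the current batch, then the average over the epoch so far in parentheses.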

System information:
GPU: NVIDIA Titan XP
RAM: 32 GB

PyTorch: 0.4.0

When I run this code, training starts at epoch 0:
Epoch: [0][0/108]   Time 5.644 (5.644)  Data 1.929 (1.929)  Loss 6.9052 (6.9052)    Prec@1 0.000 (0.000)

Then the remote server disconnects on its own. This has happened five times.

Here is the data loader:
# Load the data --> Train
traindir = 'train'
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
train_dataset = datasets.ImageFolder(traindir, transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
]))
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=args.batch_size, shuffle=True, num_workers=args.num_workers,
    pin_memory=cuda
)

# Load the data --> Validation
valdir = 'valid'
valid_loader = torch.utils.data.DataLoader(
    datasets.ImageFolder(valdir, transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        normalize,
    ])),
    batch_size=args.batch_size, shuffle=False, num_workers=args.num_workers,
    pin_memory=cuda
)

if args.evaluate:
    validate(valid_loader, model, criterion, epoch=0)
    return

# Start
for epoch in range(args.start_epoch, args.epochs):
    adjust_learning_rate(optimizer, epoch)

    # train for epoch
    train(train_loader, model, criterion, optimizer, epoch)

    # evaluate on valid
    prec1 = validate(valid_loader, model, criterion, epoch)

    # remember best prec1 and save checkpoint
    is_best = prec1 > best_prec1
    best_prec1 = max(prec1, best_prec1)
    save_checkpoint({
        'epoch': epoch + 1,
        'arch': args.arch,
        'state_dict': model.state_dict(),
        'best_prec1': best_prec1,
        'optimizer': optimizer.state_dict()
    }, is_best)

The loaders use these parameters:
args.num_workers = 4
args.batch_size = 32
pin_memory = torch.cuda.is_available()

Is there something wrong with my approach?

Best answer

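This looks like a bug in PyTorch's data loader.
Try args.num_workers = 0, which loads the data in the main process instead of in worker subprocesses.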
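
One way to try this without editing every DataLoader call is to build the loader arguments in one place. The sketch below is only an illustration: `loader_kwargs` and its `debug` flag are hypothetical names that do not appear in the question, while `batch_size`, `num_workers` and `pin_memory` are the DataLoader arguments the question already uses.

```python
def loader_kwargs(batch_size, use_cuda, debug=False, num_workers=4):
    """Build keyword arguments for torch.utils.data.DataLoader.

    Hypothetical helper: with debug=True the loader runs in the main
    process (num_workers=0), so no worker subprocesses are started.
    """
    return {
        'batch_size': batch_size,
        'num_workers': 0 if debug else num_workers,
        'pin_memory': use_cuda,
    }


# Settings from the question, with worker processes switched off.
kwargs = loader_kwargs(batch_size=32, use_cuda=True, debug=True)
print(kwargs)

# Assumed usage with the question's dataset (requires torch):
# train_loader = torch.utils.data.DataLoader(
#     train_dataset, shuffle=True, **kwargs)
```

If training completes with num_workers set to 0, that points at the worker processes rather than at the model or the training loop, at the cost of slower data loading.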

Regarding "machine-learning - System hangs after the first epoch of training in PyTorch", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50911880/
