
Faster RCNN loss is NaN from the beginning




I'm using PyTorch with Faster RCNN on a dataset that has 2 classes, with about 100 images for training and 35 for validation, in a multi-node, multi-GPU environment. Just to simplify debugging, I'm running on a single GPU for the moment, with batch size = 1.
The problem is that the training loss is NaN from the very beginning, even with other learning rate values. My dataset is in COCO format. I checked the training annotations and they are OK.
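
For reference, a quick way to scan a COCO file for degenerate boxes (zero or negative width/height), which are a common cause of NaN losses in detection training, would be something like this sketch using the standard pycocotools API (the min_size threshold of 1 pixel is an arbitrary choice):

from pycocotools.coco import COCO

def find_degenerate_boxes(annotations_file, min_size=1.0):
    # Report annotations whose box width or height is below min_size.
    coco = COCO(annotations_file)
    bad = []
    for ann in coco.loadAnns(coco.getAnnIds()):
        x, y, w, h = ann['bbox']  # COCO format: [x_min, y_min, width, height]
        if w < min_size or h < min_size:
            bad.append((ann['image_id'], ann['id'], ann['bbox']))
    return bad

# e.g. print(find_degenerate_boxes(train_annotations_file))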



The following is my setup:



train_dataset = CustomCOCODataset(root_dir=train_image_folder, annotations_file=train_annotations_file, transforms=transform)
val_dataset = CustomCOCODataset(root_dir=val_image_folder, annotations_file=val_annotations_file, transforms=transform)

train_sampler = DistributedSampler(train_dataset)
val_sampler = DistributedSampler(val_dataset)

batch_size = 1
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=(train_sampler is None), sampler=train_sampler)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=(val_sampler is None), sampler=val_sampler)

num_train_batches = len(train_dataloader)
num_val_batches = len(val_dataloader)
num_classes = 3  # 2 classes (cat, dog) + background

model = fasterrcnn_resnet50_fpn(pretrained_backbone=True, num_classes=num_classes)
model = model.to(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.0005)
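
One thing to note: I don't pass a collate_fn to the DataLoaders. torchvision's detection reference scripts define one that keeps the images and targets as lists rather than letting the default collate try to stack them; a minimal sketch of that pattern:

def collate_fn(batch):
    # Keep images and targets as tuples of lists; detection samples can
    # have different image sizes and different numbers of boxes.
    return tuple(zip(*batch))

train_dataloader = DataLoader(train_dataset, batch_size=batch_size,
                              shuffle=(train_sampler is None),
                              sampler=train_sampler, collate_fn=collate_fn)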

My training and validation loop:



for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for i, (images, targets) in enumerate(train_dataloader):
        images = [image.to(local_rank) for image in images]

        targets = [{k: v.to(local_rank) for k, v in t.items()} for t in targets]

        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())

        losses.backward()  # Compute the gradients; this is where DDP broadcasts the gradients

        optimizer.step()  # Update the model weights

        running_loss += losses.item()

    print(f"Epoch: {epoch} Loss: {running_loss / (i + 1)}")

    cpu_device = torch.device("cpu")
    model.eval()
    target = []
    preds = []
    metric_summary = {}

    for i, (images, targets) in enumerate(val_dataloader):

        images = [image.to(local_rank) for image in images]

        with torch.no_grad():
            outputs = model(images)

        # Skip images with no predictions
        if outputs[0]['boxes'].numel() == 0:
            continue

        #####################################
        for i in range(len(images)):
            true_dict = dict()
            preds_dict = dict()
            true_dict['boxes'] = targets[i]['boxes'].detach().cpu()
            true_dict['labels'] = targets[i]['labels'].detach().cpu()
            preds_dict['boxes'] = outputs[i]['boxes'].detach().cpu()
            preds_dict['scores'] = outputs[i]['scores'].detach().cpu()
            preds_dict['labels'] = outputs[i]['labels'].detach().cpu()
            preds.append(preds_dict)
            target.append(true_dict)
        #####################################

    metric = MeanAveragePrecision()

    # Copy preds and target to the GPU
    for i in range(len(preds)):
        for key in preds[i]:
            preds[i][key] = preds[i][key].to(local_rank)  # "device" should be the correct GPU
        for key in target[i]:
            target[i][key] = target[i][key].to(local_rank)

    metric.update(preds, target)
    metric_summary = metric.compute()

    if 'map' in metric_summary:
        print(f"Epoch: {epoch} Metric summary: {metric_summary['map']}")

    if global_rank == 0:
        torch.save(model.module.state_dict(), f'modello_salvato_cat_dog_epoch_{epoch+1}.pth')
        print("Model saved!")
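
(Side note: since I use DistributedSampler, the PyTorch reference scripts also call set_epoch at the top of each epoch so that shuffling differs across epochs, e.g.:)

for epoch in range(num_epochs):
    train_sampler.set_epoch(epoch)  # re-seed the sampler so each epoch shuffles differently
    model.train()
    ...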

I wrote a CustomCOCODataset class, where in __getitem__ I convert the boxes as:



torchvision.ops.box_convert(torch.tensor(annotation['bbox']), in_fmt="xywh", out_fmt="xyxy")

This conversion should be correct, since I start from a COCO dataset, where the bbox format is xywh.
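
As a quick check of the conversion itself, a COCO box [x, y, w, h] maps to [x, y, x + w, y + h]:

import torch
import torchvision

# COCO bbox [x_min, y_min, width, height] -> xyxy [x_min, y_min, x_max, y_max]
box = torch.tensor([10.0, 20.0, 30.0, 40.0])
print(torchvision.ops.box_convert(box, in_fmt="xywh", out_fmt="xyxy"))
# tensor([10., 20., 40., 60.])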



The following is my complete custom class:



class CustomCOCODataset(Dataset):
    def __init__(self, root_dir, annotations_file, transforms=None):
        self.root_dir = root_dir
        self.transforms = transforms
        self.coco = COCO(annotations_file)
        self.image_ids = list(self.coco.imgs.keys())
        self.image_ids = self.filter_empty_images()
        self.log_annot_issue_x = True
        self.log_annot_issue_y = True
        self.square_training = False
        self.img_size = 640  # said to be the default

    def __len__(self):
        return len(self.image_ids)

    def filter_empty_images(self):
        # Filter out images without annotations
        valid_image_ids = []
        for img_id in self.image_ids:
            annotation_ids = self.coco.getAnnIds(imgIds=img_id)
            annotations = self.coco.loadAnns(annotation_ids)
            if annotations:
                valid_image_ids.append(img_id)
        return valid_image_ids

    def load_image_and_annotations(self, index):
        img_id = self.image_ids[index]
        img_info = self.coco.loadImgs(img_id)[0]
        img_path = os.path.join(self.root_dir, img_info['file_name'])
        image = cv2.imread(img_path)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)
        annotation_ids = self.coco.getAnnIds(imgIds=img_info['id'])

        # Get all the annotations for this image
        annotations = self.coco.loadAnns(annotation_ids)

        boxes = []
        orig_boxes = []
        labels = []
        image_width = image.shape[1]
        image_height = image.shape[0]

        for annotation in annotations:
            # Extract the bounding box coordinates
            xmin = annotation['bbox'][0]
            ymin = annotation['bbox'][1]
            xmax = annotation['bbox'][2]
            ymax = annotation['bbox'][3]

            orig_boxes.append([xmin, ymin, xmax, ymax])

            xmin_final = xmin
            xmax_final = xmax
            ymin_final = ymin
            ymax_final = ymax

            boxes.append([xmin_final, ymin_final, xmax_final, ymax_final])

        # Note: boxes, orig_boxes and labels are built here but never returned
        return image, annotations

    def __getitem__(self, idx):

        image, annotations = self.load_image_and_annotations(idx)
        # print("idx: ", idx, "image_size: ", image.size)

        if self.transforms:
            image = self.transforms(image)

        target_list = []

        target = {
            'boxes': [],
            'labels': [],
            'area': [],
            'iscrowd': []
        }

        image_id = torch.tensor([idx])

        for annotation in annotations:
            target = {
                'image_id': image_id,
                'boxes': torchvision.ops.box_convert(torch.tensor(annotation['bbox']), in_fmt="xywh", out_fmt="xyxy"),
                'labels': torch.tensor(annotation['category_id']),
                'area': torch.tensor(annotation['area']),
                'iscrowd': torch.tensor(annotation['iscrowd'])
            }
            target_list.append(target)

        return image, target_list
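
For comparison, the torchvision Faster RCNN docs describe the target for each image as a single dict whose 'boxes' entry is a FloatTensor of shape [N, 4] and whose 'labels' entry is an Int64Tensor of shape [N], rather than a list with one dict per annotation as I return above. A sketch of what the tail of __getitem__ would look like under that convention (my reading of the docs, not tested code):

        # One dict per image, stacking all annotations:
        boxes = [torchvision.ops.box_convert(torch.tensor(a['bbox']), in_fmt="xywh", out_fmt="xyxy")
                 for a in annotations]
        target = {
            'image_id': torch.tensor([idx]),
            'boxes': torch.stack(boxes).float() if boxes else torch.zeros((0, 4)),  # FloatTensor[N, 4]
            'labels': torch.tensor([a['category_id'] for a in annotations], dtype=torch.int64),  # Int64Tensor[N]
            'area': torch.tensor([a['area'] for a in annotations]),
            'iscrowd': torch.tensor([a['iscrowd'] for a in annotations], dtype=torch.int64),
        }
        return image, target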

Why does this happen, and how can I resolve it?


