I have some simple model code doing transfer learning from a resnet. When I run it without distributed training everything works fine, but when I try it in distributed mode I get this strange error:

Error detected in CudnnBatchNormBackward.

followed by:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 4; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

The model in question is a plain resnet, loaded as follows:

import torch
import torch.nn.functional as F
from torchvision import models

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Load a pretrained resnet50 and convert its BatchNorm layers to SyncBatchNorm
        backbone = models.resnet50(pretrained=True)
        self.backbone = torch.nn.SyncBatchNorm.convert_sync_batchnorm(backbone)
        # Drop the classification head; we only want the pooled features
        self.backbone.fc = torch.nn.Identity()

    def forward(self, x):
        y1 = self.backbone(x)
        y_n = F.normalize(y1, dim=-1)  # L2-normalize the embeddings
        return y_n

The training loop looks like this:

    for batch_idx, ((img1, img2), _) in enumerate(train_loader):

        if args.gpu is not None:
            img1 = img1.cuda(args.gpu, non_blocking=True)
            img2 = img2.cuda(args.gpu, non_blocking=True)

        optimizer.zero_grad()

        # Two forward passes through the same model, one per augmented view
        out_1 = model(img1)
        out_2 = model(img2)

        loss = contrastive_loss_fn(out_1, out_2)

        loss.backward()

        optimizer.step()

It is loss.backward that actually raises the error. I suspect it is because I run the model forward twice before calling backward, but I am not sure.
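For illustration, here is a minimal sketch of the one-forward-pass variant I have been wondering about. This is purely my own guess at a workaround (the idea being that each BatchNorm layer then only updates its buffers once per step), not something I have verified:

    # Sketch (untested guess): run both augmented views through the model in
    # a single forward pass, then split the embeddings back into two halves.
    out = model(torch.cat([img1, img2], dim=0))
    out_1, out_2 = out.chunk(2, dim=0)

    loss = contrastive_loss_fn(out_1, out_2)
    loss.backward()
    optimizer.step()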

Any pointers would be great.. I have not been able to solve this for days!

PS: I have already tried cloning the outputs and using SyncBatchNorm.. neither seems to help!
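In case the details matter, the cloning attempt looked roughly like this (a sketch from memory, so the exact placement may have differed):

    # Clone the outputs right after the forward passes, hoping to decouple
    # the tensors saved for backward from any later in-place modification.
    out_1 = model(img1).clone()
    out_2 = model(img2).clone()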
