I have some simple model code doing transfer learning from a resnet; when I run it without distributed training, everything works fine. However, when I try it in distributed mode, I get this strange error:
Error detected in CudnnBatchNormBackward.
followed by:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 4; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
The model in question is a plain resnet, loaded like this:
import torch
import torch.nn.functional as F
from torchvision import models

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Pretrained resnet50 with every BatchNorm layer converted to SyncBatchNorm
        backbone = models.resnet50(pretrained=True)
        self.backbone = torch.nn.SyncBatchNorm.convert_sync_batchnorm(backbone)
        # Drop the classification head; expose the pooled features directly
        self.backbone.fc = torch.nn.Identity()

    def forward(self, x):
        y1 = self.backbone(x)
        y_n = F.normalize(y1, dim=-1)  # L2-normalize the embedding
        return y_n
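For context, the distributed setup wraps the model in DistributedDataParallel in the usual way, roughly like this (a simplified sketch; the process-group init flags and launch script are elided):

# Rough sketch of the distributed wrapping (simplified)
torch.distributed.init_process_group(backend="nccl")
model = Net().cuda(args.gpu)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])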
The training loop looks like this:
for batch_idx, ((img1, img2), _) in enumerate(train_loader):
    if args.gpu is not None:
        img1 = img1.cuda(args.gpu, non_blocking=True)
        img2 = img2.cuda(args.gpu, non_blocking=True)
    optimizer.zero_grad()
    # two forward passes through the same model, one per augmented view
    out_1 = model(img1)
    out_2 = model(img2)
    loss = contrastive_loss_fn(out_1, out_2)
    loss.backward()
    optimizer.step()
It's the loss.backward() call that fails. I suspect it's because I run the model forward twice before calling backward, but I'm not sure.
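Would folding both views into a single forward pass be the right direction, so the (Sync)BatchNorm buffers only get updated once per step instead of between the two forwards? A minimal sketch of what I mean (assuming contrastive_loss_fn only needs the two embedding batches):

    # Single forward pass over both views, then split the embeddings,
    # so the BatchNorm running stats are only written once per iteration.
    both = torch.cat([img1, img2], dim=0)
    out = model(both)
    out_1, out_2 = out.chunk(2, dim=0)
    loss = contrastive_loss_fn(out_1, out_2)
    loss.backward()
    optimizer.step()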
Any pointers would be greatly appreciated.. I haven't been able to solve this for days!
PS: I've already tried cloning the outputs and using SyncBatchNorm .. neither seems to help!
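I also came across DistributedDataParallel's broadcast_buffers flag. My understanding (possibly wrong) is that DDP re-broadcasts buffers such as the BatchNorm running stats on each forward, which would modify them in place between my two forward passes. Disabling it would look like the sketch below; whether this actually fixes the error here is my assumption, not something I've confirmed:

    # Sketch: turn off DDP's per-forward buffer broadcast so BN running
    # stats are not rewritten in place between the two forward passes.
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[args.gpu], broadcast_buffers=False
    )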