I have some simple model code doing transfer learning from a resnet; when I run it without distributed training, everything works fine. However, when I try it in distributed mode, I get this strange error:
Error detected in CudnnBatchNormBackward.
followed by:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 4; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
The model in question is a plain resnet, loaded like this:
import torch
import torch.nn.functional as F
from torchvision import models

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Pretrained resnet50 with every BatchNorm layer converted to SyncBatchNorm
        backbone = models.resnet50(pretrained=True)
        self.backbone = torch.nn.SyncBatchNorm.convert_sync_batchnorm(backbone)
        # Drop the classification head; expose the pooled features directly
        self.backbone.fc = torch.nn.Identity()

    def forward(self, x):
        y1 = self.backbone(x)
        y_n = F.normalize(y1, dim=-1)  # L2-normalize the embedding
        return y_n
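For context, the distributed setup wraps the model in DistributedDataParallel in the usual way, roughly like this (a simplified sketch; the process-group init flags and launch script are elided):

# Rough sketch of the distributed wrapping (simplified)
torch.distributed.init_process_group(backend="nccl")
model = Net().cuda(args.gpu)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])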
The training loop looks like this:
for batch_idx, ((img1, img2), _) in enumerate(train_loader):
    if args.gpu is not None:
        img1 = img1.cuda(args.gpu, non_blocking=True)
        img2 = img2.cuda(args.gpu, non_blocking=True)
    optimizer.zero_grad()
    # two forward passes through the same model, one per augmented view
    out_1 = model(img1)
    out_2 = model(img2)
    loss = contrastive_loss_fn(out_1, out_2)
    loss.backward()
    optimizer.step()
It's the loss.backward() call that fails. I suspect it's because I run the model forward twice before calling backward, but I'm not sure.
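Would folding both views into a single forward pass be the right direction, so the (Sync)BatchNorm buffers only get updated once per step instead of between the two forwards? A minimal sketch of what I mean (assuming contrastive_loss_fn only needs the two embedding batches):

    # Single forward pass over both views, then split the embeddings,
    # so the BatchNorm running stats are only written once per iteration.
    both = torch.cat([img1, img2], dim=0)
    out = model(both)
    out_1, out_2 = out.chunk(2, dim=0)
    loss = contrastive_loss_fn(out_1, out_2)
    loss.backward()
    optimizer.step()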
Any pointers would be greatly appreciated.. I haven't been able to solve this for days!
PS: I've already tried cloning the outputs and using SyncBatchNorm .. neither seems to help!
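I also came across DistributedDataParallel's broadcast_buffers flag. My understanding (possibly wrong) is that DDP re-broadcasts buffers such as the BatchNorm running stats on each forward, which would modify them in place between my two forward passes. Disabling it would look like the sketch below; whether this actually fixes the error here is my assumption, not something I've confirmed:

    # Sketch: turn off DDP's per-forward buffer broadcast so BN running
    # stats are not rewritten in place between the two forward passes.
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[args.gpu], broadcast_buffers=False
    )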