I'm fairly new to PyTorch, but I have solid experience with Keras and TensorFlow. I followed this article to use DDP (DistributedDataParallel) in my own training script, but for some reason I always end up with:
Process 0 terminated with exit status 1.
I've tried running the same code on several GPU platforms (Google Colab, Kaggle, FloydHub), and almost all of them give me the same error.
I also tried disabling the join=True option, but then training never even starts.
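For reference, this is how I understand the join flag from the torch.multiprocessing docs (a minimal sketch; fn and world_size here are placeholders, not names from my script). With join=False, spawn returns a ProcessContext that the parent has to join itself; if the parent just exits, the workers never get to run, which would explain why training doesn't start:

import torch.multiprocessing as mp

def fn(rank, world_size):
    # trivial worker just to illustrate the join semantics
    print(f"worker {rank} of {world_size}")

if __name__ == "__main__":
    world_size = 2
    # join=False returns a ProcessContext instead of blocking;
    # the parent must keep calling join() until all workers finish.
    ctx = mp.spawn(fn, args=(world_size,), nprocs=world_size, join=False)
    while not ctx.join():
        pass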
The DDP-related code:
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Explicitly setting seed to make sure that models created in two processes
    # start from same random weights and biases.
    torch.manual_seed(42)

def cleanup():
    dist.destroy_process_group()

def run_demo(fn, *args):
    mp.spawn(
        fn,
        args=(args[0], args[1], args[2], args[3], args[4]),
        nprocs=1,   # Also tried 2, but no difference
        join=True
    )
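For comparison, this is the spawn pattern from the tutorial as far as I understand it (a minimal sketch; demo_fn is a placeholder). mp.spawn prepends the process rank as the first argument to the target function, and nprocs is normally set to world_size, which doesn't match what my run_demo does above:

# Sketch of the tutorial's pattern; demo_fn is a placeholder.
# mp.spawn calls demo_fn(rank, *args), i.e. the rank is injected automatically.
def demo_fn(rank, world_size):
    setup(rank, world_size)   # each process reports its own rank
    # ... build the model, wrap it in DDP, run the training loop ...
    cleanup()

def run_demo(demo_fn, world_size):
    mp.spawn(
        demo_fn,
        args=(world_size,),   # rank is NOT in args; spawn adds it
        nprocs=world_size,
        join=True
    )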
And my training code:
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model, X, batch_size=32, epochs=75, gradient_acc=0):
    setup(1, 2)
    device = model.get_default_device()
    model = model.to(device, non_blocking=True)
    ddp_model = DDP(model, device_ids=[0])  # Only one GPU

    # ...

    ddp_model.hidden_enc = ddp_model.init_hidden_enc()
    ddp_model.hidden_dec = ddp_model.init_hidden_dec()
    ddp_model.train()

    for ep in range(epochs):
        loss_br = 0
        nb_batch_steps = 0
        for step, batch in enumerate(data_loader):
            batch = batch.to(device, non_blocking=True)
            nb_batch_steps += 1
            loss = ddp_model(batch)
            # ...

    cleanup()
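In case it matters, here is how I would expect the per-rank device mapping to look, based on the examples I've seen (a sketch, assuming one GPU per process; rank would have to come from mp.spawn instead of the hard-coded setup(1, 2) in my code):

# Sketch of per-rank device placement, assuming one GPU per process.
# rank is injected by mp.spawn; it is not hard-coded as in my train() above.
def train(rank, world_size, model, X, batch_size=32, epochs=75, gradient_acc=0):
    setup(rank, world_size)
    device = torch.device(f"cuda:{rank}")   # pin each process to its own GPU
    model = model.to(device)
    ddp_model = DDP(model, device_ids=[rank])
    # ... training loop as above ...
    cleanup()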
Calling the training code:
if __name__ == "__main__":
    run_demo(
        train,
        model,
        holder[:],  # X
        32,         # batch_size
        75,         # epochs
        3           # gradient_acc
    )
I expect the model to run on multiple processes using the distributed data parallel package. Interestingly, sometimes I get a CUDA Out of Memory exception when running without DDP. I know that spawn.py terminates all processes if any of the available processes exits with a status code > 1, but I still haven't figured out how to avoid the problem. Any help is greatly appreciated.
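The only workaround I've come up with for seeing the real error is to wrap the worker so each process prints its own traceback before spawn collapses everything into the exit-status message (a sketch; wrapped_train is a hypothetical helper, and my current train() does not take a rank argument):

import traceback

# Hypothetical wrapper: print the worker's own traceback before it dies,
# since the parent only reports "terminated with exit status 1".
def wrapped_train(rank, *args):
    try:
        train(*args)  # drop the injected rank; my train() doesn't accept one
    except Exception:
        print(f"[rank {rank}] crashed with:")
        traceback.print_exc()
        raise

# mp.spawn(wrapped_train, args=(model, holder[:], 32, 75, 3), nprocs=1, join=True)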