python - Save, restore, and keep updating learning curves when training a CNN if the server crashes unexpectedly

Question

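I am training a deep learning model with TensorFlow on a remote server. The problem is that I am only allocated 2 hours of training time per session, and the server can crash at any moment for a variety of reasons.

I know my model needs at least 48 hours of training to finish. Once the model is fully trained (48+ hours), I want to be able to show the training curves from start to finish, with no gaps in between.

I can use a callback (saving the best weights) to pick training back up from where it last crashed, but I am not sure how to do the same for the training curves (loss + accuracy).

Any help would be greatly appreciated.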

score 0 · Accepted Answer

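TensorBoard keeps its event logs on disk automatically (or you can reset them), but you can also record these values yourself, either to a plain-text file or through summaries.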
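
As a rough illustration of the plain-text option, the sketch below reads a per-epoch CSV log back after a restart and works out which epoch to continue from. It assumes the log has an `epoch` column plus one column per metric, which is the layout I would expect from Keras's `CSVLogger` callback with `append=True`. The function name `load_training_log` and the file name in the usage note are hypothetical.

```python
import csv


def load_training_log(path):
    """Read a per-epoch CSV log and return (columns, next_epoch).

    columns maps each column name to a list of floats, one per complete row.
    next_epoch is the epoch index to continue from (0 if the log is empty).
    """
    columns = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # A crash can leave a half-written last row; skip incomplete rows.
            if any(value is None or value == "" for value in row.values()):
                continue
            for name, value in row.items():
                columns.setdefault(name, []).append(float(value))
    epochs = columns.get("epoch", [])
    next_epoch = int(epochs[-1]) + 1 if epochs else 0
    return columns, next_epoch
```

After a restart, something like `columns, next_epoch = load_training_log("training_log.csv")` would give the full curve so far, and `next_epoch` could be passed to `model.fit(..., initial_epoch=next_epoch)`. Note that if the checkpoint only stores the best weights, the restored weights may be older than the last logged epoch.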

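history = model.fit(batched_features, epochs=1, validation_data=batched_features, callbacks=[custom_callback, tb_callback]) returns a history object. From history.history['loss'] and history.history['accuracy'] (for example history.history['accuracy'][0] for the first epoch) you can easily plot the curves with matplotlib or write them out as summaries.

You can also use merged summaries or a loss summary from tf.summary.

You can access the history either from the value returned by fit or from inside a callback at the end of each epoch.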
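
One caveat: the object returned by fit only exists if fit finishes, so on a server that may crash the callback route is the safer of the two. Here is a minimal sketch of that idea, not a definitive implementation: a callback that appends each epoch's logs to a JSON-lines file. The class name `JsonHistoryLogger`, the helper `read_history`, and the file layout are my own assumptions. The `object` fallback base class is only there so the sketch can be tried without TensorFlow installed.

```python
import json

try:  # Use the real Keras base class when TensorFlow is installed.
    import tensorflow as tf
    _Base = tf.keras.callbacks.Callback
except ImportError:  # Fallback so the sketch can be exercised without TensorFlow.
    _Base = object


class JsonHistoryLogger(_Base):
    """Append each epoch's metrics to a JSON-lines file (hypothetical helper)."""

    def __init__(self, path):
        super().__init__()
        self.path = path

    def on_epoch_end(self, epoch, logs=None):
        # Convert metric values to plain floats so they serialize cleanly.
        record = {"epoch": int(epoch)}
        record.update({k: float(v) for k, v in (logs or {}).items()})
        # Opening in append mode keeps records from earlier sessions.
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")


def read_history(path):
    """Return the list of per-epoch records written so far."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because the file is opened in append mode, a new session started with `initial_epoch` set to the number of completed epochs continues the same file, and `read_history` returns the curve across all sessions.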


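1. history
import matplotlib.pyplot as plt

history = model_highscores.fit(batched_features, epochs=1000, validation_data=dataset.shuffle(len(list_image)), callbacks=[custom_callback])
print(history.params)                   # e.g. {'verbose': 1, 'epochs': 1000, 'steps': 2}
print(history.history.keys())           # dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])

# Plot both curves against the epoch index; the x and y lists must have the same length.
epochs = range(len(history.history['loss']))
plt.plot(epochs, history.history['loss'], label='loss')
plt.plot(epochs, history.history['accuracy'], label='accuracy')
plt.legend()
plt.show()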
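
Since each 2-hour session produces its own history, the pieces have to be joined before plotting. The following is a sketch under the assumption that you have saved each session's `history.history` dict together with the `initial_epoch` it started from. The function name `merge_histories` is hypothetical. When a crash forces some epochs to be re-run, the later session's values replace the earlier ones.

```python
def merge_histories(sessions):
    """Combine (initial_epoch, history_dict) pairs into one curve per metric.

    If a later session repeats an epoch, its value replaces the earlier one.
    Returns (epochs, curves), where curves maps metric name to a list of values.
    """
    by_epoch = {}
    for initial_epoch, history in sessions:
        for metric, values in history.items():
            for offset, value in enumerate(values):
                by_epoch.setdefault(initial_epoch + offset, {})[metric] = value
    epochs = sorted(by_epoch)
    metrics = sorted({m for row in by_epoch.values() for m in row})
    # Metrics missing for an epoch are filled with None.
    curves = {m: [by_epoch[e].get(m) for e in epochs] for m in metrics}
    return epochs, curves
```

The result can then be plotted in one go, for example `plt.plot(epochs, curves['loss'])`, giving a single uninterrupted curve across all sessions.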


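2. summary
with tf.summary.create_file_writer(val_dir).as_default():  # TF 2.x needs a default writer; val_dir is the log directory
    tf.summary.scalar('loss', 0.5, step=0)                 # step is the x-axis position in TensorBoard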


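3. callback
import numpy as np
import tensorflow as tf

class CustomCallback(tf.keras.callbacks.Callback):
    def __init__(self, val_dir):
        super().__init__()
        self.val_dir, self._writers = val_dir, {}

    def _val_writer(self):
        if 'val' not in self._writers:  # create the validation writer once, then reuse it
            self._writers['val'] = tf.summary.create_file_writer(self.val_dir)
        return self._writers['val']

    def on_epoch_end(self, epoch, logs=None):
        print(self.model.inputs)
        # Model that returns every layer's output, for inspecting intermediate activations.
        feature_extractor = tf.keras.Model(inputs=self.model.inputs, outputs=[layer.output for layer in self.model.layers])
        x = tf.ones((1, 32, 32, 3))  # dummy batch; assumes the model takes 32x32 RGB input
        print([np.asarray(f).shape for f in feature_extractor(x)])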

...
