TL;DR - only resnet50 trains as expected; efficientnet_b4 and inception_v3 get stuck at some point.
I've been working on my own project, making predictions for the Kaggle Facial Keypoints Detection competition: https://www.kaggle.com/c/facial-keypoints-detection
I got some decent results with a pretrained Resnet50 and ranked fifth on the test set, so I tried to get more accurate predictions by switching the model to efficientnet_b4.
Unfortunately, something went wrong. My training gets stuck after about 15 epochs with high loss on both the training and validation sets: the training loss settles around 16 and the validation loss is not what I'd expect either. (For comparison, resnet reaches a training loss of 0.2 and a validation loss of 0.08.)
Here is the model code (from src-Models-Adjusted_model.py in my git repo) - this is the only thing that changes between the different model training runs:
from torchvision import models
import torch.nn as nn

# ResNet50: ImageNet weights, final fc replaced with 30 outputs (15 keypoints x 2 coordinates)
resnet50 = models.resnet50(pretrained=True)
resnet50.fc = nn.Linear(2048, 30)
resnet50.__name__ = "resnet50"

# EfficientNet-B4: built from scratch (no pretrained weights), 30-output regression head
efficientb4 = models.efficientnet_b4(num_classes=30)
efficientb4.__name__ = "efficientb4"

# Inception v3: ImageNet weights, auxiliary head disabled, final fc replaced with 30 outputs
inception_v3 = models.inception_v3(pretrained=True, aux_logits=False)
inception_v3.fc = nn.Linear(2048, 30)
inception_v3.__name__ = "inception_v3"
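Note that, unlike resnet50 and inception_v3, the efficientnet_b4 above is built without pretrained weights (only the 30-output head is specified). If I instead wanted to start from ImageNet weights and only swap the head, mirroring the resnet50 case, I believe it would look roughly like this (a sketch, not part of my repo; it assumes torchvision's EfficientNet classifier head ends in a Linear layer):

# Sketch only, not in my current code: pretrained EfficientNet-B4 with just the head replaced
efficientb4_pre = models.efficientnet_b4(pretrained=True)
in_features = efficientb4_pre.classifier[-1].in_features  # last Linear layer of the classifier head
efficientb4_pre.classifier[-1] = nn.Linear(in_features, 30)
efficientb4_pre.__name__ = "efficientb4_pretrained"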
Here is my training code (src-Models-train_model.py):
import torch
import src.data.make_dataset as dataset_py
import Adjusted_models as models
import time
from pathlib import Path
from src.utilities.logger import set_logger
def RMSELoss(pred, y):
    return torch.sqrt(torch.mean((pred - y) ** 2))
def RMSELoss_custom(pred, y):
"""
:param pred: the prediction of the model
:param y: the true labels
:return: root-mean-square error without considering the NaN labels
- each NaN's loss is equal to 0 so it won't affect the total loss
- to scale the loss per not-NaN value, we divide it by the not_nan amount and multiply by the total
predictions
"""
not_nan = (dataset_py.batch_size * 30 - y.isnan().sum())
return torch.sqrt(torch.mean((pred - y).nan_to_num() ** 2)) * dataset_py.batch_size * 30 / not_nan
epochs = 150
model = models.efficientb4  # attribute name as defined in Adjusted_models; swapped per run (resnet50 / efficientb4 / inception_v3)
model_name = model.__name__
logger = set_logger("./log/" + model_name + "_training.log")
model_dir = Path("./pt/" + model_name + "_model.pt")
learning_rate = 0.01
criterion = RMSELoss_custom
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', verbose=True, patience=5)
gpu = torch.cuda.is_available()
if gpu:
    torch.cuda.empty_cache()
    model.cuda()
train_losses, val_losses = [], []
val_loss_min = torch.inf
logger.debug(f"-------------------------Train-------------------------")
print("Started training")
for e in range(1, epochs + 1):
    start = time.perf_counter()
    model.train()
    train_loss = 0
    for images, labels in dataset_py.train_loader:
        if gpu:
            images = images.cuda()
            labels = labels.cuda()
        optimizer.zero_grad()
        prediction = model(images)
        loss = criterion(prediction, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    val_loss = 0
    with torch.no_grad():
        model.eval()
        for images, labels in dataset_py.val_loader:
            if gpu:
                images = images.cuda()
                labels = labels.cuda()
            prediction = model(images)
            loss = criterion(prediction, labels)
            val_loss += loss.item()
    scheduler.step(val_loss)
    train_losses.append(train_loss / len(dataset_py.train_loader))
    val_losses.append(val_loss / len(dataset_py.val_loader))
    logger.info("Epoch: {}/{} ".format(e, epochs) +
                "Training Loss: {:.4f} ".format(train_losses[-1]) +
                "Val Loss: {:.4f}".format(val_losses[-1]))
    if val_loss < val_loss_min:
        val_loss_min = val_loss
        torch.save(model.state_dict(), model_dir)
        logger.info('---Detected network improvement, saving current model--')
    end = time.perf_counter()
    total = (end - start) * (epochs - e)
    logger.info('----------------Estimated time: {:d}:{:d}:{:d}----------------'.format(int(total // 3600),
                                                                                         int(total % 3600 // 60),
                                                                                         int(total % 60)))
print('Done training!')
logger.info('Done training!')
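For what it's worth, here is a standalone sanity check of the NaN-masking idea in RMSELoss_custom (a sketch; the batch size of 2 and the masked_rmse name are illustrative and don't come from my repo):

import torch

def masked_rmse(pred, y, batch_size=2):
    # same idea as RMSELoss_custom above: NaN targets add zero error, then rescale by total/observed
    not_nan = batch_size * 30 - y.isnan().sum()
    return torch.sqrt(torch.mean((pred - y).nan_to_num() ** 2)) * batch_size * 30 / not_nan

pred = torch.zeros(2, 30)
y = torch.ones(2, 30)
y[0, :10] = float('nan')      # 10 of the 60 targets are missing
print(masked_rmse(pred, y))   # only the 50 observed targets drive the loss; NaN positions contribute nothing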
Here is my git repository with the full source code: https://github.com/Bar-A-94/Facial-keypoints-detection
Different checks / possible mistakes I've thought of:
- Hard-coded resnet - I debugged my training and did not see the model being anything other than the one I selected.
- Overfitting - I tried to overfit on 5 images: resnet reaches a training loss of 0.05, while the others get stuck around 3 (see the sketch after this list).
- Side effects of the custom loss - I tried the regular RMSE instead of my custom RMSE, same result.
- A very wide local minimum - I also tried using only the model architecture (pretrained=False) and again got the same result.
- A problem with this specific model - I tried another one (inception_v3), same result.
- Changing the initial learning rate - same result.
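Roughly what I mean by the overfitting check above (a sketch, not the exact code from my repo; the 5-image Subset and tiny_loader are illustrative, and I'm assuming the train_loader's underlying dataset can be indexed directly):

import torch
from torch.utils.data import DataLoader, Subset
import src.data.make_dataset as dataset_py
import Adjusted_models as models

tiny_set = Subset(dataset_py.train_loader.dataset, list(range(5)))  # fixed 5-image subset
tiny_loader = DataLoader(tiny_set, batch_size=5, shuffle=False)

model = models.efficientb4
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for step in range(300):
    for images, labels in tiny_loader:
        optimizer.zero_grad()
        loss = torch.sqrt(torch.mean((model(images) - labels).nan_to_num() ** 2))
        loss.backward()
        optimizer.step()
# resnet50 drives this loss below ~0.05; efficientb4 and inception_v3 stall around 3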