python - Error during training in DeepSpeech: Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]

Question

I get the error below when I try to run the following:

%cd /content/DeepSpeech
!python3 DeepSpeech.py --train_cudnn True --early_stop True --es_epochs 6 --n_hidden 2048 --epochs 20 \
  --export_dir /content/models/ --checkpoint_dir /content/model_checkpoints/ \
  --train_files /content/train.csv --dev_files /content/dev.csv --test_files /content/test.csv \
  --learning_rate 0.0001 --train_batch_size 64 --test_batch_size 32 --dev_batch_size 32 --export_file_name 'ft_model' \
   --augment reverb[p=0.2,delay=50.0~30.0,decay=10.0:2.0~1.0] \
   --augment volume[p=0.2,dbfs=-10:-40] \
   --augment pitch[p=0.2,pitch=1~0.2] \
   --augment tempo[p=0.2,factor=1~0.5]

tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0, [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 798, 64, 2048]
     [[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]]
     [[tower_0/gradients/tower_0/BiasAdd_2_grad/BiasAddGrad/_87]]
  (1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0, [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 798, 64, 2048]
     [[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]]
0 successful operations.
0 derived errors ignored.

score 0 · Accepted Answer

If I run it as follows, with the augmentation flags commented out, it works fine.

%cd /content/DeepSpeech
!python3 DeepSpeech.py --train_cudnn True --early_stop True --es_epochs 6 --n_hidden 2048 --epochs 20 \
  --export_dir /content/models/ --checkpoint_dir /content/model_checkpoints/ \
  --train_files /content/train.csv --dev_files /content/dev.csv --test_files /content/test.csv \
  --learning_rate 0.0001 --train_batch_size 64 --test_batch_size 32 --dev_batch_size 32 --export_file_name 'ft_model' \
  # --augment reverb[p=0.2,delay=50.0~30.0,decay=10.0:2.0~1.0] \
  # --augment volume[p=0.2,dbfs=-10:-40] \
  # --augment pitch[p=0.2,pitch=1~0.2] \
  # --augment tempo[p=0.2,factor=1~0.5]

So basically, the augmentation is doing something that breaks the training.

score 0 · Accepted Answer

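My best guess is that TensorFlow is running out of memory. In both cases the train, dev, and test batch sizes are quite large, but augmentation needs additional memory on top of that. Try lowering the batch sizes and see whether training starts; if it does, increase them gradually.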
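
The batch-size suggestion above could be automated roughly as follows. This is only a sketch, not part of DeepSpeech: `build_command`, `find_working_batch_size`, and `run_training` are hypothetical helper names, the flags and augmentation specs are copied from the question, and it assumes a failed run exits with a non-zero return code. It halves the batch size until a run succeeds, rather than increasing it step by step.

```python
import subprocess

# Augmentation specs copied verbatim from the question.
AUGMENTATIONS = [
    "reverb[p=0.2,delay=50.0~30.0,decay=10.0:2.0~1.0]",
    "volume[p=0.2,dbfs=-10:-40]",
    "pitch[p=0.2,pitch=1~0.2]",
    "tempo[p=0.2,factor=1~0.5]",
]


def build_command(batch_size):
    """Rebuild the DeepSpeech.py call from the question with a given train batch size."""
    # Dev/test batches were 32 in the question; keep them no larger than the train batch.
    eval_batch = str(min(32, batch_size))
    cmd = [
        "python3", "DeepSpeech.py",
        "--train_cudnn", "True",
        "--early_stop", "True",
        "--es_epochs", "6",
        "--n_hidden", "2048",
        "--epochs", "20",
        "--export_dir", "/content/models/",
        "--checkpoint_dir", "/content/model_checkpoints/",
        "--train_files", "/content/train.csv",
        "--dev_files", "/content/dev.csv",
        "--test_files", "/content/test.csv",
        "--learning_rate", "0.0001",
        "--train_batch_size", str(batch_size),
        "--dev_batch_size", eval_batch,
        "--test_batch_size", eval_batch,
        "--export_file_name", "ft_model",
    ]
    for spec in AUGMENTATIONS:
        cmd += ["--augment", spec]
    return cmd


def find_working_batch_size(start, run_training, minimum=1):
    """Halve the batch size until run_training(batch_size) reports success."""
    batch_size = start
    while batch_size >= minimum:
        if run_training(batch_size):
            return batch_size
        batch_size //= 2
    return None  # nothing worked, so memory is probably not the only problem


def run_training(batch_size):
    """Hypothetical runner: treats a zero exit code as a successful training run."""
    result = subprocess.run(build_command(batch_size), cwd="/content/DeepSpeech")
    return result.returncode == 0


# Usage (launches real training runs, so it is left commented out):
# print(find_working_batch_size(64, run_training))
```

Passing the arguments as a list also avoids any shell interpretation of the brackets and `~` in the augmentation specs. Note that a successful attempt here is a complete training run, so the search returns only after that run finishes.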
