tensorflow2.0 - 微调 TFBertForMaskedLM model.fit() ValueError

Question

问题

我一直在尝试用 tensorflow 训练 TFBertForMaskedLM 模型。但是当我使用 model.fit() 时总是会遇到一些问题。希望有人可以提供帮助并提出一些解决方案。

参考论文和样本输出

论文题目是“Conditional Bert for Contextual Augmentation”。简而言之，只需将 type_token_ids 更改为 label_ids 即可。如果句子的标签为5，长度为10，max_sequence_length = 16。它将处理输出如下：

input_ids = [101, 523, 791, 3189, 677, 5221, 524, 1920, 686, 102, 0, 0, 0, 0, 0, 0]
attention_mask = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
token_type_ids = [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 0, 0, 0, 0, 0, 0]
labels = [-100, -100, 791, -100, 677, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]

环境

张量流 == 2.2.0
拥抱脸 == 3.5.0
数据集 == 1.1.2
数据集总标签为 5。（1~5）
显卡：GCP P100 * 1

数据集输出（max_sequence_length=128，batch_size=1）

{'attention_mask': <tf.Tensor: shape=(128,), dtype=int32, numpy=
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>,
 'input_ids': <tf.Tensor: shape=(128,), dtype=int32, numpy=
 array([  101,   523,   791,  3189,   677,  5221,   524,  1920,   686,
         4518,  6240,   103,  2466,  2204,  2695,   100,   519,  5064,
         1918,   736,  2336,   520,   103,  2695,  1564,  4923,  8013,
          678,  6734,  8038,  8532,   131,   120,   120,  8373,   119,
          103,  9989,   103,  8450,   120,   103,   120, 12990,  8921,
         8165,   102,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0], dtype=int32)>,
 'labels': <tf.Tensor: shape=(128,), dtype=int32, numpy=
 array([-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        4634, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        4158, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, 8429, -100,  119, -100, -100,  100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100], dtype=int32)>,
 'token_type_ids': <tf.Tensor: shape=(128,), dtype=int32, numpy=
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>}

型号代码

from transformers import AdamWeightDecay, TFBertForMaskedLM, BertConfig

def create_model():
    configuration = BertConfig.from_pretrained('bert-base-chinese')
    model = TFBertForMaskedLM.from_pretrained('bert-base-chinese',
                                              config=configuration)
    model.bert.embeddings.token_type_embeddings = tf.keras.layers.Embedding(5, 768, 
                                                                            embeddings_initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02))
    return model
model = create_model()

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.Mean(), tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]

model.compile(optimizer = optimizer,
              loss = loss,
              metrics = metrics)

model.fit(tf_sms_dataset, 
          epochs=1,
          verbose=1)

使用 TFBertForMaskedLM 时的警告消息

Some layers from the model checkpoint at bert-base-chinese were not used when initializing TFBertForMaskedLM: ['nsp___cls']
- This IS expected if you are initializing TFBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForMaskedLM were initialized from the model checkpoint at bert-base-chinese.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.

错误信息

ValueError                                Traceback (most recent call last)
<ipython-input-42-99b78906fef7> in <module>()
      5 model.fit(tf_sms_dataset, 
      6           epochs=1,
----> 7           verbose=1)

10 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in _method_wrapper(self, *args, **kwargs)
     64   def _method_wrapper(self, *args, **kwargs):
     65     if not self._in_multi_worker_mode():  # pylint: disable=protected-access
---> 66       return method(self, *args, **kwargs)
     67 
     68     # Running inside `run_distribute_coordinator` already.

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
    846                 batch_size=batch_size):
    847               callbacks.on_train_batch_begin(step)
--> 848               tmp_logs = train_function(iterator)
    849               # Catch OutOfRangeError for Datasets of unknown size.
    850               # This blocks until the batch has finished executing.

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py in __call__(self, *args, **kwds)
    578         xla_context.Exit()
    579     else:
--> 580       result = self._call(*args, **kwds)
    581 
    582     if tracing_count == self._get_tracing_count():

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py in _call(self, *args, **kwds)
    625       # This is the first call of __call__, so we have to initialize.
    626       initializers = []
--> 627       self._initialize(args, kwds, add_initializers_to=initializers)
    628     finally:
    629       # At this point we know that the initialization is complete (or less

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py in _initialize(self, args, kwds, add_initializers_to)
    504     self._concrete_stateful_fn = (
    505         self._stateful_fn._get_concrete_function_internal_garbage_collected(  # pylint: disable=protected-access
--> 506             *args, **kwds))
    507 
    508     def invalid_creator_scope(*unused_args, **unused_kwds):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in _get_concrete_function_internal_garbage_collected(self, *args, **kwargs)
   2444       args, kwargs = None, None
   2445     with self._lock:
-> 2446       graph_function, _, _ = self._maybe_define_function(args, kwargs)
   2447     return graph_function
   2448 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in _maybe_define_function(self, args, kwargs)
   2775 
   2776       self._function_cache.missed.add(call_context_key)
-> 2777       graph_function = self._create_graph_function(args, kwargs)
   2778       self._function_cache.primary[cache_key] = graph_function
   2779       return graph_function, args, kwargs

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in _create_graph_function(self, args, kwargs, override_flat_arg_shapes)
   2665             arg_names=arg_names,
   2666             override_flat_arg_shapes=override_flat_arg_shapes,
-> 2667             capture_by_value=self._capture_by_value),
   2668         self._function_attributes,
   2669         # Tell the ConcreteFunction to clean up its graph once it goes out of

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py in func_graph_from_py_func(name, python_func, args, kwargs, signature, func_graph, autograph, autograph_options, add_control_dependencies, arg_names, op_return_value, collections, capture_by_value, override_flat_arg_shapes)
    979         _, original_func = tf_decorator.unwrap(python_func)
    980 
--> 981       func_outputs = python_func(*func_args, **func_kwargs)
    982 
    983       # invariant: `func_outputs` contains only Tensors, CompositeTensors,

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py in wrapped_fn(*args, **kwds)
    439         # __wrapped__ allows AutoGraph to swap in a converted function. We give
    440         # the function a weak reference to itself to avoid a reference cycle.
--> 441         return weak_wrapped_fn().__wrapped__(*args, **kwds)
    442     weak_wrapped_fn = weakref.ref(wrapped_fn)
    443 
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py in wrapper(*args, **kwargs)
    966           except Exception as e:  # pylint:disable=broad-except
    967             if hasattr(e, "ag_error_metadata"):
--> 968               raise e.ag_error_metadata.to_exception(e)
    969             else:
    970               raise

ValueError: in user code:

    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:571 train_function  *
        outputs = self.distribute_strategy.run(
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:951 run  **
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2290 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2649 _call_for_each_replica
        return fn(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:541 train_step  **
        self.trainable_variables)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:1804 _minimize
        trainable_variables))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:521 _aggregate_gradients
        filtered_grads_and_vars = _filter_grads(grads_and_vars)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:1219 _filter_grads
        ([v.name for _, v in grads_and_vars],))

    ValueError: No gradients provided for any variable: ['tf_bert_for_masked_lm_2/bert/embeddings/word_embeddings/weight:0', 'tf_bert_for_masked_lm_2/bert/embeddings/position_embeddings/embeddings:0', 'tf_bert_for_masked_lm_2/bert/embeddings/LayerNorm/gamma:0', 'tf_bert_for_masked_lm_2/bert/embeddings/LayerNorm/beta:0', 'tf_bert_for_masked_lm_2/bert/embeddings/embedding_1/embeddings:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/self/query/kernel:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/self/query/bias:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/self/key/kernel:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/self/key/bias:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/self/value/kernel:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/self/value/bias:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/output/dense/kernel:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/output/dense/bias:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/output/LayerNorm/gamma:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/output/LayerNorm/beta:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/intermediate/dense/kernel:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/intermediate/dense/bias:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/output/dense/kernel:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/output/dense/bias:0', 'tf_bert_f...

有谁能帮忙。我会非常感谢。

其他测试

我用英语句子来测试。示例如下：

from transformers import TFBertForMaskedLM, BertConfig

def create_model():
    configuration = BertConfig.from_pretrained('bert-base-uncased')
    model = TFBertForMaskedLM.from_pretrained('bert-base-uncased',
                                              config=configuration)
    return model
    
model = create_model()
eng_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
token_info = eng_tokenizer(text="We are very happy to show you the  Transformers library.", padding='max_length', max_length=20)

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.Mean(), tf.keras.metrics.SparseCategoricalAccuracy("acc")]

dataset = tf.data.Dataset.from_tensor_slices(dict(token_info))
dataset = dataset.batch(1).prefetch(tf.data.experimental.AUTOTUNE)

model.compile(optimizer = optimizer,
              loss = model.compute_loss,
              metrics = metrics)

model.fit(dataset)

token_info 输出数据集

{
  'input_ids': [101, 2057, 2024, 2200, 103, 2000, 2265, 2017, 103, 100, 19081, 3075, 1012, 102, 0, 0, 0, 0, 0, 0]
  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  'labels': [-100, -100, -100, -100, 3407, -100, -100, -100, 1996, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]
}

得到同样的错误......

ValueError: in user code:

    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:571 train_function  *
        outputs = self.distribute_strategy.run(
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:951 run  **
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2290 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2649 _call_for_each_replica
        return fn(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:541 train_step  **
        self.trainable_variables)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:1804 _minimize
        trainable_variables))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:521 _aggregate_gradients
        filtered_grads_and_vars = _filter_grads(grads_and_vars)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:1219 _filter_grads
        ([v.name for _, v in grads_and_vars],))

    ValueError: No gradients provided for any variable: ['tf_bert_for_masked_lm_2/bert/embeddings/word_embeddings/weight:0', 'tf_bert_for_masked_lm_2/bert/embeddings/position_embeddings/embeddings:0', 'tf_bert_for_masked_lm_2/bert/embeddings/token_type_embeddings/embeddings:0', 'tf_bert_for_masked_lm_2/bert/embeddings/LayerNorm/gamma:0', 'tf_bert_for_masked_lm_2/bert/embeddings/LayerNorm/beta:0',

我不确定将 fit() 集成到模型中是否存在问题？