This example will work through fine-tuning a BERT model using the Orbit training library.
Orbit is a flexible, lightweight library designed to make it easy to write custom training loops in TensorFlow. Orbit handles common model training tasks such as saving checkpoints, running model evaluations, and setting up summary writing, while giving users full control over implementing the inner training loop. It integrates with tf.distribute and supports running on different device types (CPU, GPU, and TPU).
Most examples on tensorflow.org use custom training loops or model.fit() from Keras. Orbit is a good alternative to model.fit if your model is complex and your training loop requires more flexibility, control, or customization. Also, using Orbit can simplify the code when there are many different model architectures that all use the same custom training loop.
This tutorial focuses on setting up and using Orbit, rather than details about BERT, model construction, and data processing. For more in-depth tutorials on these topics, refer to the following tutorials:
Fine tune BERT - which goes into detail on these sub-topics.
The tf-models-official package contains both the orbit and tensorflow_models modules.
importtensorflow_modelsastfmimportorbit
2023-10-17 11:55:57.421119: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-10-17 11:55:57.421164: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-10-17 11:55:57.421203: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Setup for training
This tutorial does not focus on configuring the environment, building the model and optimizer, and loading data. All these techniques are covered in more detail in the Fine tune BERT and Fine tune BERT with GLUE tutorials.
To view how the training is set up for this tutorial, expand the rest of this section.
While tf.distribute won't help the model's runtime if you're running on a single machine or GPU, it's necessary for TPUs. Setting up a distribution strategy allows you to use the same code regardless of the configuration.
2023-10-17 11:56:02.076511: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2211] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
For more information about the TPU setup, refer to the TPU guide.
When using Orbit, the orbit.Controller class drives the training. The Controller handles the details of distribution strategies, step counting, TensorBoard summaries, and checkpointing.
To implement the training and evaluation, pass a trainer and evaluator, which are subclass instances of orbit.AbstractTrainer and orbit.AbstractEvaluator. Keeping with Orbit's light-weight design, these two classes have a minimal interface.
The Controller drives training and evaluation by calling trainer.train(num_steps) and evaluator.evaluate(num_steps). These train and evaluate methods return a dictionary of results for logging.
Training is broken into chunks of length num_steps. This is set by the Controller's steps_per_loop argument. With the trainer and evaluator abstract base classes, the meaning of num_steps is entirely determined by the implementer.
Some common examples include:
Having the chunks represent dataset-epoch boundaries, like the default keras setup.
Using it to more efficiently dispatch a number of training steps to an accelerator with a single tf.function call (like the steps_per_execution argument to Model.compile).
With StandardTrainer, you only need to set train_loop_begin, train_step, and train_loop_end. The base class handles the loops, dataset logic, and tf.function (according to the options set by their orbit.StandardTrainerOptions). This is simpler than orbit.AbstractTrainer, which requires you to handle the entire loop. StandardEvaluator has a similar structure and simplification to StandardTrainer.
This is effectively an implementation of the steps_per_execution approach used by Keras.
Contrast this with Keras, where training is divided both into epochs (a single pass over the dataset) and steps_per_execution(set within Model.compile. In Keras, metric averages are typically accumulated over an epoch, and reported & reset between epochs. For efficiency, steps_per_execution only controls the number of training steps made per call.
In this simple case, steps_per_loop (within StandardTrainer) will handle both the metric resets and the number of steps per call.
The minimal setup when using these base classes is to implement the methods as follows:
Depending on the settings, the base class may wrap the train_step and eval_step code in tf.function or tf.while_loop, which has some limitations compared to standard python.
The train_step is a straight-forward loss-calculation and gradient update that is run by the distribution strategy. This is accomplished by defining the gradient step as a nested function (step_fn).
deftrain_step(self,iterator):defstep_fn(inputs):labels=inputs.pop("label_ids")withtf.GradientTape()astape:model_outputs=self.model(inputs,training=True)# Raw loss is used for reporting in metrics/logs.raw_loss=loss_fn(labels,model_outputs)# Scales down the loss for gradients to be invariant from replicas.loss=raw_loss/self.strategy.num_replicas_in_syncgrads=tape.gradient(loss,self.model.trainable_variables)optimizer.apply_gradients(zip(grads,self.model.trainable_variables))# For reporting, the metric takes the mean of losses.self.train_loss.update_state(raw_loss)self.strategy.run(step_fn,args=(next(iterator),))
The evaluator is even simpler for this task. It needs access to the evaluation dataset, the model, and the strategy. After saving references to those objects, the constructor just needs to create the metrics.
The eval_step method works like train_step. The inner step_fn defines the actual work of calculating the loss & accuracy and updating the metrics. The outer eval_step receives tf.distribute.DistributedIterator as input, and uses Strategy.run to launch the distributed execution to step_fn, feeding it from the distributed iterator.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Missing the information I need","missingTheInformationINeed","thumb-down"],["Too complicated / too many steps","tooComplicatedTooManySteps","thumb-down"],["Out of date","outOfDate","thumb-down"],["Samples / code issue","samplesCodeIssue","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2023-10-17 UTC."],[],[]]