LiteRT now supports converting weights to 8-bit precision as part of model conversion from TensorFlow GraphDefs to LiteRT's flatbuffer format. Dynamic range quantization achieves a roughly 4x reduction in model size, because 32-bit floating-point weights are stored as 8-bit integers. In addition, LiteRT supports on-the-fly quantization and dequantization of activations to allow for:
- Using quantized kernels for faster execution when available.
- Mixing floating-point kernels with quantized kernels for different parts of the graph.
The activations are always stored in floating point. For ops that
support quantized kernels, the activations are quantized to 8 bits of precision
dynamically prior to processing and are de-quantized to float precision after
processing. Depending on the model being converted, this can give a speedup over
pure floating point computation.
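The quantize, compute, de-quantize cycle described above can be illustrated with a small pure-Python sketch. It assumes a simplified symmetric scheme with one scale per tensor; the helper names `quantize_symmetric` and `dynamic_range_dot` are hypothetical, and LiteRT's actual kernels may use a different scale granularity or zero points.

```python
def quantize_symmetric(values):
    """Map floats to int8 codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dynamic_range_dot(weight_codes, weight_scale, activations):
    """Dot product with int8 weights and dynamically quantized activations."""
    # Activations arrive as floats and are quantized at inference time.
    act_codes, act_scale = quantize_symmetric(activations)
    # Integer multiply-accumulate, as a quantized kernel would perform.
    acc = sum(w * a for w, a in zip(weight_codes, act_codes))
    # De-quantize the integer accumulator back to float.
    return acc * weight_scale * act_scale


weights = [0.42, -1.30, 0.07, 0.88]
activations = [1.5, 0.25, -2.0, 0.6]

# Weights are quantized once, at conversion time.
weight_codes, weight_scale = quantize_symmetric(weights)

float_result = sum(w * a for w, a in zip(weights, activations))
quant_result = dynamic_range_dot(weight_codes, weight_scale, activations)
print(f"float: {float_result:.4f}  quantized: {quant_result:.4f}")
```

The two results differ slightly, which is the quantization error that the accuracy check later in this tutorial is meant to catch.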
In contrast to quantization aware training, the weights are quantized post training and the activations are quantized dynamically at inference in this method.
Therefore, the model weights are not retrained to compensate for quantization
induced errors. It is important to check the accuracy of the quantized model to
ensure that the degradation is acceptable.
This tutorial trains an MNIST model from scratch, checks its accuracy in
TensorFlow, and then converts the model into a LiteRT flatbuffer
with dynamic range quantization. Finally, it checks the
accuracy of the converted model and compares it to the original float model.
import tensorflow as tf
from tensorflow import keras

# Load MNIST dataset
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Normalize the input image so that each pixel value is between 0 and 1.
train_images = train_images / 255.0
test_images = test_images / 255.0

# Define the model architecture
model = keras.Sequential([
  keras.layers.InputLayer(input_shape=(28, 28)),
  keras.layers.Reshape(target_shape=(28, 28, 1)),
  keras.layers.Conv2D(filters=12, kernel_size=(3, 3), activation=tf.nn.relu),
  keras.layers.MaxPooling2D(pool_size=(2, 2)),
  keras.layers.Flatten(),
  keras.layers.Dense(10)
])

# Train the digit classification model
model.compile(optimizer='adam',
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(
  train_images,
  train_labels,
  epochs=1,
  validation_data=(test_images, test_labels)
)
Because the model was trained for just a single epoch in this example, it only reaches about 96% accuracy.
Convert to a LiteRT model
Using the LiteRT Converter, you can now convert the trained model into a LiteRT model.
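As a minimal sketch of this step, the conversion might look like the following. It uses `tf.lite.TFLiteConverter.from_keras_model` and `tf.lite.Optimize.DEFAULT` from TensorFlow; the helper names `convert_keras_model` and `save_float_and_quantized` and the output file names are hypothetical, chosen only for illustration.

```python
import pathlib


def convert_keras_model(model, quantize=False):
    """Return the serialized flatbuffer (bytes) for a trained Keras model."""
    # Imported here so the helper can be defined before TensorFlow is loaded.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    if quantize:
        # With no representative dataset supplied, Optimize.DEFAULT applies
        # dynamic range quantization: weights are stored as 8-bit integers.
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
    return converter.convert()


def save_float_and_quantized(model, out_dir):
    """Write both model variants to out_dir and return their paths."""
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical file names, used only for this sketch.
    float_path = out_dir / "mnist_model.tflite"
    quant_path = out_dir / "mnist_model_quant.tflite"
    float_path.write_bytes(convert_keras_model(model))
    quant_path.write_bytes(convert_keras_model(model, quantize=True))
    return float_path, quant_path
```

Calling `save_float_and_quantized(model, "/tmp/mnist_tflite_models")` with the model trained above would produce a float flatbuffer and a dynamic range quantized one side by side, so their sizes and accuracy can be compared.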
Repeat the evaluation on the dynamic range quantized model to obtain:
print(evaluate_model(interpreter_quant))
In this example, the compressed model shows no loss in accuracy.
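The `evaluate_model` helper and the `interpreter_quant` object used above are defined elsewhere in the tutorial. As a rough sketch of what such an accuracy check might look like, the hypothetical `evaluate_interpreter` below runs each test image through an interpreter and counts correct top-1 predictions. It assumes the model has a single float32 input and a single output of per-class scores, and it relies only on the interpreter methods `get_input_details`, `get_output_details`, `set_tensor`, `invoke`, and `get_tensor`.

```python
import numpy as np


def evaluate_interpreter(interpreter, images, labels):
    """Return top-1 accuracy of an allocated interpreter over the images."""
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]

    correct = 0
    for image, label in zip(images, labels):
        # Add a batch dimension and cast to the float32 input format.
        batch = np.expand_dims(image, axis=0).astype(np.float32)
        interpreter.set_tensor(input_index, batch)
        interpreter.invoke()
        # The output holds one row of scores; pick the highest-scoring class.
        scores = interpreter.get_tensor(output_index)[0]
        correct += int(np.argmax(scores) == label)
    return correct / len(labels)
```

With TensorFlow installed, an interpreter could be created with `tf.lite.Interpreter(model_path=...)`, prepared with `allocate_tensors()`, and then passed in as `evaluate_interpreter(interpreter, test_images, test_labels)`.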
Optimizing an existing model
ResNets with pre-activation layers (ResNet-v2) are widely used for vision applications.
A pre-trained frozen graph for ResNet-v2-101 is available on Kaggle Models.
You can convert the frozen graph to a LiteRT flatbuffer with quantization as follows:
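# `converter` is assumed to be a tf.lite.TFLiteConverter already created
# for the ResNet-v2-101 model (its creation is not shown here).
# Convert to LiteRT without quantization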
resnet_tflite_file = tflite_models_dir/"resnet_v2_101.tflite"
resnet_tflite_file.write_bytes(converter.convert())
# Convert to LiteRT with quantization
converter.optimizations = [tf.lite.Optimize.DEFAULT]
resnet_quantized_tflite_file = tflite_models_dir/"resnet_v2_101_quantized.tflite"
resnet_quantized_tflite_file.write_bytes(converter.convert())
!ls -lh {tflite_models_dir}/*.tflite
The model size reduces from 171 MB to 43 MB.
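The same comparison can be made in Python rather than with `ls`. The sketch below assumes two model files already exist on disk; `compression_ratio` is a hypothetical helper, not part of LiteRT.

```python
import pathlib


def compression_ratio(float_path, quant_path):
    """Return the float model size divided by the quantized model size."""
    float_bytes = pathlib.Path(float_path).stat().st_size
    quant_bytes = pathlib.Path(quant_path).stat().st_size
    return float_bytes / quant_bytes


# With the sizes reported above, the ratio is close to the expected 4x.
print(round(171 / 43, 2))  # 3.98
```

A ratio slightly below 4x is plausible, since parts of the file other than the quantized weights are likely stored unchanged.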
The accuracy of this model on ImageNet can be evaluated using the scripts provided for TFLite accuracy measurement.
The optimized model's top-1 accuracy is 76.8%, the same as the floating-point model.
Last updated 2026-05-28 UTC.