Optimization

INT8

Representing weights and activations as 8-bit integers to shrink models and speed up inference.

Definition

INT8 quantization converts a model's numbers, normally stored as floating-point values (fine-grained decimals), into whole numbers from a fixed range of 256 values, using a scaling factor to map between the two. It cuts model size by roughly four times versus 32-bit floating point and enables faster math on CPUs and GPUs that support 8-bit integer arithmetic. INT8 is used mainly when running a trained model rather than during training. It can be applied after training (post-training quantization) or simulated during training (quantization-aware training); both are common ways to limit accuracy loss.
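The mapping described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not how any particular framework implements it: it assumes symmetric, per-tensor quantization (one scale for the whole array, no zero point), and the function names `quantize_int8` and `dequantize_int8` are made up for this example.

```python
import numpy as np

def quantize_int8(x):
    """Map float values to int8 using a single scale (symmetric, per-tensor)."""
    max_abs = float(np.max(np.abs(x)))
    # Choose the scale so the largest magnitude lands on 127; avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float values from the int8 codes."""
    return q.astype(np.float32) * scale

weights = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

print("int8 codes:", q)
print("restored:  ", restored)
print("bytes: float32 =", weights.nbytes, " int8 =", q.nbytes)
```

The int8 array takes one byte per value instead of four, which is where the roughly fourfold size reduction comes from. The restored values differ from the originals by at most about half a scale step; that rounding error is the accuracy loss that calibration and quantization-aware training aim to keep small.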