Optimization

Distillation

Training a small 'student' model to mimic a larger 'teacher' model.

Definition

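Knowledge distillation trains a smaller, cheaper 'student' model to reproduce the outputs of a larger 'teacher' model, typically its full output probability distribution rather than only its top prediction, and sometimes its intermediate representations as well. The student retains much of the teacher's capability at a fraction of the size and inference cost, which makes distillation a common way to ship capable models on limited hardware.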
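One common way to make the student "reproduce the outputs" of the teacher is to match temperature-softened probability distributions and blend that term with the ordinary loss on the true label. The sketch below illustrates that formulation for a single example in plain Python. The function names and the `temperature` and `alpha` parameters are illustrative assumptions, not part of any particular library.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-probabilities of softmax(logits / temperature), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum for z in scaled]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target loss (match the teacher) with a hard-label loss.

    temperature and alpha are hypothetical hyperparameters for this sketch:
    a higher temperature flattens both distributions, and alpha weights
    the teacher-matching term against the true-label term.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)

    # KL(teacher || student) over the softened distributions.
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student))

    # Scaling by temperature**2 keeps the soft term's gradient magnitude
    # comparable to the hard term as the temperature changes.
    soft_loss = kl * temperature ** 2

    # Standard cross-entropy against the true label at temperature 1.
    hard_loss = -log_softmax(student_logits)[label]

    return alpha * soft_loss + (1 - alpha) * hard_loss

if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]
    student = [3.0, 2.0, 1.0]
    print(distillation_loss(student, teacher, label=2))
```

In a real training loop the same quantity would be computed over batches with an autodiff framework and only the student's parameters would be updated; the teacher stays frozen.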