Optimization

Distillation

Training a small 'student' model to mimic a larger 'teacher' model.

Definition

Knowledge distillation trains a smaller, cheaper 'student' model to reproduce the outputs of a larger 'teacher' model, typically its output probability distributions and sometimes its intermediate representations. The student retains much of the teacher's capability at a fraction of the size and inference cost, which makes distillation a common way to deploy capable models on limited hardware.
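
As a rough illustration of what "reproducing the teacher's outputs" can mean in practice, the sketch below computes a soft-target distillation loss: the KL divergence between the teacher's and the student's temperature-softened output distributions, scaled by the squared temperature. This is one common formulation, not the only one. The function names, the logit values, and the temperature of 2.0 are assumptions chosen for illustration, and a real training loop would usually combine this term with the ordinary loss on ground-truth labels.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The temperature**2 factor keeps gradient magnitudes comparable
    across temperatures. All names here are illustrative.
    """
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Hypothetical logits for a 3-class problem.
teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 4.0, 1.0]    # prefers a different class

loss_close = distillation_loss(close_student, teacher_logits)
loss_far = distillation_loss(far_student, teacher_logits)
print(f"close student loss: {loss_close:.4f}")
print(f"far student loss:   {loss_far:.4f}")
```

The loss is zero when the student's softened distribution matches the teacher's exactly and grows as the two diverge, so minimizing it pulls the student toward the teacher's full distribution over classes rather than only its top prediction.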

Related terms

Quantization, Fine-Tuning, LLM