Inference & Serving

Autoscaling

Automatically adding or removing serving capacity as demand rises and falls.

Definition

Autoscaling automatically adjusts the number of running copies of a model in response to load, adding capacity when traffic spikes and removing it when demand drops. It keeps response times stable under bursty usage while avoiding the cost of running idle hardware. Because loading large models is slow, autoscaling for live serving must contend with cold-start delays: the wait while fresh capacity loads the model before it can serve requests.
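As a rough illustration of the scaling decision, the sketch below applies a simple proportional rule: size the fleet so that each copy carries roughly a target load. The function name, parameters, and default limits are hypothetical choices for this example, not taken from any particular serving system; the rule itself is similar in spirit to the proportional formula used by common horizontal autoscalers.

```python
import math


def desired_replicas(
    current_replicas: int,
    load_per_replica: float,    # observed load per copy, e.g. in-flight requests (assumed metric)
    target_per_replica: float,  # load each copy should carry (assumed tuning value)
    min_replicas: int = 1,      # floor; keeping it above zero keeps a warm copy loaded
    max_replicas: int = 10,     # ceiling; caps hardware cost
) -> int:
    """Return the replica count that brings per-replica load near the target."""
    if current_replicas <= 0 or target_per_replica <= 0:
        raise ValueError("current_replicas and target_per_replica must be positive")
    # Scale the fleet in proportion to how far observed load is from the target.
    raw = math.ceil(current_replicas * load_per_replica / target_per_replica)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


# Traffic spike: 4 copies at 30 in-flight requests each, target 20 -> scale up.
print(desired_replicas(4, 30.0, 20.0))
# Demand drop: 4 copies at 5 in-flight requests each, target 20 -> scale down.
print(desired_replicas(4, 5.0, 20.0))
```

Setting the floor above zero is one way to soften cold starts: at least one copy stays loaded, so a burst after an idle period is served immediately while additional copies load the model.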
