Autoscaling
Automatically adding or removing serving capacity as demand rises and falls.
Definition
Autoscaling automatically adjusts the number of running copies (replicas) of a model in response to load, adding capacity when traffic spikes and removing it when demand drops. It keeps response times stable under bursty usage while avoiding the cost of running idle hardware. Because loading large models is slow, autoscaling for live serving must contend with cold-start delays: the wait while fresh capacity loads the model before it can serve requests.
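To make the scaling decision concrete, here is a minimal sketch of one common approach, a proportional rule similar to the one documented for the Kubernetes Horizontal Pod Autoscaler. The function name, the per-replica load metric, and the default limits are illustrative assumptions, not part of any particular serving system.

```python
import math


def desired_replicas(current_replicas, observed_load_per_replica,
                     target_load_per_replica,
                     min_replicas=1, max_replicas=10):
    """Return how many model replicas to run (hypothetical helper).

    Load could be requests in flight, queue depth, or accelerator
    utilization per replica; the metric choice is an assumption.
    """
    # Scale the replica count in proportion to how far the observed
    # load is from the target, rounding up so capacity is not short.
    raw = math.ceil(
        current_replicas * observed_load_per_replica / target_load_per_replica
    )
    # Clamp to configured bounds. Keeping min_replicas >= 1 means at
    # least one copy stays loaded, so a burst after an idle period does
    # not start with every request waiting on a cold start.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Traffic spike: 2 replicas each seeing 3x the target load.
    print(desired_replicas(2, 30, 10))
    # Demand drop: 4 replicas each seeing a fifth of the target load.
    print(desired_replicas(4, 2, 10))
```

In practice a rule like this is usually paired with a cooldown or stabilization window so that replicas are not removed and then reloaded moments later, which would pay the cold-start cost repeatedly.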