Inference & Serving

Cold Start

The startup delay before a model or instance is loaded and ready to serve.

Definition

A cold start is the delay that occurs when capacity is not already running and a model must be loaded into the memory of a GPU (the chip that runs the model) before it can serve a request. For large models this can take seconds to minutes, since the weights — the model's learned numbers — must be read from storage and set up. Cold starts complicate autoscaling and scale-to-zero deployments, where idle capacity is removed to save cost.

Cold Start

Definition

Related terms