Hardware & Systems

Distributed Inference

Running model inference across multiple devices or machines instead of just one.

Definition

Distributed inference runs a model's forward pass across several GPUs or machines rather than on a single device. It becomes necessary when a model's weights and activations do not fit in one accelerator's memory, or when traffic demands more throughput than one device can supply. Tensor parallelism splits the weight matrices inside each layer across devices, while pipeline parallelism assigns consecutive groups of layers to different devices. Both approaches require devices to exchange intermediate results during every forward pass, so the bandwidth and latency of the interconnect between them strongly affect overall latency and efficiency.
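
Example

The two ways of splitting a model can be illustrated with a small, single-process sketch. It only simulates devices with plain Python lists and is not how any particular serving framework implements them. The function names (`matmul`, `split_columns`, `tensor_parallel_matmul`, `pipeline_parallel_forward`) are hypothetical and chosen for illustration, and the layers are bare matrix products with no activation functions. The point is that both schemes produce the same result as running the model on one device.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)]
            for row in a]


def split_columns(weight, num_devices):
    """Cut a weight matrix into equal column shards, one per simulated device."""
    cols = len(weight[0])
    assert cols % num_devices == 0, "columns must divide evenly in this sketch"
    width = cols // num_devices
    return [[row[d * width:(d + 1) * width] for row in weight]
            for d in range(num_devices)]


def tensor_parallel_matmul(x, weight, num_devices):
    """Tensor parallelism (simulated): each device holds one column shard."""
    shards = split_columns(weight, num_devices)
    # Each partial product would run on its own device in a real system.
    partials = [matmul(x, shard) for shard in shards]
    # Gather step: join every device's slice of each output row.
    # This exchange is the traffic that crosses the interconnect.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def pipeline_parallel_forward(x, layers, num_stages):
    """Pipeline parallelism (simulated): consecutive layers grouped into stages."""
    per_stage = -(-len(layers) // num_stages)  # ceiling division
    stages = [layers[i:i + per_stage] for i in range(0, len(layers), per_stage)]
    for stage in stages:          # each stage would live on its own device
        for weight in stage:      # layers inside a stage run locally
            x = matmul(x, weight)
        # Here the activations would be sent to the next stage's device.
    return x


x = [[1, 2], [3, 4]]
weight = [[1, 0, 2, 1], [0, 1, 1, 3]]
layers = [[[1, 1], [0, 1]], [[2, 0], [0, 2]], [[1, 0], [1, 1]], [[0, 1], [1, 0]]]

# Single-device reference results for comparison.
single_device_layer = matmul(x, weight)
single_device_model = x
for w in layers:
    single_device_model = matmul(single_device_model, w)

assert tensor_parallel_matmul(x, weight, num_devices=2) == single_device_layer
assert pipeline_parallel_forward(x, layers, num_stages=2) == single_device_model
```

In this sketch the gather step in `tensor_parallel_matmul` and the hand-off between stages in `pipeline_parallel_forward` mark the points where a real deployment would move data between devices, which is why interconnect speed matters.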