All terms
Hardware & Systems
All-Reduce
A collective operation that combines values across many workers and shares the result with all.
Definition
All-reduce is a distributed operation that aggregates values held on separate workers — typically by summing them — and returns the combined result to every worker. It is the backbone of data-parallel training, where copies of the model run on separate devices: each device computes gradients (the adjustment signals that tell the model how to improve) on its own batch of data, and all-reduce averages them so every copy stays in sync. Its efficiency depends heavily on the interconnect, since every device must exchange data each step.