Training

Expert Parallelism

Spreading the experts of a Mixture-of-Experts model across multiple devices.

Definition

Expert parallelism distributes the expert sub-networks of a Mixture-of-Experts (MoE) model across devices, so each device holds only a subset of the experts. The experts stay resident on their assigned devices; each token is sent to the devices hosting the experts its router selected, and only those experts are computed for it. This makes it practical to train and serve the largest sparse MoE models, which would otherwise exceed the memory of any single node.
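The routing described above can be illustrated with a small, framework-free sketch. It is a toy simulation under stated assumptions, not how any particular library implements expert parallelism: the function names (`assign_experts`, `top_k_experts`, `dispatch`), the contiguous block placement of experts, and the top-k selection rule are all hypothetical choices made for illustration. Real systems typically perform the grouping step with an all-to-all exchange between devices.

```python
from collections import defaultdict


def assign_experts(num_experts, num_devices):
    """Hypothetical placement: expert e lives on device e // experts_per_device."""
    if num_experts % num_devices != 0:
        raise ValueError("this sketch assumes experts divide evenly across devices")
    per_device = num_experts // num_devices
    return {e: e // per_device for e in range(num_experts)}


def top_k_experts(router_scores, k):
    """Indices of the k highest router scores (ties go to the lower index)."""
    order = sorted(range(len(router_scores)), key=lambda e: (-router_scores[e], e))
    return order[:k]


def dispatch(token_scores, placement, k):
    """Group (token, expert) pairs by the device that hosts the expert.

    Devices hosting no selected expert receive no work for this batch.
    """
    per_device = defaultdict(list)
    for token_id, scores in enumerate(token_scores):
        for expert in top_k_experts(scores, k):
            per_device[placement[expert]].append((token_id, expert))
    return dict(per_device)


# 8 experts spread over 4 devices: experts 0-1 on device 0, 2-3 on device 1, etc.
placement = assign_experts(num_experts=8, num_devices=4)

# One row of router scores per token (made-up values for illustration).
token_scores = [
    [0.1, 0.7, 0.0, 0.0, 0.0, 0.2, 0.0, 0.0],  # token 0 selects experts 1 and 5
    [0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.3, 0.6],  # token 1 selects experts 7 and 6
]

routes = dispatch(token_scores, placement, k=2)
print(routes)
```

In this example device 1 receives no tokens, which hints at why load balancing across experts matters in practice: an uneven router leaves some devices idle while others are overloaded.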
