Training
Expert Parallelism
Spreading the experts of a Mixture-of-Experts model across multiple devices.
Definition
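Expert parallelism distributes the expert sub-networks of a Mixture-of-Experts (MoE) model across devices, so each device holds the weights of only some of the experts. Expert weights stay resident on their assigned device; for each token, the router selects a small number of experts, and the token's activations are sent to the devices that hold them, typically through an all-to-all exchange. The expert outputs are then sent back and combined. This makes it possible to train and serve large sparse MoE models whose combined expert parameters would exceed the memory of a single device.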
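The routing step can be illustrated with a small, framework-free sketch. Everything here is an assumption made for illustration: the sizes (8 experts, 4 devices, top-2 routing), the contiguous expert placement, and the hypothetical helper names `expert_to_device`, `top_k_experts`, and `dispatch`. Real systems perform the grouping with a collective all-to-all operation rather than a Python dictionary.

```python
from collections import defaultdict

# Illustrative sizes only (assumptions, not taken from any particular model).
NUM_EXPERTS = 8
NUM_DEVICES = 4
TOP_K = 2


def expert_to_device(expert_id, num_experts=NUM_EXPERTS, num_devices=NUM_DEVICES):
    """Contiguous placement: experts 0-1 on device 0, 2-3 on device 1, and so on."""
    experts_per_device = num_experts // num_devices
    return expert_id // experts_per_device


def top_k_experts(router_scores, k=TOP_K):
    """Return the indices of the k highest router scores, best first."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda e: router_scores[e],
                    reverse=True)
    return ranked[:k]


def dispatch(all_router_scores):
    """Group (token_id, expert_id) pairs by the device that owns the expert.

    This grouping stands in for what an all-to-all exchange would deliver
    to each device in a real distributed setup.
    """
    per_device = defaultdict(list)
    for token_id, scores in enumerate(all_router_scores):
        for expert_id in top_k_experts(scores):
            per_device[expert_to_device(expert_id)].append((token_id, expert_id))
    return dict(per_device)


# One row of router scores per token, one column per expert.
scores = [
    [0.1, 0.9, 0.0, 0.0, 0.0, 0.0, 0.8, 0.0],  # token 0 picks experts 1 and 6
    [0.0, 0.0, 0.7, 0.6, 0.0, 0.0, 0.0, 0.0],  # token 1 picks experts 2 and 3
    [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4],  # token 2 picks experts 0 and 7
]

plan = dispatch(scores)
for device in range(NUM_DEVICES):
    print(f"device {device}: {plan.get(device, [])}")
```

In this toy batch, device 2 receives no tokens while devices 0 and 3 each receive two. Uneven routing of this kind is why MoE systems usually add some form of load balancing.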