Optimization

Structured Pruning

Removing whole heads, neurons, or layers so the model stays dense and fast.

Definition

Structured pruning removes coherent groups of parameters: whole attention heads, feedforward neurons, embedding dimensions, or entire layers. Because the remaining weight matrices are smaller but still dense, the pruned model runs on standard matrix-multiplication hardware without sparse kernels. At a given accuracy it usually reaches lower sparsity than unstructured pruning, but its speedups are more reliable in practice, which makes it attractive for deployment.
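
As a rough illustration of the feedforward-neuron case, the sketch below prunes hidden neurons from a tiny two-layer block using plain Python lists. The names `prune_neurons` and `forward`, the toy weights, and the L2-norm scoring rule are all assumptions made for this example; real pipelines often score neurons by activations or gradients and fine-tune afterwards.

```python
import math


def prune_neurons(w1, b1, w2, keep):
    """Keep the `keep` hidden neurons with the largest incoming-weight L2 norm.

    w1: hidden x in (list of rows), b1: hidden biases, w2: out x hidden.
    Returns smaller dense (w1, b1, w2) and the indices that were kept.
    """
    # Score each hidden neuron by the L2 norm of its incoming weights
    # (an illustrative criterion, not the only option).
    scores = [math.sqrt(sum(x * x for x in row)) for row in w1]
    ranked = sorted(range(len(w1)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore original neuron order
    # Removing a neuron deletes one row of w1, one bias, and one column of w2.
    w1_p = [w1[i] for i in kept]
    b1_p = [b1[i] for i in kept]
    w2_p = [[row[i] for i in kept] for row in w2]
    return w1_p, b1_p, w2_p, kept


def forward(w1, b1, w2, x):
    """Dense two-layer block with a ReLU between the layers."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


# Toy weights: 2 inputs, 4 hidden neurons, 2 outputs.
w1 = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.7], [-0.03, 0.0]]
b1 = [0.1, 0.0, -0.2, 0.0]
w2 = [[1.0, 0.5, -1.0, 0.3], [0.2, -0.4, 0.6, 0.1]]

w1_p, b1_p, w2_p, kept = prune_neurons(w1, b1, w2, keep=2)

x = [1.0, 0.5]
dense_out = forward(w1, b1, w2, x)
pruned_out = forward(w1_p, b1_p, w2_p, x)
```

The pruned block goes through the same `forward` function as the original, only with a 2-wide hidden layer instead of a 4-wide one. That is the practical point of structured pruning: no masks or sparse formats are needed, just smaller dense matrices.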