All terms
Architectures
Layer Normalization
A technique that rescales the activations within each example to keep their scale consistent.
Definition
Layer normalization computes the mean and variance across the features of each individual example, then uses them to rescale that example's activations. Because it works per sample rather than across a batch, it suits sequence models where batch normalization is impractical. Applied before or after the attention and feedforward sublayers, it stabilizes and speeds up training of very deep networks. Transformers rely on it, often a variant called RMSNorm.