Layer Normalization vs Batch Normalization

Limitations of Batch Normalization

Batch Normalization (BN) improves training efficiency by normalizing activations and then re-scaling and re-shifting them with learnable parameters. However, it struggles with small batch sizes: because it normalizes each feature/channel across the batch dimension, smaller batches yield noisy, inaccurate estimates of the mean and variance. This is particularly problematic for LLMs, which often must be trained with small mini-batches due to memory constraints.
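To make the effect of batch size on BN's statistics concrete, here is a minimal NumPy sketch (the feature count, batch sizes, and synthetic distribution are illustrative assumptions, not from the original text). It estimates the per-feature mean and variance from batches of different sizes and reports how far they drift from the true values:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "activations" for 4 features; the true per-feature mean is 2.0 and
# variance is 9.0, so we can see how far each batch estimate drifts.
population = rng.normal(loc=2.0, scale=3.0, size=(100_000, 4))

for batch_size in (2, 8, 256):
    idx = rng.choice(len(population), size=batch_size, replace=False)
    batch = population[idx]
    mu = batch.mean(axis=0)   # BN statistic: mean of each feature across the batch
    var = batch.var(axis=0)   # BN statistic: variance of each feature across the batch
    print(f"batch={batch_size:4d}  mean error={np.abs(mu - 2.0).mean():.3f}  "
          f"var error={np.abs(var - 9.0).mean():.3f}")
```

With very small batches the estimates are typically far from the true statistics, which is exactly the regime that memory-constrained LLM training tends to operate in.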

Layer Normalization

For architectures such as recurrent networks and transformers, we apply Layer Normalization instead. The layer normalization of an input vector $x \in \mathbb{R}^d$ is computed as:

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where the mean $\mu$ and variance $\sigma^2$ are:

$$\mu = \frac{1}{d} \sum_{i=1}^{d} x_i, \quad \sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$$

Here:
  • $\gamma$ and $\beta$ are learnable parameters (each of shape $d$)
  • $\epsilon$ is a small constant for numerical stability
  • $\odot$ denotes element-wise multiplication
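As a concrete companion to the formula above, the following is a minimal NumPy sketch of layer normalization for a single vector (the function name layer_norm, the toy dimension d=8, and the identity-initialized $\gamma$ and $\beta$ are illustrative choices):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize a vector x in R^d across its features, then scale and shift."""
    mu = x.mean()                        # mean over the d features
    var = x.var()                        # variance over the d features (1/d, as in the formula)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta          # element-wise scale and shift

d = 8
x = np.random.default_rng(1).normal(size=d)
out = layer_norm(x, gamma=np.ones(d), beta=np.zeros(d))
print(out.mean(), out.var())             # ~0 and ~1 when gamma=1, beta=0
```

With $\gamma = 1$ and $\beta = 0$ the output has approximately zero mean and unit variance across its $d$ features; the learnable parameters then rescale and shift that normalized vector.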

Key Difference from Batch Normalization

Layer Normalization computes statistics across the feature dimension for each sample independently, then applies the learnable scale and shift. In this sense it is effectively the transpose of Batch Normalization (see the sketch after the table below):
Aspect                | Batch Normalization  | Layer Normalization
Normalizes across     | Batch dimension      | Feature dimension
Depends on            | Batch size           | Independent of batch size
Best for              | CNNs, large batches  | RNNs, Transformers, small batches
Training vs inference | Different behavior   | Same behavior
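To make the "transpose" relationship in the table concrete, here is a minimal NumPy sketch that applies the same normalization along the two different axes (learnable parameters and BN's running statistics are omitted; the array shape is an illustrative assumption):

```python
import numpy as np

# Toy activations: 32 samples, 16 features.
x = np.random.default_rng(2).normal(size=(32, 16))
eps = 1e-5

# Batch Normalization: statistics per feature, computed across the batch axis (axis=0).
bn = (x - x.mean(axis=0, keepdims=True)) / np.sqrt(x.var(axis=0, keepdims=True) + eps)

# Layer Normalization: statistics per sample, computed across the feature axis (axis=1).
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

print(bn.mean(axis=0).round(6))   # each feature has ~zero mean across the batch
print(ln.mean(axis=1).round(6))   # each sample has ~zero mean across its features
```

The only difference between the two normalizations is the axis over which the statistics are computed, which is why Layer Normalization is unaffected by batch size.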

When to Use Layer Normalization

  • Recurrent Neural Networks (RNNs): Variable sequence lengths make BN difficult
  • Transformers: Self-attention mechanisms benefit from LN
  • Small batch training: When memory constraints limit batch size
  • Online learning: Single-sample updates
Key references: (Ioffe & Szegedy, 2015; Keskar et al., 2016; Bengio, 2012; Zhang et al., 2016)

References

  • Bengio, Y. (2012). Practical recommendations for gradient-based training of deep architectures.
  • Ioffe, S., Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
  • Keskar, N., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P. (2016). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.
  • Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O. (2016). Understanding deep learning requires rethinking generalization.