
Limitations of Batch Normalization
Batch Normalization (BN) improves training by normalizing each feature/channel to zero mean and unit variance across the batch and then rescaling it with learnable parameters. However, because it operates across the batch dimension, it struggles with small batch sizes: the per-batch mean and variance become noisy estimates of the true statistics. This is particularly problematic for large models such as LLMs, which often must be trained with small mini-batches due to memory constraints.
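To illustrate how the quality of batch statistics degrades with batch size, here is a minimal NumPy sketch; the channel count, batch sizes, and population statistics are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
num_channels = 8

def batch_stats(batch):
    # BN estimates one mean/variance per channel, computed across the batch dimension.
    return batch.mean(axis=0), batch.var(axis=0)

# Activations drawn from a population with known per-channel statistics.
population_mean = rng.normal(size=num_channels)
population_std = rng.uniform(0.5, 2.0, size=num_channels)

for batch_size in (2, 8, 512):
    batch = population_mean + population_std * rng.normal(size=(batch_size, num_channels))
    mean_est, _ = batch_stats(batch)
    err = np.abs(mean_est - population_mean).mean()
    print(f"batch_size={batch_size:4d}  mean-estimate error={err:.3f}")
```

The estimation error shrinks as the batch grows, which is why BN's reliance on batch statistics becomes a liability at small batch sizes.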
Layer Normalization
For architectures such as recurrent networks and transformers, we apply Layer Normalization. The layer normalization of an input vector $x \in \mathbb{R}^{d}$ is computed as:

$$\mathrm{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where the mean and variance are:

$$\mu = \frac{1}{d} \sum_{i=1}^{d} x_i, \qquad \sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$$

Here:
- $\gamma$ and $\beta$ are learnable parameters (of shape $d$)
- $\epsilon$ is a small constant for numerical stability
- $\odot$ denotes element-wise multiplication
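A minimal NumPy sketch of the formula above (the default `eps` value is an assumption, chosen to match common framework defaults):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization of a single input vector x of shape (d,)."""
    mu = x.mean()                       # mean over the feature dimension
    var = x.var()                       # variance over the feature dimension
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta         # learnable element-wise scale and shift

d = 16
x = np.random.randn(d) * 3 + 2
gamma, beta = np.ones(d), np.zeros(d)   # typical initialization
y = layer_norm(x, gamma, beta)
print(y.mean(), y.std())                # approximately 0 and 1
```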
Key Difference from Batch Normalization
Layer Normalization operates across the feature dimension of each sample independently, normalizing the activations and then rescaling them with learnable parameters. It is effectively the transpose of Batch Normalization (the sketch after the table makes this concrete):

| Aspect | Batch Normalization | Layer Normalization |
|---|---|---|
| Normalizes across | Batch dimension | Feature dimension |
| Depends on | Batch size | Independent of batch |
| Best for | CNNs, large batches | RNNs, Transformers, small batches |
| Training vs Inference | Different behavior | Same behavior |
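
To demonstrate the "transpose" relationship, the following sketch compares PyTorch's `BatchNorm1d` and `LayerNorm` on the same activations; the dimension sizes are arbitrary:

```python
import torch
import torch.nn as nn

batch_size, num_features = 32, 64
x = torch.randn(batch_size, num_features) * 3 + 1  # non-normalized activations

bn = nn.BatchNorm1d(num_features)  # normalizes each feature across the batch
ln = nn.LayerNorm(num_features)    # normalizes each sample across its features

bn.train()  # in training mode BN uses the statistics of the current batch
y_bn = bn(x)
y_ln = ln(x)

# Each feature (column) of the BN output has ~zero mean across the batch.
print(y_bn.mean(dim=0).abs().max())
# Each sample (row) of the LN output has ~zero mean across its features.
print(y_ln.mean(dim=1).abs().max())
```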
When to Use Layer Normalization
- Recurrent Neural Networks (RNNs): Variable sequence lengths make BN difficult
- Transformers: Self-attention mechanisms benefit from LN (see the sketch after this list)
- Small batch training: When memory constraints limit batch size
- Online learning: Single-sample updates
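
As an illustration of the transformer use case above, here is a sketch of a self-attention sub-layer with Layer Normalization applied before attention; the pre-norm placement and the `d_model`/`nhead` values are illustrative assumptions, not a prescribed architecture:

```python
import torch
import torch.nn as nn

class PreNormSelfAttention(nn.Module):
    """Self-attention sub-layer with LayerNorm applied before attention (pre-norm)."""
    def __init__(self, d_model=128, nhead=4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, x):
        # Normalize each token's features independently, then apply attention
        # with a residual connection around the sub-layer.
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)
        return x + attn_out

block = PreNormSelfAttention()
tokens = torch.randn(2, 10, 128)  # (batch, sequence, features): works even with a tiny batch
print(block(tokens).shape)
```

Because LN normalizes each token independently, the block behaves identically at any batch size and at inference time, which is one reason it is preferred over BN in transformers.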
References
- Bengio, Y. (2012). Practical recommendations for gradient-based training of deep architectures.
- Ioffe, S., Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
- Keskar, N., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P. (2016). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.
- Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O. (2016). Understanding deep learning requires rethinking generalization.

