Introduction to Transformers
The transformer architecture and the simple attention mechanism
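As a concrete starting point, here is a minimal sketch of the simple, parameter-free attention step in PyTorch: attention scores are plain dot products between token embeddings, normalized with a softmax, and used to form context vectors. The embedding values and shapes below are illustrative assumptions, not taken from the text.

```python
import torch

# Toy input: 4 tokens, each a 3-dimensional embedding (values are illustrative only).
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33]]
)

# Simple attention: scores are raw dot products between token embeddings,
# with no learnable parameters involved.
scores = inputs @ inputs.T                  # (4, 4) pairwise similarities
weights = torch.softmax(scores, dim=-1)     # each row sums to 1
context = weights @ inputs                  # each row is a weighted sum of all embeddings

print(context.shape)  # torch.Size([4, 3])
```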
The Learnable Attention Mechanism
Implementing the scaled dot-product self-attention mechanism
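The sketch below shows one way to implement scaled dot-product self-attention with learnable query, key, and value projections, assuming PyTorch as in the simple example above. The class name SelfAttention and the dimensions d_in and d_out are placeholders chosen for illustration, not names from the text.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Scaled dot-product self-attention with learnable query/key/value projections."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):                     # x: (batch, seq_len, d_in)
        queries = self.W_query(x)
        keys = self.W_key(x)
        values = self.W_value(x)

        # Scale by sqrt(d_k) so the softmax does not saturate for large dimensions.
        scores = queries @ keys.transpose(-2, -1) / keys.shape[-1] ** 0.5
        weights = torch.softmax(scores, dim=-1)
        return weights @ values               # (batch, seq_len, d_out)

x = torch.randn(2, 5, 16)                     # 2 sequences, 5 tokens each, d_in=16
attn = SelfAttention(d_in=16, d_out=8)
print(attn(x).shape)                          # torch.Size([2, 5, 8])
```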
Multi-Head Self-Attention
Using multiple attention heads to capture different aspects of input sequences
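A hedged sketch of multi-head self-attention under the same PyTorch assumption: the projection dimension is split across several heads that attend independently, and their outputs are concatenated and mixed by a final linear layer. The class name MultiHeadSelfAttention and all dimensions here are illustrative, not taken from the text.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Splits the projection dimension into several heads that attend independently."""

    def __init__(self, d_in, d_out, num_heads):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)
        self.out_proj = nn.Linear(d_out, d_out)

    def forward(self, x):                                    # x: (batch, seq_len, d_in)
        b, t, _ = x.shape
        # Project, then reshape so each head works on its own slice of the dimensions.
        q = self.W_query(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_key(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_value(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        weights = torch.softmax(scores, dim=-1)
        context = (weights @ v).transpose(1, 2).reshape(b, t, -1)  # merge heads back
        return self.out_proj(context)

x = torch.randn(2, 5, 16)
mha = MultiHeadSelfAttention(d_in=16, d_out=8, num_heads=2)
print(mha(x).shape)                                          # torch.Size([2, 5, 8])
```

Because each head attends over a lower-dimensional slice, the overall cost stays close to that of a single head while letting different heads specialize in different relationships between tokens.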

