Introduction to Transformers

The transformer architecture and a simple attention mechanism
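
As a preview of this section's idea, here is a minimal sketch of simple, non-learnable attention in plain NumPy: attention weights come directly from dot products between token embeddings, with no trained parameters. The toy embedding values and variable names are made up for illustration, not taken from the course.

```python
import numpy as np

# Toy sequence of 4 token embeddings, each of dimension 3 (made-up values).
embeddings = np.array([
    [0.4, 0.1, 0.8],
    [0.5, 0.9, 0.1],
    [0.2, 0.8, 0.3],
    [0.7, 0.3, 0.6],
])

# Pairwise dot products: how strongly each token attends to every other token.
scores = embeddings @ embeddings.T            # shape: (4, 4)

# Softmax each row so the attention weights per token sum to 1.
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)

# Each output vector is a weighted average of all token embeddings.
context = weights @ embeddings                # shape: (4, 3)
print(context)
```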

The Learnable Attention Mechanism

Implementing the scaled dot-product self-attention mechanism
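
A minimal sketch of what this section covers, assuming PyTorch; the class name `SelfAttention` and the projection names `W_q`, `W_k`, `W_v` are illustrative, not necessarily the course's own implementation. Queries, keys, and values are produced by learnable linear projections, and the dot-product scores are scaled by the square root of the key dimension before the softmax.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Scaled dot-product self-attention with learnable Q/K/V projections."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_q = nn.Linear(d_in, d_out, bias=False)
        self.W_k = nn.Linear(d_in, d_out, bias=False)
        self.W_v = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):                          # x: (seq_len, d_in)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        # Scale by sqrt(d_k) so the softmax stays in a well-behaved range.
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
        weights = torch.softmax(scores, dim=-1)
        return weights @ v                         # (seq_len, d_out)

x = torch.randn(4, 3)                              # 4 tokens, embedding dim 3
attn = SelfAttention(d_in=3, d_out=2)
print(attn(x).shape)                               # torch.Size([4, 2])
```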

Multi-Head Self-Attention

Using multiple attention heads to capture different aspects of input sequences
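
A hedged sketch of the multi-head idea in PyTorch, again with illustrative names (`MultiHeadSelfAttention`, `W_qkv`, `W_out`): the model dimension is split across several heads that each run scaled dot-product attention in parallel, and their outputs are concatenated and projected back.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Several attention heads run in parallel; outputs are concatenated."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One fused projection producing queries, keys, and values.
        self.W_qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.W_out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                          # x: (batch, seq, d_model)
        b, t, d = x.shape
        q, k, v = self.W_qkv(x).chunk(3, dim=-1)

        # Split the model dimension into independent heads.
        def split(z):
            return z.view(b, t, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)     # (b, heads, t, d_head)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = torch.softmax(scores, dim=-1)
        out = weights @ v                          # (b, heads, t, d_head)
        out = out.transpose(1, 2).reshape(b, t, d) # concatenate the heads
        return self.W_out(out)

x = torch.randn(1, 4, 8)                           # batch 1, 4 tokens, d_model 8
mha = MultiHeadSelfAttention(d_model=8, num_heads=2)
print(mha(x).shape)                                # torch.Size([1, 4, 8])
```

Splitting one projection into per-head chunks, rather than stacking separate single-head modules, is the common efficiency trick: each head sees a lower-dimensional slice, so the total cost stays close to that of a single full-width head.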