Activation steering is a technique for controlling LLM outputs without fine-tuning. Instead of updating weights, you extract a steering vector from the model’s residual stream — a direction in activation space that corresponds to a target concept — and add it at inference time. The model’s behavior shifts predictably along that direction while its other capabilities are largely preserved.
How it works
- Collect contrastive pairs — gather prompts that do and don’t express the target concept (e.g. “Paris” / neutral)
- Extract activations — run both sets through the model and record the residual stream at a chosen layer
- Compute the steering vector — take the mean difference between the two activation sets
- Apply at inference — add `α × steering_vector` to the residual stream during the forward pass; scale `α` to control intensity
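The four steps above can be sketched end to end. This is a minimal toy illustration, not a real model: the "residual stream" is a small numpy function and the contrastive activations are random vectors standing in for recorded hidden states (in practice you would capture them with a forward hook on a chosen decoder layer).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one layer's residual stream (hypothetical; a real run
# would record activations from an actual model at a chosen layer).
d_model = 16
W = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

def residual(x):
    # simplified residual connection: x + f(x)
    return x + np.tanh(x @ W)

# Steps 1-2: contrastive activations from prompts that do / don't express
# the concept (random vectors here, standing in for real hidden states).
concept_acts = np.stack([residual(rng.standard_normal(d_model)) for _ in range(8)])
neutral_acts = np.stack([residual(rng.standard_normal(d_model)) for _ in range(8)])

# Step 3: steering vector = mean difference between the two activation sets.
steering_vector = concept_acts.mean(axis=0) - neutral_acts.mean(axis=0)

# Step 4: at inference, add alpha * steering_vector to the residual stream.
alpha = 4.0
def steered_residual(x):
    return residual(x) + alpha * steering_vector

x = rng.standard_normal(d_model)
shift = steered_residual(x) - residual(x)
# the intervention moves the activation exactly alpha units along the vector
```

Because the addition happens after the layer's own computation, the shift is exactly `α × steering_vector` regardless of the input; increasing `α` moves every activation further along the concept direction.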
Demo
The Eiffel Tower Llama space demonstrates this interactively: a steering vector derived from Eiffel Tower–related activations is injected into Llama, progressively shifting its completions toward Paris-related content.
Lab
Lab notebook under development. Track progress in AURA-655.
- Extracting a concept steering vector from a small open model (Llama 3.2 1B)
- Applying it at varying strengths (`α`) and observing output drift
- Visualising the activation geometry using PCA
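The PCA step can be sketched with plain numpy. This is a hypothetical setup using synthetic activations: the "concept" set is offset from the "neutral" set along a hidden direction, and PCA via SVD recovers a leading component closely aligned with the mean-difference steering vector.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic activation sets (hypothetical): concept activations are offset
# from neutral ones along a hidden direction, mimicking recorded states.
d_model, n = 32, 50
direction = rng.standard_normal(d_model)
direction /= np.linalg.norm(direction)
neutral = rng.standard_normal((n, d_model))
concept = rng.standard_normal((n, d_model)) + 3.0 * direction

# PCA via SVD on the mean-centred, pooled activations.
acts = np.vstack([neutral, concept])
centred = acts - acts.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
pcs = centred @ vt[:2].T  # project onto the top two principal components

# The mean-difference steering vector should align with the top component.
steering_vector = concept.mean(axis=0) - neutral.mean(axis=0)
cosine = abs(steering_vector @ vt[0]) / np.linalg.norm(steering_vector)
```

Plotting `pcs` (e.g. with matplotlib, colouring the two halves) shows the clusters separating along PC1, which is the geometric picture the lab notebook aims to visualise.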
Further reading
- Zou et al. (2023) — Representation Engineering: A Top-Down Approach to AI Transparency
- Turner et al. (2023) — Activation Addition: Steering Language Models Without Optimization

