Activation steering is a technique for controlling LLM outputs without fine-tuning. Instead of updating weights, you extract a steering vector from the model's residual stream — a direction in activation space that corresponds to a target concept — and add it at inference time. The model's behavior shifts predictably along that direction while its other capabilities stay largely intact.

How it works

  1. Collect contrastive pairs — gather prompts that do and don’t express the target concept (e.g. “Paris” / neutral)
  2. Extract activations — run both sets through the model and record the residual stream at a chosen layer
  3. Compute the steering vector — take the mean difference between the two activation sets
  4. Apply at inference — add α × steering_vector to the residual stream during the forward pass; scale α to control intensity
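The four steps above can be sketched end to end. This is an illustrative toy using synthetic activations in place of a real model's residual stream (in practice, steps 1–2 would capture activations with a forward hook at a chosen layer); `d_model`, `alpha`, and the random concept direction are all stand-in assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hidden size of the hypothetical model

# Steps 1-2: synthetic stand-ins for residual-stream activations recorded
# at one layer for concept-expressing vs. neutral prompts.
concept_dir = rng.normal(size=d_model)
concept_acts = rng.normal(size=(32, d_model)) + concept_dir
neutral_acts = rng.normal(size=(32, d_model))

# Step 3: steering vector = mean difference between the two activation sets.
steering_vector = concept_acts.mean(axis=0) - neutral_acts.mean(axis=0)

# Step 4: at inference, add alpha * steering_vector to the residual stream;
# alpha scales how strongly the output is pushed toward the concept.
def steer(residual, alpha=4.0):
    return residual + alpha * steering_vector

h = rng.normal(size=d_model)   # one token's residual-stream state
h_steered = steer(h)

# The steered state has a larger component along the concept direction.
print(np.dot(h_steered, concept_dir) > np.dot(h, concept_dir))
```

With a real model, `steer` would run inside a hook on the chosen layer's output rather than on a free-standing vector, but the arithmetic is the same.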

Demo

The Eiffel Tower Llama space demonstrates this interactively: a steering vector derived from Eiffel Tower–related activations is injected into Llama, progressively shifting its completions toward Paris-related content.

Lab

Lab notebook under development. Track progress in AURA-655.
The lab will walk you through:
  • Extracting a concept steering vector from a small open model (Llama 3.2 1B)
  • Applying it at varying strengths (α) and observing output drift
  • Visualising the activation geometry using PCA
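The last lab exercise, visualising activation geometry with PCA, can be previewed with a minimal sketch. Again the activations are synthetic stand-ins (a random concept direction added to noise), and PCA is done directly via SVD so the example needs only NumPy; with real recorded activations the same projection would reveal whether the two prompt sets separate in activation space:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64

# Synthetic stand-ins for activations recorded at a single layer.
concept_dir = rng.normal(size=d_model)
concept_acts = rng.normal(size=(32, d_model)) + concept_dir
neutral_acts = rng.normal(size=(32, d_model))

# PCA via SVD on the mean-centred, pooled activations.
X = np.vstack([concept_acts, neutral_acts])
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ vt[:2].T  # project onto the top two principal components

# Because the concept offset dominates the variance, the two prompt sets
# should form clusters separated along the first principal component.
concept_2d, neutral_2d = coords[:32], coords[32:]
separation = abs(concept_2d[:, 0].mean() - neutral_2d[:, 0].mean())
print(separation > 1.0)
```

In the lab, `coords` would be scatter-plotted (e.g. with matplotlib), colouring points by prompt set, and the mean-difference steering vector would appear as the axis connecting the two clusters.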

Further reading