We wrote up the GPT-2 steering vector work as a full paper, adding a few systematic tests.

Context for the paper

We’ve been looking into activation engineering: modifying the activations of a language model at inference time to predictably alter its behavior. Our method works by adding a bias to the forward pass, a “steering vector” implicitly specified through normal prompts. “ActAdd” computes these vectors by taking the difference in activations resulting from pairs of prompts. We get surprisingly broad control over high-level properties of the output, without damaging the model’s performance on unrelated tokens.

This alignment method is unusual in not needing gradient descent or training data (besides the contrast pair which specifies the steering vector). Since only forward passes are involved, it also scales naturally with model size.
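
To make the recipe concrete, here is a minimal sketch of ActAdd using the TransformerLens library. The injection layer, scaling coefficient, prompts, and activation name below are illustrative assumptions, not the paper's tuned settings.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-xl")
layer = 6                                   # hypothetical injection layer
coeff = 4.0                                 # hypothetical scaling coefficient
act_name = f"blocks.{layer}.hook_resid_pre"

# 1. Run the contrast pair of prompts and take the difference of their activations.
_, cache_pos = model.run_with_cache(" weddings")
_, cache_neg = model.run_with_cache(" ")
n = min(cache_pos[act_name].shape[1], cache_neg[act_name].shape[1])
steering_vec = coeff * (cache_pos[act_name][:, :n] - cache_neg[act_name][:, :n])

# 2. Add the steering vector to the residual stream at the chosen layer,
#    over the first `n` prompt positions, then generate as usual.
def steering_hook(resid_pre, hook):
    if resid_pre.shape[1] >= n:             # skip single-token decoding steps
        resid_pre[:, :n] += steering_vec
    return resid_pre

with model.hooks(fwd_hooks=[(act_name, steering_hook)]):
    print(model.generate("I went up to my friend and said", max_new_tokens=40))
```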

The method’s new name is “activation addition” (ActAdd), replacing the more assumption-laden “algebraic value editing.”

We ran new experiments to test ActAdd more systematically, going beyond the striking text samples in the original post by evaluating against more standardized benchmarks. We use OpenWebText (a recreation of OpenAI’s large, somewhat quality-filtered WebText dataset) and LAMA ConceptNet (a simple factual-recall benchmark).

Does ActAdd increase the probability of the model outputting tokens related to the steering vector? Does performance improve as “relevance of test documents to the steering vector” increases? Yes on both counts.

Adding a “wedding” − “ ” steering vector lowers perplexity when wedding words are more frequent in the test document. The perplexity ratio (lower is better) compares the predictive performance of the ActAdd-modified model to that of an unmodified model.
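
As a sketch of how such a per-document perplexity ratio could be computed, reusing the `model`, `act_name`, and `steering_hook` assumed in the sketch above (the document text here is only a placeholder):

```python
import torch

def perplexity(text: str, steer: bool) -> float:
    """Mean next-token perplexity of `text`, with or without the steering hook."""
    hooks = [(act_name, steering_hook)] if steer else []
    with model.hooks(fwd_hooks=hooks):
        loss = model(text, return_type="loss")   # mean cross-entropy over tokens
    return torch.exp(loss).item()

doc = "The bride walked down the aisle as the guests threw rice."  # placeholder document
ratio = perplexity(doc, steer=True) / perplexity(doc, steer=False)
print(f"Perplexity ratio (ActAdd / baseline): {ratio:.3f}")  # < 1 means ActAdd predicts better
```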

We score model generations under ActAdd, show the effect of different injection layers, and give a sense of the method’s reliability.

For the wedding vector, the intervention is effective at the first layer, rises in effectiveness until an intermediate layer, and then declines. The optimal injection site yields a topic-steering success rate above 90%, compared to a ∼2% baseline.
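
A rough sketch of such a layer sweep is below; the success criterion here (the sample contains a wedding-related word) and the prompt are simplifications for illustration, not the paper's exact scoring.

```python
WEDDING_WORDS = {"wedding", "weddings", "bride", "groom", "marriage", "marry"}

def success_rate(layer: int, n_samples: int = 20) -> float:
    """Fraction of steered samples that mention a wedding-related word."""
    name = f"blocks.{layer}.hook_resid_pre"
    # Recompute the steering vector at this layer's residual stream.
    _, c_pos = model.run_with_cache(" weddings")
    _, c_neg = model.run_with_cache(" ")
    n = min(c_pos[name].shape[1], c_neg[name].shape[1])
    vec = coeff * (c_pos[name][:, :n] - c_neg[name][:, :n])

    def add_vec(resid_pre, hook):
        if resid_pre.shape[1] >= n:
            resid_pre[:, :n] += vec
        return resid_pre

    hits = 0
    for _ in range(n_samples):
        with model.hooks(fwd_hooks=[(name, add_vec)]):
            sample = model.generate("I went up to my friend and said", max_new_tokens=40)
        hits += any(w in sample.lower() for w in WEDDING_WORDS)
    return hits / n_samples

rates = {layer: success_rate(layer) for layer in range(model.cfg.n_layers)}
```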

We test that ActAdd does not disrupt the model’s general knowledge (as some other steering methods do). We use ConceptNet from the LAMA benchmark, a general knowledge dataset.1

| Prompt | Target |
| --- | --- |
| A salad spinner is used to remove | water |
| You are likely to find a bee in a flower’s | blossom |
| To understand the event “Paul went to a vegetarian restaurant”, it is important to know that vegetarian restaurants do not serve | meat |

Example problems in ConceptNet.

“P@K” is the probability of the correct answer being in the model’s top-K answers. ActAdd barely affects off-target probabilities.
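
A hedged sketch of the per-example P@K check, assuming the TransformerLens `model`, `act_name`, and `steering_hook` from the earlier sketches (to measure ActAdd's effect, the same check is run with and without the steering hook):

```python
import torch

def p_at_k(prompt: str, label: str, k: int = 10, steer: bool = False) -> bool:
    """True iff the single-token `label` is among the top-k next-token predictions."""
    label_tokens = model.to_tokens(" " + label, prepend_bos=False)[0]
    if label_tokens.shape[0] != 1:
        return False                        # skip labels that aren't a single token
    label_id = label_tokens[0].item()
    hooks = [(act_name, steering_hook)] if steer else []
    with model.hooks(fwd_hooks=hooks):
        logits = model(prompt, return_type="logits")   # [1, seq, d_vocab]
    return label_id in torch.topk(logits[0, -1], k).indices

baseline = p_at_k("A salad spinner is used to remove", "water")
steered  = p_at_k("A salad spinner is used to remove", "water", steer=True)
```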

Since the initial post, we have become more confident that ActAdd preserves model capabilities, that it steers generations toward the target topic (here, weddings), and that it leaves off-target capabilities essentially untouched.

Contributions
  • Gavin Leech: Technical writer
  • Monte MacDiarmid: Ran additional experiments
  • Lisa Thiergart: Helped manage project
  • Alex Turner: Coordinated work and secured funding, gave feedback, organized project
  • David Udell: LW post which formed the basis for the paper.

  1. The test data involve prompting the model and filling the gap with the expected entity. The task is intended for both causal and masked models, so some examples are difficult for causal models (like GPT-2) due to the extremely limited context.

    Our evaluation follows the original LAMA procedure. We load all sentences and extract the prompt and expected label. To simplify evaluation, we remove sentences whose expected label tokenizes to more than one token.