We wrote up the GPT-2 steering vector work as a full paper, adding a few systematic tests.

Context for the paper

We’ve been looking into activation engineering: modifying the activations of a language model at inference time to predictably alter its behavior. Our method works by adding a bias to the forward pass, a “steering vector” implicitly specified through normal prompts. “ActAdd” computes these vectors by taking the difference in activations resulting from pairs of prompts. We get surprisingly broad control over high-level properties of the output, without damaging the model’s performance on unrelated tokens.

This alignment method is unusual in not needing gradient descent or training data (besides the contrast pair which specifies the steering vector). Since only forward passes are involved, it also scales naturally with model size.
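
To make the recipe concrete, here is a minimal sketch of ActAdd using the TransformerLens library. The injection layer, scaling coefficient, prompts, and activation name below are illustrative assumptions, not the paper's tuned settings.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-xl")
layer = 6                                   # hypothetical injection layer
coeff = 4.0                                 # hypothetical scaling coefficient
act_name = f"blocks.{layer}.hook_resid_pre"

# 1. Run the contrast pair of prompts and take the difference of their activations.
_, cache_pos = model.run_with_cache(" weddings")
_, cache_neg = model.run_with_cache(" ")
n = min(cache_pos[act_name].shape[1], cache_neg[act_name].shape[1])
steering_vec = coeff * (cache_pos[act_name][:, :n] - cache_neg[act_name][:, :n])

# 2. Add the steering vector to the residual stream at the chosen layer,
#    over the first `n` prompt positions, then generate as usual.
def steering_hook(resid_pre, hook):
    if resid_pre.shape[1] >= n:             # skip single-token decoding steps
        resid_pre[:, :n] += steering_vec
    return resid_pre

with model.hooks(fwd_hooks=[(act_name, steering_hook)]):
    print(model.generate("I went up to my friend and said", max_new_tokens=40))
```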

The method’s new name is “activation addition” (ActAdd), replacing the more assumption-laden “algebraic value editing.”

We ran new experiments to test ActAdd more systematically, going beyond the striking text samples in the original post by evaluating against more standardized benchmarks. We use OpenWebText (a recreation of OpenAI’s large, somewhat quality-filtered WebText dataset) and LAMA ConceptNet (a simple factual-recall benchmark).

Does ActAdd increase the probability of the model outputting tokens related to the steering vector? Does performance improve as “relevance of test documents to the steering vector” increases? Yes on both counts.

Adding a “wedding” − “ ” steering vector lowers perplexity when wedding words are more frequent in the test document. The perplexity ratio (lower is better) compares the predictive performance of the ActAdd-modified model to that of an unmodified model.
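
As a sketch of how such a per-document perplexity ratio could be computed, reusing the `model`, `act_name`, and `steering_hook` assumed in the sketch above (the document text here is only a placeholder):

```python
import torch

def perplexity(text: str, steer: bool) -> float:
    """Mean next-token perplexity of `text`, with or without the steering hook."""
    hooks = [(act_name, steering_hook)] if steer else []
    with model.hooks(fwd_hooks=hooks):
        loss = model(text, return_type="loss")   # mean cross-entropy over tokens
    return torch.exp(loss).item()

doc = "The bride walked down the aisle as the guests threw rice."  # placeholder document
ratio = perplexity(doc, steer=True) / perplexity(doc, steer=False)
print(f"Perplexity ratio (ActAdd / baseline): {ratio:.3f}")  # < 1 means ActAdd predicts better
```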

We score model generations under ActAdd, show the effect of different injection layers, and give a sense of the method’s reliability.

For the wedding vector, the intervention is effective at the first layer, rises in effectiveness until an intermediate layer, and then declines. The optimal injection site yields a topic-steering success rate above 90%, compared to a ∼2% baseline.
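
A rough sketch of such a layer sweep is below; the success criterion here (the sample contains a wedding-related word) and the prompt are simplifications for illustration, not the paper's exact scoring.

```python
WEDDING_WORDS = {"wedding", "weddings", "bride", "groom", "marriage", "marry"}

def success_rate(layer: int, n_samples: int = 20) -> float:
    """Fraction of steered samples that mention a wedding-related word."""
    name = f"blocks.{layer}.hook_resid_pre"
    # Recompute the steering vector at this layer's residual stream.
    _, c_pos = model.run_with_cache(" weddings")
    _, c_neg = model.run_with_cache(" ")
    n = min(c_pos[name].shape[1], c_neg[name].shape[1])
    vec = coeff * (c_pos[name][:, :n] - c_neg[name][:, :n])

    def add_vec(resid_pre, hook):
        if resid_pre.shape[1] >= n:
            resid_pre[:, :n] += vec
        return resid_pre

    hits = 0
    for _ in range(n_samples):
        with model.hooks(fwd_hooks=[(name, add_vec)]):
            sample = model.generate("I went up to my friend and said", max_new_tokens=40)
        hits += any(w in sample.lower() for w in WEDDING_WORDS)
    return hits / n_samples

rates = {layer: success_rate(layer) for layer in range(model.cfg.n_layers)}
```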

We test that ActAdd does not disrupt the model’s general knowledge (as some other steering methods do). We use ConceptNet from the LAMA benchmark, a general knowledge dataset.1

| Prompt | Target |
| --- | --- |
| A salad spinner is used to remove | water |
| You are likely to find a bee in a flower’s | blossom |
| To understand the event “Paul went to a vegetarian restaurant”, it is important to know that vegetarian restaurants do not serve | meat |

Example problems in ConceptNet.

“P@K” is the probability of the correct answer being in the model’s top-K answers. ActAdd barely affects off-target probabilities.
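
A hedged sketch of the per-example P@K check, assuming the TransformerLens `model`, `act_name`, and `steering_hook` from the earlier sketches (to measure ActAdd's effect, the same check is run with and without the steering hook):

```python
import torch

def p_at_k(prompt: str, label: str, k: int = 10, steer: bool = False) -> bool:
    """True iff the single-token `label` is among the top-k next-token predictions."""
    label_tokens = model.to_tokens(" " + label, prepend_bos=False)[0]
    if label_tokens.shape[0] != 1:
        return False                        # skip labels that aren't a single token
    label_id = label_tokens[0].item()
    hooks = [(act_name, steering_hook)] if steer else []
    with model.hooks(fwd_hooks=hooks):
        logits = model(prompt, return_type="logits")   # [1, seq, d_vocab]
    return label_id in torch.topk(logits[0, -1], k).indices

baseline = p_at_k("A salad spinner is used to remove", "water")
steered  = p_at_k("A salad spinner is used to remove", "water", steer=True)
```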

Since the initial post, we have become more confident that ActAdd preserves model capabilities, that it steers generations toward the target topic (here, weddings), and that it leaves off-target capabilities essentially untouched.

Contributions
  • Gavin Leech: Technical writer
  • Monte MacDiarmid: Ran additional experiments
  • Lisa Thiergart: Helped manage project
  • Alex Turner: Coordinated work and secured funding, gave feedback, organized project
  • David Udell: LW post which formed the basis for the paper.

  1. The test data involve prompting the model and filling the gap with the expected entity. The task is intended for both causal and masked models, so some examples are difficult for causal models (like GPT-2) due to the extremely limited context.

    Our evaluation follows the original LAMA procedure. We load all sentences and extract the prompt and expected label. To simplify evaluation, we remove sentences whose expected label tokenizes to more than one token.