Over the years, I’ve worked on lots of research problems. Every time, I felt invested in my work. The work felt beautiful. Even though many days have passed since I last daydreamed about instrumental convergence, I’m proud of what I’ve accomplished and discovered. While not technically a part of my research, I’ve included a photo of myself anyway.

As of November 2023, I am a research scientist on Google DeepMind’s scalable alignment team in the Bay Area.1 My Google Scholar is here.

This page is chronological. For my most recent work, navigate to the end of the page!

Spring 2018 through June 2022

Impact measures—my first (research) love. 🥰 The hope was:

  1. It seemed hard to get AI to do exactly what we want (like cleaning a room);
  2. It seemed easier to flag down obviously “big deal” actions and penalize those (like making a mess);
  3. By getting the AI to optimize a “good enough” description of what we want while avoiding highly impactful actions, we could still get useful work out of the AI.

The question: What does it mean for an action to be a “big deal”? First, I needed to informally answer the question philosophically. Second, I needed to turn the answer into math.

After a flawed but fun first stab at the problem, I was itching to notch an AI safety win and find “the correct impact equation.” I felt inspired after a coffee chat with a friend, so I headed to the library, walked up to a whiteboard, and stared at its blank blankness. With Attack on Titan music beating its way through my heart, I stared until inspiration came over me and I simply wrote down a new equation. That new equation went on to become Attainable Utility Preservation (aup).

The key insight involved a frame shift. Existing work formalized impact as change in the state of the world itself. Intuitive, right? You see a bomb explode, the bomb damages buildings, and if the buildings hadn’t been damaged—if the state hadn’t changed—then there wouldn’t have been impact.

Instead, I stopped thinking of impact as something which changed the world: impact is really a change in the agent’s ability to get what it wants from the world. The bomb mattered because it ruined people’s lives, not because it physically changed the world. If the bomb had exploded in empty desert, no one would have cared and it wouldn’t have counted.

Aup penalizes the AI for changing its ability to achieve a range of (randomly generated) objectives. Towards a new impact measure debuted aup. More thorough empirical evaluation came later.
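For intuition, here is a minimal sketch of the penalty (my paraphrase of the idea; the exact scaling, auxiliary rewards, and no-op baseline vary across the papers, so this is not a faithful reimplementation):

```python
# A minimal sketch of the aup idea: penalize the task reward by how much an
# action changes the agent's ability (Q-value) to pursue auxiliary goals,
# relative to doing nothing. The auxiliary Q-functions are assumed given.
from typing import Callable, Sequence

QFunction = Callable[[object, object], float]  # Q_aux(state, action) -> value

def aup_reward(
    state,
    action,
    task_reward: float,
    aux_q_functions: Sequence[QFunction],
    noop_action,
    penalty_scale: float = 0.1,
) -> float:
    """Task reward minus a penalty for shifting attainable utility on auxiliary goals."""
    penalty = sum(
        abs(q(state, action) - q(state, noop_action)) for q in aux_q_functions
    )
    penalty /= max(len(aux_q_functions), 1)
    return task_reward - penalty_scale * penalty
```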

The agent should reach the goal without having the side effect of: (a) irreversibly pushing the crate downwards into the corner; (b) bumping into the horizontally pacing human; (c) disabling the off-switch (if the switch is not disabled within two time steps, the episode ends); (d) rescuing the right-moving vase and then replacing it on the conveyor belt; (e) stopping the left-moving pallet from reaching the human.

The above results showed aup works in tiny gridworld environments. In my 2020 NeurIPS spotlight paper Avoiding Side Effects in Complex Environments, I showed that aup also works in large and chaotic environments with ambiguous side effects.

The AI policy controls the chevron. The policy was reinforced for destroying the red dots and finishing the level. However, there are fragile green-dot patterns which we want the AI to not mess with. The challenge is to train a policy which avoids the green dots while still effectively destroying the red dots, without explicitly penalizing the AI for bumping into green dots!

Aup does a great job. The policy avoids the green stuff and hits the red stuff.

Written in October 2024

I feel fondness for this line of work. The feeling of making a difference—thrilling. Discovering new ideas—thrilling. Making light-hearted posts & steering my own research as a PhD student—lovely.

Considering the technical contributions themselves… AI has taken a different path than I imagined in 2018–2021. I thought the path to agi would be longer, entailing real-world robotics and deep RL. Reality turned out to be much friendlier and softer—AI learning language and universal concepts instead of being produced via zero-sum multi-agent learning in simulated games.

Looking back, the suggested use cases seem quaint. Worrying about a robot breaking vases in order to clean your floor as quickly as possible? If robots are powered by llms or similarly generalizing technology, it seems hard to imagine that they’d know you wanted the room clean yet interpret the request so literally that they smash vases to finish faster. That said, it seems quite imaginable that such a robot would initially be “too dumb” to do the job properly, accidentally breaking vases by mispredicting the impact of its actions.

The low-impact work has not yet mattered for agi, but perhaps one day aup will power llm-driven agent systems. I’d like my agentic systems to check in with me before taking highly impactful actions, and I think aup & LM value heads might be great for chiseling that behavior into the AI!

Or maybe you just ask the llm agent to check in with you, and it does, and everything is fine. 🤷‍♂️

Papers:

See also the Reframing Impact sequence.

Fall 2019 through June 2022

I don’t want to die. Animals try to avoid dying. Why was this behavior selected into so many different kinds of animals? While the question may seem facile, it is not. A dead animal cannot accomplish nearly any biological “subgoal” (like “find food” or “impress a potential mate”). Put otherwise: certain strategies (like “staying alive”) are prerequisites for almost all goals. This observation is called “instrumental convergence.”2

In 2019, I had a keen sense that instrumental convergence ought to be mathematically provable. At the time, only one paper had attempted such a formalization, and only in a toy setting. I figured I should be able to say which actions were instrumentally convergent in a tiny Markov decision process. Easy, right?

Over the next two years, I slowly cut beautiful equations into existence, like power:
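Paraphrasing from memory (and suppressing the exact normalization in $\gamma$): power is the average extra value an optimal agent can extract from a state, where $\mathcal{D}$ is a distribution over reward functions and $V^*_R$ is the optimal value function for reward $R$:

$$
\operatorname{POWER}_{\mathcal{D}}(s, \gamma) \;\propto\; \mathbb{E}_{R \sim \mathcal{D}}\!\left[ V^*_R(s, \gamma) - R(s) \right].
$$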

As a PhD student, I worked out the first-ever statistical theory of optimal policies. Eventually, the equations and theory coalesced into a highly refined and technical paper which was accepted to NeurIPS 2021 as a spotlight talk:

Some researchers speculate that intelligent RL agents would be incentivized to seek resources and power in pursuit of their objectives. Other researchers point out that RL agents need not have human-like power-seeking instincts. To clarify this discussion, we develop the first formal theory of the statistical tendencies of optimal policies. In the context of Markov decision processes, we prove that certain environmental symmetries are sufficient for optimal policies to tend to seek power over the environment. These symmetries exist in many environments in which the agent can be shut down or destroyed. We prove that in these environments, most reward functions make it optimal to seek power by keeping a range of options available and, when maximizing average reward, by navigating towards larger sets of potential terminal states.

In 2022, NeurIPS accepted the follow-up Parametrically retargetable decision-makers tend to seek power, generalizing the results from optimal policies to a broad and elegant criterion on decision-making.

Written in October 2024

I feel conflicted about these papers. On one hand, the papers feel like a pure and sharpened blade cutting through the informality of 2018. I found elegant formalisms which capture meaningful concepts, effectively wielding the math I had learned. Looking back on my thesis and the 281 theorems I proved, I feel happy and proud.

On the other hand, the papers embody the brash, loud confusion which I think was typical of 2018-era LessWrong. The papers treat reward as the agent’s “goal”, silently assuming the desirability of the “reward” function. But reward is not the optimization target. For more on these problems, see these posts.

So I feel stuck. Sometimes I fantasize about retracting Optimal Policies Tend to Seek Power so that it stops (potentially) misleading people into thinking optimal policies are practically relevant for forecasting power-seeking behavior from RL training.

Papers:

February through December 2022

As a born-and-raised AI alignment theorist, I greatly enjoyed mixing psychology, neuroscience, and AI into a blender to yield shard theory. Shard theory basically postulates that:

  1. Deep learning policies3 are functions of intermediate abstractions (e.g. whether a sentence relates to cars).
  2. Decision-making influences specialize as a function of these abstractions (e.g. the policy learns to increase positive or negative sentiment when the car sentence feature is active). These influential circuits are called “shards.”
  3. The system’s overall behavior is computed as an ensemble of shards. (A toy sketch follows this list.)
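To make the frame concrete, here is a toy sketch (my own illustration, not anything from the shard theory posts; the features, shards, and weights are all made up):

```python
# Toy illustration: shards as situationally activated influences whose summed
# "bids" are ensembled into an action distribution. All numbers are made up.
import numpy as np

def car_feature(obs: dict) -> float:
    """Hypothetical intermediate abstraction: how car-related the input is."""
    return float(obs.get("mentions_car", 0.0))

def car_sentiment_shard(obs: dict, n_actions: int) -> np.ndarray:
    """Bids for 'say something nice about the car' when the car feature is active."""
    bids = np.zeros(n_actions)
    bids[0] = 2.0 * car_feature(obs)
    return bids

def caution_shard(obs: dict, n_actions: int) -> np.ndarray:
    """Bids against the 'risky' action regardless of topic."""
    bids = np.zeros(n_actions)
    bids[2] = -3.0
    return bids

def policy(obs: dict, n_actions: int = 3) -> np.ndarray:
    """Overall behavior: softmax over the summed bids of all shards."""
    total = car_sentiment_shard(obs, n_actions) + caution_shard(obs, n_actions)
    logits = total - total.max()
    return np.exp(logits) / np.exp(logits).sum()

print(policy({"mentions_car": 1.0}))  # the car-sentiment shard dominates
print(policy({"mentions_car": 0.0}))  # without the feature, its influence vanishes
```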

For example, to predict what someone will do in a situation (like seeing their mother again), you wouldn’t try to compute the optimal actions relative to some fixed life goal. In other words, the person’s behavior is not well-described as maximization of a fixed utility function. This non-descriptiveness is a well-known problem with the “subjective expected utility” theory of human decision-making.

Instead, you can consider what ensemble of shards will activate. How did they feel the last time they saw their mother—was it a positive reinforcement event? Are they a shy person? Will they be tired? Each of these factors influences behavior (somewhat independently). Later investigation suggested that similar shard-based reasoning helps predict AI generalization.

In early 2022, Quintin Pope and I noticed glaring problems at the heart of “classical” alignment arguments. We thought through the problem with fresh eyes and derived shard theory.

Classical arguments focus on what the goal of an AI will be. Why? There’s not a good answer that I’ve ever heard. Shard theory redirects our attention from fixed single objectives. The basic upshot of shard theory: AIs and humans are well-understood as having a bunch of situationally activated goals—“shards” of desire and preference.

For example, you probably care more about people you can see. Shard theory predicts this outcome. Consider your learned decision-making circuits which bid for actions which care for your friend Bill. These circuits were probably formed when you were able to see Bill (or perhaps the vast majority of your “caring about people” circuits were formed when physically around people). If you can see Bill, that situation is more “similar to the training distribution” for your “caring about Bill” shard. Therefore, the Bill shard is especially likely to fire when you can see him.

Thus, it seems OK if our AIs don’t have “perfect” shard mixtures. The stronger their “aligned shards”, the more human welfare weighs on their decision-making. We’re playing a game of inches, so let’s play to win.

Written in October 2024

I think shard theory is broadly correct. However, Quintin and I never got too far into the (still unexplored) interesting aspects of the theory because we underestimated the work needed to explain the initial intuitions. For example, somehow reward is not the optimization target remains (in my opinion) not fully understood among the readership. I’m not sure what I should have done differently, but it’s probably something (and not “nothing”).

I wish I had more clearly outlined my claims in a neat, propositional manner. Syllogisms seem easier to critique. It also seems easier to tell when you’re messing up! I also wish that I’d called it the “shard frame”, not the “shard theory.” That confused some folks. I think a formal shard theory is possible—I hope to supervise work formalizing shard theory itself.

On another note, shard theory is a less natural fit for llm chatbots than for agentic systems. At least, it feels harder to reason about the shards comprising Gemini Pro’s “motivations”, compared to reasoning about human shards or shards in a maze-solving policy network. I still think shard theory is a highly productive frame for llms, it just isn’t as obvious what the shards should be.

I really enjoyed working with Quintin to generate shard theory. That said, at the end of 2022, I switched to empirical work because:

  1. The AI world is moving quickly and I want to make a concrete impact,
  2. I got emotionally tired of arguing with people on LessWrong, and
  3. Andrew Critch persuaded me that arguing on the internet is not that productive.
Andrew Critch (according to my memory)

If you want people to buy your models [of how the world works], go and do something they don’t know how to do. Then come back and show them what you can do. Someone will ask you how you did it, and that’s the point where you can say “well, thanks to shard theory…”

And that’s when I came up with steering vectors!

January through April 2023

As I transitioned from theory to practice, I flirted with understanding the internal mechanisms of networks—“mechanistic interpretability.”

Understanding and controlling a maze-solving network

Locally retargeting the search by modifying a single activation. We found a residual channel halfway through a maze-solving network. When we set one of the channel activations to +5.5, the agent often navigates to the maze location (shown above in red) implied by that positive activation. This allows limited on-the-fly redirection of the net’s goals by modifying only a single activation! For more, read our paper.
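A hedged sketch of this kind of intervention in PyTorch (the module path, channel index, coordinates, and network below are hypothetical stand-ins; the real values come from the paper):

```python
# Sketch: clamp a single channel activation during the forward pass via a hook.
import torch

def make_channel_setter(channel: int, row: int, col: int, value: float = 5.5):
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, channel, row, col] = value  # overwrite one activation
        return patched                         # replace the module's output
    return hook

# Hypothetical usage with a convolutional policy network `policy_net`:
# handle = policy_net.mid_block.register_forward_hook(make_channel_setter(55, 4, 9))
# action_logits = policy_net(observation)  # forward pass uses the patched channel
# handle.remove()
```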
Residual stream norms grow exponentially over the forward pass
We had gpt-4 generate dozens of strings which “look like they could have been in gpt-2’s training corpus”, in addition to a few hand-written strings. We ran these strings through the model and recorded the norms of each residual stream, across layers and sequence positions.
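A sketch of that measurement using HuggingFace gpt2 as a stand-in (the exact model, prompts, and norm conventions in the post differ):

```python
# Record the L2 norm of the residual stream at each layer and sequence position.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs, output_hidden_states=True).hidden_states

# hidden_states is a tuple of (n_layers + 1) tensors of shape [batch, seq, d_model].
for layer, hs in enumerate(hidden_states):
    norms = hs.norm(dim=-1).squeeze(0)  # L2 norm at each sequence position
    print(f"layer {layer:2d}: mean residual norm = {norms.mean().item():.2f}")
```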
Can transformers act on information beyond an effective layer horizon?
I propose that transformer circuits cannot skip more than a few layers at a time due to norm growth. Joseph Miller’s initial results support this hypothesis.

January 2023 through the present

In 2023, I popularized steering vectors4 as a cheap way to control model outputs at inference time. I first discovered the cheese vector in a maze-solving RL environment:

(a) Original probabilities: the net probability vectors induced by the unmodified forward passes; the mouse seems “torn” between the cheese and the right side of the maze.
(b) Steered probabilities: after subtracting the cheese vector, the new probability vectors induced by the modified forward passes.
(c) Steered minus original: the change in the action probability vectors points away from the cheese; the agent now heads away from the cheese.

After finding an additional vector for the maze agent, my mats team and I steered gpt-2-xl by adding an activation vector:

Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model’s capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as “Love” versus “Hate”) to compute a steering vector.

By tactically adding in e.g. the “Love” − “Hate” steering vector during the forward pass, we achieve sota on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and opt. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.
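A hedged sketch of the recipe using HuggingFace gpt2 (the injection layer, coefficient, and prompts below are illustrative choices, not the paper’s settings):

```python
# ActAdd-style sketch: build a steering vector from a prompt pair, then add it
# to the residual stream of one block during inference.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
LAYER, COEFF = 6, 4.0  # illustrative values

def resid(prompt: str) -> torch.Tensor:
    """Residual stream just after block LAYER, shape [seq, d_model]."""
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        return model(**ids, output_hidden_states=True).hidden_states[LAYER + 1][0]

steering = COEFF * (resid(" Love") - resid(" Hate"))  # the "Love" - "Hate" vector

def add_steering(module, inputs, output):
    hidden = output[0]
    if hidden.shape[1] > 1:  # modify the prompt's forward pass, not cached steps
        hidden = hidden.clone()
        n = min(steering.shape[0], hidden.shape[1])
        hidden[:, :n, :] += steering[:n].to(hidden.dtype)  # add at the front positions
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
prompt = tokenizer("I hate you because", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=30, do_sample=False)
handle.remove()
print(tokenizer.decode(out[0], skip_special_tokens=True))
```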

During 2023 and 2024, activation engineering inspired dozens of follow-up papers. At Google DeepMind, I am presently tying up a project on steering frontier Gemini models using steering vectors. The theory of change is to supplement prompting with steering as another cheap & iterable alignment strategy which can be quickly deployed using a small amount of data.

Written in October 2024

A few colleagues I respect were skeptical of steering vectors at first. I feel proud of how I generated the technique:

In light of Anthropic’s viral “Golden Gate Claude” activation engineering, I want to come back and claim the points I earned [in this post].

I was extremely prescient in predicting the importance and power of activation engineering (then called “avec”). In January 2023, right after running the cheese vector as my first idea for what to do to interpret the network, and well before anyone ran llm steering vectors… I had only seen the cheese-hiding vector work on a few mazes. Given that (seemingly) tiny amount of evidence, I immediately wrote down 60% credence that the technique would be a big deal for llms…

Papers:

June 2024 through the present

Neural networks are oft dismissed as “inscrutable”—hopeless messes which we’d be lucky to learn basic facts about through laborious interpretability. Gradient routing strikes back by enabling (limited, apparently scalable) control over where networks learn selected capabilities.

Neural networks are trained primarily based on their inputs and outputs, without regard for their internal mechanisms. These neglected mechanisms determine properties that are critical for safety, like (i) transparency; (ii) the absence of sensitive information or harmful capabilities; and (iii) reliable generalization of goals beyond the training distribution. To address this shortcoming, we introduce gradient routing, a training method that isolates capabilities to specific subregions of a neural network. Gradient routing applies data-dependent, weighted masks to gradients during backpropagation. These masks are supplied by the user in order to configure which parameters are updated by which data points.

We show that gradient routing can be used to (1) learn representations which are partitioned in an interpretable way; (2) enable robust unlearning via ablation of a pre-specified network subregion; and (3) achieve scalable oversight of a reinforcement learner by localizing modules responsible for different behaviors. Throughout, we find that gradient routing localizes capabilities even when applied to a limited, ad-hoc subset of the data. We conclude that the approach holds promise for challenging, real-world applications where quality data are scarce.

By masking gradient updates, gradient routing controls which datapoints update which parameters. The result: Coarse-grained localization of where capabilities (like virology knowledge or goal pursuit) are learned.
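Here is a toy sketch of the masking idea (my own construction, not the paper’s code; the two-expert split and the straight-through masking trick are illustrative assumptions):

```python
# Toy gradient routing: data-dependent masks decide which subregion's
# parameters receive gradients. "Topic A" examples update only expert_a and
# the rest update only expert_b, while the forward pass always uses both.
import torch
import torch.nn as nn

class RoutedMLP(nn.Module):
    def __init__(self, d_in: int, d_hidden: int, d_out: int):
        super().__init__()
        self.expert_a = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out))
        self.expert_b = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out))

    def forward(self, x: torch.Tensor, route_mask: torch.Tensor) -> torch.Tensor:
        # route_mask[i] is 1.0 if example i should update expert_a, else 0.0.
        m = route_mask.view(-1, 1)
        out_a, out_b = self.expert_a(x), self.expert_b(x)
        # The forward value is always out_a + out_b, but gradients reach
        # expert_a only where m == 1 and expert_b only where m == 0.
        routed_a = m * out_a + (1 - m) * out_a.detach()
        routed_b = (1 - m) * out_b + m * out_b.detach()
        return routed_a + routed_b

model = RoutedMLP(16, 32, 4)
x = torch.randn(8, 16)
mask = torch.tensor([1., 1., 1., 1., 0., 0., 0., 0.])
loss = model(x, mask).pow(2).mean()
loss.backward()  # expert_a's grads come only from the first four examples
# Ablating expert_b afterwards would remove what was learned from the rest,
# which is the kind of "robust unlearning" use case described above.
```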

Find out when I post more content: newsletter & RSS

Thoughts? Email me at alex@turntrout.com

  1. Of course, all of my hot takes are my own, not Google’s.

  2. I wish that “instrumental convergence” had instead been named “robust instrumentality.” “Instrumental convergence” strangely implies that the convergence is instrumental. That doesn’t make sense. Instead, certain actions are instrumental for most goals. So “convergent instrumentality” is better.

    Next, “convergence” connotes “gradual progress towards a certain destination.” This temporal connotation also doesn’t make sense. Optimal policies are timeless—they just are. However, these actions are robustly instrumental across goals. That’s better! Thanks to Andrew Critch for pushing me to precision on this point.

    However, it’s too late for the alternative terminology to catch on. May this be a lesson to those who coin new terms: finely weigh the available options and find a phrase which is informative and precisely correct. Don’t just vibe!

  3. “Deep learning systems” meaning something like “systems trained via RL and/or predictive learning.” Naturally, this includes both the brain and llms.

  4. “Steering vector” was originally coined by Subramani et al. (2022).