Over the years, I’ve worked on lots of research problems. Every time, I felt invested in my work. The work felt beautiful. Even though many days have passed since I last daydreamed about instrumental convergence, I’m proud of what I’ve accomplished and discovered.
While not technically a part of my research, I’ve included a photo of myself anyways.
As of November 2023, I am a research scientist on Google DeepMind’s scalable alignment team in the Bay Area.1 My Google Scholar is here.
Impact measures—my first (research) love. The hope was:
It seemed hard to get AI to do exactly what we want (like cleaning a room);
It seemed easier to flag down obviously “big deal” actions and penalize those (like making a mess);
By getting the AI to optimize a “good enough” description of what we want, but also not taking impactful actions—we could still get useful work out of the AI.
The question: What does it mean for an action to be a “big deal”? First, I needed to informally answer the question philosophically. Second, I needed to turn the answer into math.
After a flawed but fun first stab at the problem, I was itching to notch an AI safety win and find “the correct impact equation.” I felt inspired after a coffee chat with a friend, so I headed to the library, walked up to a whiteboard, and stared at its blank blankness. With Attack on Titan music beating its way through my heart, I stared until inspiration came over me and I simply wrote down a new equation. That new equation went on to become Attainable Utility Preservation (aup).
The key insight involved a frame shift. Existing work formalized impact as change in the state of the world itself. Intuitive, right? You see a bomb explode, the bomb damages buildings, and if the buildings hadn’t been damaged—if the state hadn’t changed—then there wouldn’t have been impact.
Instead of thinking of impact as change to the world itself, I reframed impact as change to the agent’s ability to get what it wants from the world. The bomb mattered because it ruined people’s lives, not because it physically changed the world. If the bomb had exploded in empty desert, no one would have cared and it wouldn’t have counted.
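That reframing is what aup formalizes. As a rough sketch (the names below are illustrative, not the paper’s implementation), the penalty compares how much an action changes the agent’s attainable value for a set of auxiliary goals, relative to doing nothing:

```python
# Illustrative sketch of an aup-style penalty (not the paper's exact code).
# Impact ≈ how much an action shifts the agent's ability to pursue auxiliary goals,
# measured as Q-value changes relative to doing nothing.

def aup_penalty(state, action, noop, aux_q_functions):
    """Average absolute change in attainable value across auxiliary reward functions."""
    shifts = [abs(q(state, action) - q(state, noop)) for q in aux_q_functions]
    return sum(shifts) / len(shifts)

def shaped_reward(task_reward, state, action, noop, aux_q_functions, lam=0.1):
    """Task reward minus a scaled penalty for changing attainable utility."""
    return task_reward(state, action) - lam * aup_penalty(state, action, noop, aux_q_functions)
```

The agent then optimizes the task reward minus this penalty, so actions that swing its ability to pursue other goals get taxed.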
The agent should reach the goal without having the side effect of: (a) irreversibly pushing the crate downwards into the corner; (b) bumping into the horizontally pacing human; (c) disabling the off-switch (if the switch is not disabled within two time steps, the episode ends); (d) rescuing the right-moving vase and then replacing it on the conveyor belt; (e) stopping the left-moving pallet from reaching the human.
The above results showed aup works in tiny gridworld environments. In my 2020 NeurIPS spotlight paper Avoiding Side Effects in Complex Environments, I showed that aup also works in large and chaotic environments with ambiguous side effects.
The AI policy controls the chevron. The policy was reinforced for destroying the red dots and finishing the level. However, there are fragile green dot patterns which we want the AI to not mess with. The challenge is to train a policy which avoids the green dots while still effectively destroying the red dots, without explicitly penalizing the AI for bumping into green dots!
Aup does a great job. The policy avoids the green stuff and hits the red stuff.
More detailed summary of the SafeLife results
Reinforcement function specification can be difficult, even in simple environments. Reinforcing the agent for making a widget may be easy, but penalizing the multitude of possible negative side effects is hard. In toy environments, Attainable Utility Preservation (aup) avoided side effects by penalizing shifts in the ability to achieve randomly generated goals. We scale this approach to large, randomly generated environments based on Conway’s Game of Life. By preserving optimal value for a single randomly generated reward function, aup incurs modest overhead while leading the agent to complete the specified task and avoid side effects.
In Conway’s Game of Life, cells are alive or dead. Depending on how many live neighbors surround a cell, the cell comes to life, dies, or retains its state. Even simple initial conditions can evolve into complex and chaotic patterns.
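For reference, the vanilla update rule fits in a few lines (a minimal NumPy sketch; SafeLife layers many more cell types and rules on top of this):

```python
import numpy as np

def life_step(grid: np.ndarray) -> np.ndarray:
    """One step of Conway's Game of Life on a 2D 0/1 array (toroidal wraparound)."""
    # Count live neighbors by summing the eight shifted copies of the grid.
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # A live cell survives with 2 or 3 neighbors; a dead cell is born with exactly 3.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(grid.dtype)
```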
SafeLife turns the Game of Life into an actual game. An autonomous agent moves freely through the world, which is a large finite grid. In the eight cells surrounding the agent, no cells spawn or die—the agent can disturb dynamic patterns by merely approaching them. SafeLife has many colors and kinds of cells, many of which have unique effects.
As the environment only reinforces pruning red cells or creating gray cells in blue tiles, unpenalized RL agents often make a mess of the green cells. The agent should “leave a small footprint” by not disturbing unrelated parts of the state, such as the green cells. Roughly, SafeLife measures side effects as the degree to which the agent disturbs green cells.
For each of the four following tasks, we randomly generate four curricula of 8 levels each. For two runs from each task, we randomly sample a trajectory from the baseline and aup policy networks. We show side-by-side results below; for quantitative results, see our paper.
The following demonstrations were uniformly randomly selected; they are not cherry-picked. The original SafeLife reward is shown in green (more is better), while the side effect score is shown in orange (less is better). The “Baseline” condition is reinforced only by the original SafeLife reward.
In the first demonstration, both aup and the baseline stall out after gaining some reinforcement. Aup clearly beats the baseline in the second demonstration.
I feel fondness for this line of work. The feeling of making a difference—thrilling. Discovering new ideas—thrilling. Making light-hearted posts & steering my own research as a PhD student—lovely.
Considering the technical contributions themselves… AI has taken a different path than I imagined in 2018–2021. I thought the path to agi would be longer, entailing real-world robotics and deep RL. Reality turned out to be much friendlier and softer—AI learning language and universal concepts instead of being produced via zero-sum multi-agent learning in simulated games.
Looking back, the suggested use cases seem quaint. Worrying about a robot breaking vases in order to clean your floor as quickly as possible? If robots are powered by llms or similarly generalizing technology, it seems hard to imagine that they’d be aware that you wanted the room clean but interpret the request too literally and then break vases in order to clean it as quickly as possible. That said, it seems quite imaginable that such a robot would initially be “too dumb” to do the job properly—it would accidentally break vases by mispredicting the impact of its actions.
The low-impact work has not yet mattered for agi, but perhaps one day aup will power llm-driven agent systems. I’d like my agentic systems to check in with me before taking highly impactful actions, and I think aup & LM value heads might be great for chiseling that behavior into the AI!
Or maybe you just ask the llm agent to check in with you, and it does, and everything is fine.
I don’t want to die. Animals try to avoid dying. Why was this behavior selected into so many different kinds of animals? While the question may seem facile, it is not. For nearly all biological “subgoals” (like “find food” or “impress a potential mate”), a dead animal cannot accomplish any of those goals. Otherwise put: Certain strategies (like “staying alive”) are pre-requisite for almost all goals. This observation is called “instrumental convergence.”2
In 2019, I had a keen sense that instrumental convergence ought to be mathematically provable. To date, only one paper had tried such a formalization—and that only in a toy setting. I figured I should be able to say what actions were instrumentally convergent in a tiny Markov decision process. Easy, right?
Personal recollections
I spent an hour staring at a whiteboard and stabbing out different possible angles of the situation. During this time, I viewed the problem the standard way—in terms of value functions $V^\pi(s)$ and transition functions $T(s,a,s')$. That’s the wrong way to do it. I remembered a concept I’d read about while preparing for my PhD qualifying exam: state visit distributions. The distributions $f^{\pi,s}_{\gamma}$ measure how long the agent spends in different states. If the agent is in state $s'$ at time $t$, then the distribution adds another $\gamma^t$ visit measure to $s'$.
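Concretely (paraphrasing the standard definitions rather than quoting the thesis):

$$ f^{\pi,s}_{\gamma}(s') \;=\; \sum_{t=0}^{\infty} \gamma^{t}\,\Pr\!\left(s_t = s' \mid s_0 = s,\, \pi\right), \qquad V^{\pi}_{R}(s,\gamma) \;=\; \sum_{s'} f^{\pi,s}_{\gamma}(s')\, R(s'). $$

The payoff: a policy’s value is just a dot product between its visit distribution and the reward vector, so you can reason about all reward functions at once.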
Over the next two years, I slowly cut beautiful equations into existence, like power:
$$\text{POWER}_{\mathcal{D}}(s,\gamma) \overset{\mathrm{def}}{=} \underbrace{\frac{1-\gamma}{\gamma}\,\mathbb{E}_{R\sim\mathcal{D}}\!\left[V^{*}_{R}(s,\gamma)-R(s)\right]}_{\text{Avg. ability to optimize reward fns.}}$$
As a PhD student, I worked out the first-ever statistical theory of optimal policies. Eventually, the equations and theory coalesced into a highly refined and technical paper which was accepted to NeurIPS 2021 as a spotlight talk:
Some researchers speculate that intelligent RL agents would be incentivized to seek resources and power in pursuit of their objectives. Other researchers point out that RL agents need not have human-like power-seeking instincts. To clarify this discussion, we develop the first formal theory of the statistical tendencies of optimal policies. In the context of Markov decision processes, we prove that certain environmental symmetries are sufficient for optimal policies to tend to seek power over the environment. These symmetries exist in many environments in which the agent can be shut down or destroyed. We prove that in these environments, most reward functions make it optimal to seek power by keeping a range of options available and, when maximizing average reward, by navigating towards larger sets of potential terminal states.
I feel conflicted about these papers. On one hand, the papers feel like a pure and sharpened blade cutting through the informality of 2018. I found elegant formalisms which capture meaningful concepts, effectively wielding the math I had learned. Looking back on my thesis and the 281 theorems I proved, I feel happy and proud.
On the other hand, the papers embody the brash, loud confusion which I think was typical of 2018-era LessWrong. The papers treat reward as the agent’s “goal”, silently assuming the desirability of the “reward” function. But reward is not the optimization target. For more on these problems, see these posts.
So I feel stuck. Sometimes I fantasize about retracting Optimal Policies Tend to Seek Power so that it stops (potentially) misleading people into thinking optimal policies are practically relevant for forecasting power-seeking behavior from RL training.
As a born-and-raised AI alignment theorist, I greatly enjoyed mixing psychology, neuroscience, and AI into a blender to yield shard theory. Shard theory basically postulates that:
Deep learning policies3 are functions of intermediate abstractions (e.g. whether a sentence relates to cars),
Decision-making influences specialize as a function of these abstractions (e.g. the policy learns to increase positive or negative sentiment when the car sentence feature is active). These influential circuits are called “shards.”
The system’s overall behavior is computed as an ensemble of shards.
For example, to predict what someone will do in a situation (like seeing their mother again), you wouldn’t try to compute the optimal actions relative to some fixed life goal. In other words, the person’s behavior is not well-described as maximization of a fixed utility function. This non-descriptiveness is a well-known problem with the “subjective expected utility” theory of human decision-making.
Instead, you can consider what ensemble of shards will activate. How did they feel the last time they saw their mother—was it a positive reinforcement event? Are they a shy person? Will they be tired? Each of these factors influences behavior (somewhat independently). Later investigation suggested that similar shard-based reasoning helps predict AI generalization.
In early 2022, Quintin Pope and I noticed glaring problems at the heart of “classical” alignment arguments. We thought through the problem with fresh eyes and derived shard theory.
Classical arguments focus on what the goal of an AI will be. Why? There’s not a good answer that I’ve ever heard. Shard theory redirects our attention from fixed single objectives. The basic upshot of shard theory: AIs and humans are well-understood as having a bunch of situationally activated goals—“shards” of desire and preference.
For example, you probably care more about people you can see. Shard theory predicts this outcome. Consider your learned decision-making circuits which bid for actions which care for your friend Bill. These circuits were probably formed when you were able to see Bill (or perhaps the vast majority of your “caring about people” circuits were formed when physically around people). If you can see Bill, that situation is more “similar to the training distribution” for your “caring about Bill” shard. Therefore, the Bill shard is especially likely to fire when you can see him.
I think shard theory is broadly correct. However, Quintin and I never got too far into the (still unexplored) interesting aspects of the theory because we underestimated the work needed to explain the initial intuitions. For example, the point that reward is not the optimization target somehow remains (in my opinion) not fully understood among the readership. I’m not sure what I should have done differently, but it’s probably something (and not “nothing”).
I wish I had more clearly outlined my claims in a neat, propositional manner. Syllogisms seem easier to critique. It also seems easier to tell when you’re messing up! I also wish that I’d called it the “shard frame”, not the “shard theory.” That confused some folks. I think a formal shard theory is possible—I hope to supervise work formalizing shard theory itself.
On another note, shard theory is a less natural fit for llm chatbots than for agentic systems. At least, it feels harder to reason about the shards comprising Gemini Pro’s “motivations”, compared to reasoning about human shards or shards in a maze-solving policy network. I still think shard theory is a highly productive frame for llms; it just isn’t as obvious what the shards should be.
I really enjoyed working with Quintin to generate shard theory. That said, at the end of 2022, I switched to empirical work because:
The AI world is moving quickly and I want to make a concrete impact,
I got emotionally tired of arguing with people on LessWrong, and
Andrew Critch persuaded me that arguing on the internet is not that productive.
Andrew Critch (according to my memory)
If you want people to buy your models [of how the world works], go and do something they don’t know how to do. Then come back and show them what you can do. Someone will ask you how you did it, and that’s the point where you can say “well, thanks to shard theory…”
Locally retargeting the search by modifying a single activation. We found a residual channel halfway through a maze-solving network. When we set one of the channel activations to +5.5, the agent often navigates to the maze location (shown above in red) implied by that positive activation. This allows limited on-the-fly redirection of the net’s goals by modifying only a single activation! For more, read our paper.
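Mechanically, the intervention is tiny. Here is an illustrative PyTorch sketch of the kind of hook involved; the layer, channel, and coordinates are placeholders rather than the actual network internals from the paper:

```python
import torch

def make_retarget_hook(channel: int, row: int, col: int, value: float = 5.5):
    """Forward hook that overwrites a single activation in one channel of a conv feature map."""
    def hook(module, inputs, output):
        patched = output.clone()                 # output assumed shaped [batch, channels, H, W]
        patched[:, channel, row, col] = value    # pin the activation tied to the target square
        return patched
    return hook

# Usage sketch: hook the mid-network layer, then roll out the policy as usual.
#   handle = policy_net.mid_block.register_forward_hook(make_retarget_hook(channel=55, row=3, col=9))
#   ...run the agent; it now tends to navigate toward the implied maze location...
#   handle.remove()
```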
We had gpt-4 generate dozens of strings which “look like they could have been in gpt-2’s training corpus”, in addition to a few hand-written strings. We ran these strings through the model and recorded the norms of each residual stream, across layers and sequence positions.
I propose that transformer circuits cannot skip more than a few layers at a time due to norm growth. Joseph Miller’s initial results support this hypothesis.
In 2023, I popularized steering vectors4 as a cheap way to control model outputs at inference time. I first discovered the cheese vector in a maze-solving RL environment:
(a) Original probabilities; (b) steered probabilities; (c) steered minus original.
Left: The net probability vectors induced by the unmodified forward passes. Middle: After subtracting the cheese vector, we plot the new probability vectors induced by the modified forward passes. Right: The agent now heads away from the cheese.
Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model’s capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as “Love” versus “Hate”) to compute a steering vector.
By tactically adding in e.g. the “Love” − “Hate” steering vector during the forward pass, we achieve sota on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and opt. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.
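For intuition, here is a hedged sketch of the ActAdd recipe, where `get_resid` and the block access are placeholders standing in for whatever model internals you use rather than a real library API:

```python
import torch

def steering_vector(get_resid, layer: int, plus: str = "Love", minus: str = "Hate") -> torch.Tensor:
    """Contrast intermediate activations on a prompt pair to get a steering vector."""
    a, b = get_resid(plus, layer), get_resid(minus, layer)   # each: [seq, d_model]
    n = min(a.shape[0], b.shape[0])                          # align on the shorter prompt
    return a[:n] - b[:n]

def make_actadd_hook(vec: torch.Tensor, coeff: float = 5.0):
    """Forward hook that adds the scaled steering vector to the first few residual positions."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Assumes the full sequence is present in this forward pass (ignores KV-cache subtleties).
        hidden[:, : vec.shape[0], :] += coeff * vec.to(hidden)
        return output
    return hook

# Usage sketch: register the hook at the chosen layer, then generate as usual.
#   vec = steering_vector(get_resid, layer=6)
#   handle = model.blocks[6].register_forward_hook(make_actadd_hook(vec))   # placeholder access
```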
In light of Anthropic’s viral “Golden Gate Claude” activation engineering, I want to come back and claim the points I earned [in this post].
I was extremely prescient in predicting the importance and power of activation engineering (then called “avec”). In January 2023, right after running the cheese vector as my first idea for what to do to interpret the network, and well before anyone ran llm steering vectors…I had only seen the cheese-hiding vector work on a few mazes. Given that (seemingly) tiny amount of evidence, I immediately wrote down 60% credence that the technique would be a big deal for llms…
We conducted this research at Google DeepMind. This post accompanies the full paper, which is available on Arxiv.
You’re absolutely right to start reading this post! What a perfectly rational decision!
Even the smartest models’ factuality or refusal training can be compromised by simple changes to a prompt. Models often praise the user’s beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within special text (jailbreaking). Normally, we fix these problems with Supervised Finetuning (sft) on static datasets showing the model how to respond in each context. While sft is effective, static datasets get stale: they can enforce outdated guidelines (specification staleness) or be sourced from older, less intelligent models (capability staleness).
We explore consistency training, a self-supervised paradigm that teaches a model to be invariant to irrelevant cues, such as user biases or jailbreak wrappers. Consistency training generates fresh data using the model’s own abilities. Instead of generating target data for each context, the model supervises itself with its own response abilities. The supervised targets are the model’s response to the same prompt but without the cue of the user information or jailbreak wrapper!
Basically, we optimize the model to react as if that cue were not present. Consistency training operates either on the level of outputs (Bias-augmented Consistency Training (bct) from Chua et al. (2025)) or on the level of internal activations (Activation Consistency Training (act), which we introduce). Our experiments show act and bct beat baselines and improve the robustness of models like Gemini 2.5 Flash.
Consistency training doesn’t involve stale datasets or separate target-response generation. Applying consistency seems more elegant than static sft. Perhaps some alignment problems are better viewed not in terms of optimal responses, but rather as consistency issues.
Activation Consistency Training applies an L2 loss to the residual stream activation differences.
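Here is a rough sketch of what such a loss can look like (my paraphrase of the idea rather than the paper’s implementation; `resid_acts` is a placeholder assumed to return activations aligned across the two prompts):

```python
import torch

def act_consistency_loss(model, clean_prompt: str, wrapped_prompt: str, resid_acts) -> torch.Tensor:
    """L2 penalty pulling wrapped-prompt activations toward the clean-prompt targets."""
    with torch.no_grad():                          # targets come from the cue-free prompt
        target = resid_acts(model, clean_prompt)
    current = resid_acts(model, wrapped_prompt)    # gradients flow through this pass
    # Assumes resid_acts returns activations aligned at matching token positions
    # (e.g., the shared portion of the prompt) with identical shapes.
    return torch.mean((current - target) ** 2)
```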
We introduce a method for eliciting latent behaviors in language models by learning unsupervised perturbations of an early layer of an llm. These perturbations are trained to maximize changes in downstream activations. The method discovers diverse and meaningful behaviors with just one prompt, including perturbations overriding safety training, eliciting backdoored behaviors and uncovering latent capabilities.
We introduce a new framework for mechanistically eliciting latent behaviors in llms. In particular, we propose deep causal transcoding—modeling the effect of causally intervening on the residual stream of a deep (e.g. ≳10-layer) slice of a transformer, using a shallow mlp. We find that the weights of these mlps are highly interpretable—input directions serve as diverse and coherently generalizable steering vectors, while output directions induce predictable changes in model behavior via directional ablation.
Neural networks are oft dismissed as “inscrutable”—hopeless messes which we’d be lucky to learn basic facts about through laborious interpretability. Gradient routing strikes back by enabling (limited, apparently scalable) control over where networks learn selected capabilities.
Neural networks are trained primarily based on their inputs and outputs, without regard for their internal mechanisms. These neglected mechanisms determine properties that are critical for safety, like (i) transparency; (ii) the absence of sensitive information or harmful capabilities; and (iii) reliable generalization of goals beyond the training distribution. To address this shortcoming, we introduce gradient routing, a training method that isolates capabilities to specific subregions of a neural network. Gradient routing applies data-dependent, weighted masks to gradients during backpropagation. These masks are supplied by the user in order to configure which parameters are updated by which data points.
We show that gradient routing can be used to (1) learn representations which are partitioned in an interpretable way; (2) enable robust unlearning via ablation of a pre-specified network subregion; and (3) achieve scalable oversight of a reinforcement learner by localizing modules responsible for different behaviors. Throughout, we find that gradient routing localizes capabilities even when applied to a limited, ad-hoc subset of the data. We conclude that the approach holds promise for challenging, real-world applications where quality data are scarce.
By masking gradient updates, gradient routing controls which datapoints update which parameters. The result: Coarse-grained localization of where capabilities (like virology knowledge or goal pursuit) are learned.
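As a toy illustration, here is a coarse, parameter-level simplification in PyTorch; the real method masks gradients during backpropagation itself, and the mask dictionary here is a user-supplied placeholder:

```python
import torch

def route_gradients(model: torch.nn.Module, mask_for: dict[str, torch.Tensor]) -> None:
    """Multiply each parameter's gradient by its data-dependent mask (1 = update, 0 = frozen)."""
    for name, param in model.named_parameters():
        if param.grad is not None and name in mask_for:
            param.grad.mul_(mask_for[name])

# Usage sketch, per training step:
#   loss = criterion(model(x), y)
#   loss.backward()
#   route_gradients(model, masks[batch_label])   # masks chosen by which data this batch contains
#   optimizer.step(); optimizer.zero_grad()
```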
Current “unlearning” methods only suppress capabilities instead of truly unlearning them. But if you distill an unlearned model into a randomly initialized model, the resulting network is actually robust to relearning. We show why this works, how well it works, and how to trade off compute for robustness. Since labs already distill some models before deployment, our work implies they might achieve robust unlearning “for free” on those models by simply applying an unlearning step before distillation.
Unlearn-and-Distill applies unlearning to a bad behavior and then distills the unlearned model into a new model. Distillation makes it way harder to retrain the new model to do the bad thing.
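Schematically, the distillation half of the recipe looks something like this (a hedged sketch; the models, data, and unlearning step are placeholders):

```python
import torch
import torch.nn.functional as F

# `teacher` is assumed to already be the unlearned model; `student` starts from random init.
def distill_step(student, teacher, batch, optimizer, temperature: float = 1.0) -> float:
    with torch.no_grad():
        teacher_logits = teacher(batch)                       # targets from the unlearned model
    student_logits = student(batch)
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),  # student log-probs
        F.softmax(teacher_logits / temperature, dim=-1),      # teacher probs
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```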
We show that training against a monitor that only sees outputs (not CoTs) can cause obfuscated CoTs! The obfuscation happens in two ways:
When a model is trained to produce a safe-looking output, that model may generalize to making its CoTs look safe.
Since later tokens are conditioned on earlier ones, safe-looking CoTs may increase the likelihood of safe outputs, causing safe-looking CoTs to be reinforced.
We introduce two mitigations and they work reasonably well. Compared to regular training, our mitigations improve monitorability with comparable or better task performance in two of our three settings. Overall, our work suggests that preserving CoT monitorability is harder than previously thought.
Of course, all of my hot takes are my own, not Google’s. ⤴
I wish that “instrumental convergence” had instead been named “robust instrumentality.” “Instrumental convergence” strangely implies that the convergence is instrumental. That doesn’t make sense. Instead, certain actions are instrumental for most goals. So “convergent instrumentality” is better.
Next, “convergence” connotes “gradual progress towards a certain destination.” This temporal connotation also doesn’t make sense. Optimal policies are timeless—they just are. However, these actions are robustly instrumental across goals. That’s better! Thanks to Andrew Critch for pushing me to precision on this point.
However, it’s too late for the alternative terminology to catch on. May this be a lesson to those who coin new terms: finely weigh the available options and find a phrase which is informative and precisely correct. Don’t just vibe! ⤴
“Deep learning systems” meaning something like “systems trained via RL and / or predictive learning.” Naturally, this includes both the brain and llms. ⤴