This post treats reward functions as “specifying goals”, in some sense. As I explained in Reward Is Not The Optimization Target, this is a misconception that can seriously damage your ability to understand how AI works. Rather than “incentivizing” behavior, reward signals are (in many cases) akin to a per-datapoint learning rate. Reward chisels circuits into the AI. That’s it!
As of 2024, I no longer endorse this post. Even though this post presents correct and engaging technical explanations, its speculation seems wrong. For example, deceptive alignment is not known to be “prevalent.”
Key takeaways
 The structure of the agent’s environment often causes instrumental convergence. In many situations, there are (potentially combinatorially) many ways for power-seeking to be optimal, and relatively few ways for it not to be optimal.
 My previous results said something like: in a range of situations, when you’re maximally uncertain about the agent’s objective, this uncertainty assigns high probability to objectives for which power-seeking is optimal.
 My new results prove that in a range of situations, seeking power is optimal for most agent objectives (for a particularly strong formalization of “most”). More generally, the new results say something like: in a range of situations, for most beliefs you could have about the agent’s objective, these beliefs assign high probability to reward functions for which power-seeking is optimal.
 This is the first formal theory of the statistical tendencies of optimal policies in reinforcement learning.
 One result says: whenever the agent maximizes average reward, then for any reward function, most permutations of it incentivize shutdown avoidance.
 The formal theory is now beginning to explain why alignment is so hard by default, and why failure might be catastrophic.
 Before, I thought of environmental symmetries as convenient sufficient conditions for instrumental convergence. But I increasingly suspect that symmetries are the main part of the story.
 I think these results may be important for understanding the AI alignment problem and formally motivating its difficulty.
 For example, my results imply that simplicity priors over reward functions assign non-negligible probability to reward functions for which power-seeking is optimal.
 I expect my symmetry arguments to help explain other “convergent” phenomena, including:
 convergent evolution
 the prevalence of deceptive alignment
 feature universality in deep learning
 One of my hopes for this research agenda: if we can understand exactly why superintelligent goal-directed objective maximization seems to fail horribly, we might understand how to do better.
Thanks to TheMajor, Rafe Kennedy, and John Wentworth for feedback on this post. Thanks to Rohin Shah and Adam Shimi for feedback on the simplicity prior result.
One view on AGI risk is that we’re charging ahead into the unknown, into a particularly unfair game of Minesweeper in which the first click is allowed to blow us up. Following the analogy, we want to understand enough about the mine placement so that we don’t get exploded on the first click. And once we get a foothold, we start gaining information about other mines, and the situation is a bit less dangerous.
My previous theorems on power-seeking said something like: “at least half of the tiles conceal mines.”
I think that’s important to know. But there are many tiles you might click on first. Maybe all of the mines are on the right, and we understand the obvious pitfalls, and so we’ll just click on the left.
That is: we might not uniformly randomly select tiles:
 We might click a tile on the left half of the grid.
 Maybe we sample from a truncated discretized Gaussian.
 Maybe we sample the next coordinate by using the universal prior (rejecting invalid coordinate suggestions).
 Maybe we uniformly randomly load LessWrong posts and interpret the first text bits as encoding a coordinate.
There are lots of ways to sample coordinates, besides uniformly randomly. So why should our sampling procedure tend to activate mines?
My new results say something analogous to: for every coordinate, either it contains a mine, or its reflection across $x=y$ contains a mine, or both. Therefore, for every distribution $D$ over tile coordinates, either $D$ assigns at least 1/2 probability to mines, or it does after you reflect it across $x=y$.
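The reflection argument can be checked numerically. The following is a minimal sketch, assuming a hypothetical 4×4 grid and a hypothetical mine layout (`MINES`) chosen so that every tile or its reflection holds a mine; none of these names come from the original results. It samples arbitrary distributions over tiles and confirms that each one, or its reflection, puts at least 1/2 probability on mines.

```python
import itertools
import random

N = 4  # hypothetical grid size
TILES = list(itertools.product(range(N), repeat=2))

# Hypothetical mine layout with the property from the analogy:
# every tile, or its reflection across x = y, holds a mine.
MINES = {(x, y) for (x, y) in TILES if x >= y}


def reflect(coord):
    """Reflect a coordinate across the line x = y."""
    x, y = coord
    return (y, x)


def reflect_distribution(dist):
    """Push a distribution over tiles through the reflection."""
    return {reflect(coord): p for coord, p in dist.items()}


def mine_probability(dist):
    """Total probability that the distribution assigns to mined tiles."""
    return sum(p for coord, p in dist.items() if coord in MINES)


def random_distribution(rng):
    """An arbitrary (non-uniform) distribution over tiles."""
    weights = [rng.random() for _ in TILES]
    total = sum(weights)
    return {tile: w / total for tile, w in zip(TILES, weights)}


rng = random.Random(0)
# For each sampled distribution, take the better of "itself" and "reflected",
# then record the worst such value over all samples.
worst_case = min(
    max(mine_probability(d), mine_probability(reflect_distribution(d)))
    for d in (random_distribution(rng) for _ in range(1000))
)
print(worst_case >= 0.5 - 1e-9)  # small tolerance for floating-point rounding
```

The reason this always holds: each tile contributes its probability mass to at least one of the two mine probabilities, so the two probabilities sum to at least 1. A distribution that only ever clicks a safe tile such as `(0, 3)` is the extreme case, since its reflection clicks a mine with certainty.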
Definition: Orbit. The orbit of a coordinate $C$ under the symmetric group $S_{2}$ is $\{C,C_{\text{reflected}}\}$. More generally, if we have a probability distribution over coordinates, its orbit is the set of all possible “permuted” distributions.
Orbits under symmetric groups quantify all ways of “changing things around” for that object.
Since my results (in the analogy) prove that at least one of the two blue coordinates conceals a mine, we deduce that the mines are not all on the right.
Some reasons we care about orbits:
 As we will see, orbits highlight one of the key causes of instrumental convergence: certain environmental symmetries (which are, mathematically, permutations in the state space).
 Orbits partition the set of all possible reward functions. If at least half of the elements of every orbit induce power-seeking behavior, that’s strictly stronger than showing that at least half of reward functions incentivize power-seeking (technical note: with the second “half” being with respect to the uniform distribution’s measure over reward functions).
 In particular, we might have hoped that there were particularly nice orbits, where we could specify objectives without worrying too much about making mistakes (like permuting the output a bit). These nice orbits are impossible. This is some evidence of a fundamental difficulty in reward specification.
 Permutations are well-behaved and help facilitate further results about power-seeking behavior. In this post, I’ll prove one such result about the simplicity prior over reward functions.
In terms of coordinates, one hope could have been:
Sure, maybe there’s a way to blow yourself up, but you’d really have to contort yourself into a pretzel in order to algorithmically select such a bad coordinate: all reasonably simple selection procedures will produce safe coordinates.
But suppose you give me a program $P$ which computes a safe coordinate. Let $P'$ call $P$ to compute the coordinate, and then have $P'$ swap the entries of the computed coordinate. $P'$ is only a few bits longer than $P$, and it doesn’t take much longer to compute, either. So the above hope is impossible: safe mine-selection procedures can’t be significantly simpler or faster than unsafe mine-selection procedures.
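To make the “only a few bits longer” point concrete, here is a minimal sketch of the wrapper construction. The names `swap_wrapper` and `safe_program` are hypothetical, and a Python closure only stands in for the idea of a slightly longer program; it is not a statement about description length on any particular universal machine.

```python
from typing import Callable, Tuple

Coordinate = Tuple[int, int]


def swap_wrapper(program: Callable[[], Coordinate]) -> Callable[[], Coordinate]:
    """Build the wrapper program: run the given program, then swap the coordinate's entries."""

    def wrapped() -> Coordinate:
        x, y = program()
        return (y, x)

    return wrapped


def safe_program() -> Coordinate:
    """Hypothetical stand-in for a program that outputs a safe coordinate."""
    return (0, 3)


unsafe_program = swap_wrapper(safe_program)
print(safe_program(), unsafe_program())
```

The wrapper adds a constant amount of code and a constant amount of work regardless of how complicated `safe_program` is, which is the whole argument.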
The section “Simplicity priors assign non-negligible probability to power-seeking” proves something similar about objective functions.
Orbits of goals consist of all the ways of permuting what states get which values. Consider this rewardless Markov decision process (MDP):
Whenever staying put at $A$ is strictly optimal, you can permute the reward function so that it’s strictly optimal to go to $B$. For example, let $R(A) := 1$, $R(B) := 0$ and let $ϕ := (A\ B)$ swap the two states. $ϕ$ acts on $R$ as follows: $ϕ⋅R$ simply permutes the state before evaluating its reward: $(ϕ⋅R)(s) := R(ϕ(s))$.
The orbit of $R$ is $\{R,ϕ⋅R\}$. It’s optimal for the former to stay at $A$, and for the latter to alternate between the two states.
In this three-state MDP, let $R_{C}$ assign 1 reward to $C$ and 0 to all other states, and let $ϕ := (A\ B\ C)$ rotate through the states ($A$ goes to $B$, $B$ goes to $C$, $C$ goes to $A$). Then the orbit of $R_{C}$ is:
| | $C$ | $A$ | $B$ |
| --- | --- | --- | --- |
| $R_{C}$ | $1$ | $0$ | $0$ |
| $ϕ⋅R_{C}$ | $0$ | $1$ | $0$ |
| $ϕ^{2}⋅R_{C}$ | $0$ | $0$ | $1$ |
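Orbits like this one are easy to enumerate mechanically. The sketch below assumes reward functions and permutations are stored as plain dictionaries; the helper names `act` and `cyclic_orbit` are hypothetical. It repeatedly applies the action $(ϕ⋅R)(s) = R(ϕ(s))$ until the reward function repeats.

```python
def act(phi, reward):
    """Apply a state permutation to a reward function: (phi . R)(s) = R(phi(s))."""
    return {s: reward[phi[s]] for s in reward}


def cyclic_orbit(phi, reward):
    """Collect R, phi.R, phi^2.R, ... until the sequence returns to an earlier element."""
    orbit = []
    current = dict(reward)
    while current not in orbit:
        orbit.append(current)
        current = act(phi, current)
    return orbit


# The three-state example: phi rotates A -> B -> C -> A, and R_C rewards only C.
phi = {"A": "B", "B": "C", "C": "A"}
R_C = {"A": 0, "B": 0, "C": 1}

orbit = cyclic_orbit(phi, R_C)
for element in orbit:
    print(element)
```

Each of the three orbit elements puts reward 1 on a different state, so as a set the orbit contains exactly the three “indicator” reward functions.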
My new theorems prove that in many situations, for every reward function, power-seeking is incentivized by most (at least half) of its orbit elements.
In Seeking Power is Often Robustly Instrumental in MDPs, the last example involved gems and dragons and (most exciting of all) subgraph isomorphisms:
Sometimes, one course of action gives you “strictly more options” than another. Consider another MDP with iid reward: The right blue gem subgraph contains a “copy” of the upper red gem subgraph. From this, we can conclude that going right to the blue gems... is more probable under optimality for all discount rates between 0 and 1!
We say that $ϕ$ is an environmental symmetry, because $ϕ$ is an element of the symmetric group $S_{∣S∣}$ of permutations on the state space.
Let’s pause for a moment. For half a year, I intermittently and fruitlessly searched for some way of extending the original results beyond iid reward distributions to account for arbitrary reward function distributions.
 Part of me thought it had to be possible—how else could we explain instrumental convergence?
 Part of me saw no way to do it. Reward functions differ wildly; how could a theory possibly account for what “most of them” incentivize?
The recurring thought which kept my hope alive was:
There should be “more ways” for `blue-gems` to be optimal over `red-gems`, than for `red-gems` to be optimal over `blue-gems`.
Then I reconsidered the same state permutation $ϕ$ which proved my original iid-reward theorems. That kind of $ϕ$ would imply that since `blue-gems` has more options, there is therefore greater optimality probability (under iid reward function distributions) for moving toward the blue gems. In the end, that same permutation $ϕ$ holds the key to understanding instrumental convergence in MDPs.
Consider any discount rate $γ∈(0,1)$. For all reward functions $R$ such that $V_{R}(\texttt{red-gems},γ)>V_{R}(\texttt{blue-gems},γ)$, this permutation $ϕ$ turns them into `blue-gem` lovers: $V_{ϕ⋅R}(\texttt{red-gems},γ)<V_{ϕ⋅R}(\texttt{blue-gems},γ)$.
$ϕ$ takes non-power-seeking reward functions, and injectively maps them to power-seeking orbit elements. Therefore, for all reward functions $R$, at least half of the orbit of $R$ must agree that `blue-gems` is optimal!
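The injection argument can be exercised on a toy example. The sketch below does not use the gem environment from the earlier post; it assumes a hypothetical MDP in which the agent chooses once between a red branch with one absorbing state and a blue branch with two. The state names, the reward grid, and the permutation `PHI` are all illustrative assumptions.

```python
import itertools

GAMMA = 0.9  # any discount rate in (0, 1) gives the same comparison
TERMINALS = ["red", "blue1", "blue2"]  # hypothetical absorbing states
GRID = [0.0, 0.25, 0.5, 0.75, 1.0]  # coarse grid of reward values

# PHI swaps the red terminal with one blue terminal; it is an involution.
PHI = {"red": "blue1", "blue1": "red", "blue2": "blue2"}


def act(phi, reward):
    """Apply a state permutation to a reward function: (phi . R)(s) = R(phi(s))."""
    return {s: reward[phi[s]] for s in reward}


def value(reward, terminal):
    """Discounted return for moving to an absorbing state and staying there."""
    return GAMMA * reward[terminal] / (1 - GAMMA)


def v_red(reward):
    return value(reward, "red")


def v_blue(reward):
    # The blue branch offers two options, so the agent takes the better one.
    return max(value(reward, "blue1"), value(reward, "blue2"))


all_rewards = [dict(zip(TERMINALS, r)) for r in itertools.product(GRID, repeat=3)]
red_lovers = [R for R in all_rewards if v_red(R) > v_blue(R)]
images = [act(PHI, R) for R in red_lovers]

# Every strictly red-preferring reward function is sent to a strictly
# blue-preferring one, and no two of them are sent to the same image.
every_image_prefers_blue = all(v_blue(R) > v_red(R) for R in images)
images_are_distinct = len({tuple(R[s] for s in TERMINALS) for R in images}) == len(images)
print(every_image_prefers_blue, images_are_distinct)
```

Because the map is injective and lands entirely among blue-preferring reward functions, red-preferring elements can never outnumber blue-preferring ones within an orbit, which is the “at least half” conclusion in miniature.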
Throughout this post, when I say “most” reward functions incentivize something, I mean the following:
Definition: At state $s$, most reward functions incentivize action $a$ over action $a'$ when for all reward functions $R$, at least half of the orbit agrees that $a$ has at least as much action value as $a'$ does at state $s$.^{1}
The same reasoning applies to distributions over reward functions. And so if you say “we’ll draw reward functions from a simplicity prior”, then most permuted distributions in that prior’s orbit will incentivize power-seeking in the situations covered by my previous theorems. (And we’ll later prove that simplicity priors themselves must assign nontrivial, positive probability to power-seeking reward functions.)
Furthermore, for any distribution which distributes reward “fairly” across states (precisely: independently and identically), their (trivial) orbits unanimously agree that `blue-gems` has strictly greater probability of being optimal. And so the converse isn’t true: it isn’t true that at least half of every orbit agrees that `red-gems` has more power and greater probability of being optimal.
This might feel too abstract, so let’s run through examples.
Even though randomly generated environments are unlikely to satisfy these sufficient conditions for power-seeking tendencies, the results are easy to apply to many structured environments common in reinforcement learning. For example, when $γ≈1$, most reward functions provably incentivize not immediately dying in Pac-Man. Every reward function which incentivizes dying right away can be permuted into a reward function for which survival is optimal.
Most importantly, we can prove that when shutdown is possible, optimal policies try to avoid it if possible. When the agent isn’t discounting future reward (i.e. maximizes average return) and for action encodings, the MDP structure has the right symmetries to ensure that it’s instrumentally convergent to avoid shutdown.
The paper’s discussion section: Corollary 6.14 dictates where average-optimal agents tend to end up, but not how they get there. Corollary 6.14 says that such agents tend not to stay in any given 1-cycle. It does not say that such agents will avoid entering such states. For example, in an embodied navigation task, a robot may enter a 1-cycle by idling in the center of a room. Corollary 6.14 implies that average-optimal robots tend not to idle in that particular spot, but not that they tend to avoid that spot entirely.
However, average-optimal robots do tend to avoid getting shut down. The agent’s rewardless MDP often represents agent shutdown with a terminal state. A terminal state is unable to access other 1-cycles. Since Corollary 6.14 shows that average-optimal agents tend to end up in other 1-cycles, average-optimal policies must tend to completely avoid the terminal state. Therefore, we conclude that in many such situations, average-optimal policies tend to avoid shutdown.
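A toy count illustrates the shutdown claim. The sketch below is not the paper’s construction; it assumes a hypothetical MDP in which the agent can reach a terminal `shutdown` state or any of three other 1-cycles (named `room1` to `room3` here), so that an average-optimal policy simply ends up in a 1-cycle with maximal reward. It then counts how many permutations of a reward assignment make shutdown the unique best destination.

```python
import itertools
from fractions import Fraction

# Hypothetical reachable 1-cycles; "shutdown" is the terminal state.
STATES = ["shutdown", "room1", "room2", "room3"]


def average_optimal_cycles(reward):
    """1-cycles in which an average-reward-optimal agent can end up (maximal reward)."""
    best = max(reward.values())
    return {s for s, r in reward.items() if r == best}


def shutdown_fraction(reward_values):
    """Fraction of permuted reward assignments for which shutdown is strictly optimal."""
    assignments = list(itertools.permutations(reward_values))
    count = sum(
        1
        for values in assignments
        if average_optimal_cycles(dict(zip(STATES, values))) == {"shutdown"}
    )
    return Fraction(count, len(assignments))


fraction = shutdown_fraction([3, 1, 4, 2])
print(fraction)  # 1/4
```

With distinct rewards, shutdown is the best 1-cycle in exactly one out of every four permuted assignments here, and adding more non-shutdown 1-cycles shrinks that fraction further.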
What does “most reward functions” mean quantitatively: is it just at least half of each orbit? Or, are there situations where we can guarantee that at least three-quarters of each orbit incentivizes power-seeking? I think we should be able to prove that as the environment gets more complex, there are combinatorially more permutations which enforce these similarities, and so the orbits should skew harder and harder towards power-incentivization.
I don’t yet understand the general case, but I have a strong hunch that instrumental convergence$_{\text{optimal policies}}$ is governed by how many more ways there are for power to be optimal than not optimal. And this seems like a function of the number of environmental symmetries which enforce the appropriate embedding.
One possible hope would have been:
Sure, maybe there’s a way to blow yourself up, but you’d really have to contort yourself into a pretzel in order to algorithmically select a power-seeking reward function. In other words, reasonably simple reward function specification procedures will produce non-power-seeking reward functions.
Unfortunately, there are always power-seeking reward functions not much more complex than their non-power-seeking counterparts. Here, “power-seeking” corresponds to the intuitive notions of either keeping strictly more options open (Proposition 6.9), or navigating towards larger sets of terminal states (Theorem 6.13). (Since this applies to several results, I’ll leave the meaning a bit ambiguous, with the understanding that it could be formalized if necessary.)
Theorem: Simplicity priors assign non-negligible probability to power-seeking. Consider any MDP which meets the preconditions of Proposition 6.9 or Theorem 6.13. Let $U$ be a universal Turing machine, and let $P_{U}$ be the $U$-simplicity prior over computable reward functions.
Let $\mathrm{NPS}$ be the set of non-power-seeking computable reward functions which choose a fixed non-power-seeking action in the given situation. Let $\mathrm{PS}$ be the set of computable reward functions for which seeking power is strictly optimal.^{3} Then there exists a “reasonably small” constant $C$ such that $P_{U}(\mathrm{PS})≥2^{-C}P_{U}(\mathrm{NPS})$ (the proof sketch takes $C=K_{U}(ϕ)+O(1)$ for an environmental symmetry $ϕ$).
Proof sketch: Let $ϕ$ be an environmental symmetry which satisfies the power-seeking theorem in question. Since $ϕ$ can be found by brute-force iteration through all $|S|!$ permutations on the state space, checking each to see if it meets the formal requirements of the relevant theorem, its Kolmogorov complexity $K_{U}(ϕ)$ is relatively small.
Because Lemma D.26 applies in these situations, $ϕ(\mathrm{NPS})⊆\mathrm{PS}$: $ϕ$ turns non-power-seeking reward functions into power-seeking ones. Thus, $P_{U}(\mathrm{PS})≥P_{U}(ϕ(\mathrm{NPS}))$.
Since each reward function $R∈ϕ(\mathrm{NPS})$ can be computed by computing the non-power-seeking variant and then permuting it (with $K_{U}(ϕ)$ extra bits of complexity), $K_{U}(R)≤K_{U}(ϕ^{-1}(R))+K_{U}(ϕ)+O(1)$ (with $O(1)$ counting the small number of extra bits for the code which calls the relevant functions).
Since $P_{U}$ is a simplicity prior, $P_{U}(ϕ(\mathrm{NPS}))≥2^{-(K_{U}(ϕ)+O(1))}P_{U}(\mathrm{NPS})$.
Then $P_{U}(\mathrm{PS})≥2^{-(K_{U}(ϕ)+O(1))}P_{U}(\mathrm{NPS})$. QED.
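The step from the per-function complexity bound to the bound on the prior can be spelled out. This is a sketch under two assumptions that the proof sketch leaves implicit: that the simplicity prior has the form $P_{U}(R)=c\cdot 2^{-K_{U}(R)}$ for some normalizing constant $c>0$, and that the $O(1)$ term is a single constant that does not depend on $R$. It also uses the fact that $R\mapsto ϕ⋅R$ is a bijection, so summing over $ϕ(\mathrm{NPS})$ is the same as summing over $\mathrm{NPS}$.

```latex
\begin{align*}
P_{U}\bigl(\phi(\mathrm{NPS})\bigr)
  &= \sum_{R \in \mathrm{NPS}} P_{U}(\phi \cdot R)
   = c \sum_{R \in \mathrm{NPS}} 2^{-K_{U}(\phi \cdot R)} \\
  &\geq c \sum_{R \in \mathrm{NPS}} 2^{-\left(K_{U}(R) + K_{U}(\phi) + O(1)\right)} \\
  &= 2^{-\left(K_{U}(\phi) + O(1)\right)} \cdot c \sum_{R \in \mathrm{NPS}} 2^{-K_{U}(R)} \\
  &= 2^{-\left(K_{U}(\phi) + O(1)\right)} \, P_{U}(\mathrm{NPS}).
\end{align*}
```

The inequality in the second line is the complexity bound $K_{U}(ϕ⋅R)≤K_{U}(R)+K_{U}(ϕ)+O(1)$ applied term by term.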
 Why can’t we show that $P_{U}(\mathrm{PS})≥P_{U}(\mathrm{NPS})$?
 Certain UTMs $U$ might make non-power-seeking reward functions particularly simple to express.
 This proof doesn’t assume anything about how many more options power-seeking offers than not-power-seeking. The proof only assumes the existence of a single involutive permutation $ϕ$.
 This lower bound seems rather weak. Even if $K_{U}(ϕ)+O(1)=15$ bits, $2^{-15}≈0$.
 This lower bound is indeed loose. Since most individual NPS probabilities of interest are less than 1 in one trillion, I wouldn’t be surprised if the bound were loose by at least several orders of magnitude.
 First of all, the bound implicitly assumes that the only way to compute PS reward functions is by taking NPS ones and permuting them. We should add the other ways of computing PS reward functions to $P_{U}(\mathrm{PS})$.
 There are lots of permutations $ϕ'$ we could use. $P_{U}(\mathrm{PS})$ gains probability from all of those terms. Some of these terms are probably reasonably large, since it seems implausible that all such permutations $ϕ'$ have high K-complexity. When all is said and done, we may well end up with a significant chunk of probability on PS.
 For example: the symmetric group $S_{|S|}$ has cardinality $|S|!$, and for any $R∈\mathrm{NPS}$, at least half of the $ϕ'∈S_{|S|}$ induce (weakly) power-seeking orbit elements $ϕ'⋅R$. (This argument would be strengthened by my conjectures about bigger environments $⟹$ greater fraction of orbits seek power.)
 If some significant fraction (e.g. $\frac{1}{50}$) of these $ϕ'$ are strictly power-seeking, we’re adding at least $\frac{|S|!}{2}\cdot\frac{1}{50}=\frac{|S|!}{100}$ additional terms.

Overall, it’s not surprising that the bound is loose, given the lack of assumptions about the degree of power-seeking in the environment. If the bound is anywhere near tight, then the permuted simplicity prior $ϕ⋅P_{U}$ incentivizes power-seeking with extremely high probability.^{4}
 What if $P_{U}(\mathrm{NPS})=0$?
 I think this is impossible, and I can prove that in a range of situations, but it would be a lot of work and it relies on results not in the arXiv paper.
 Even if that equation held, that would mean that power-seeking is (at least weakly) optimal for all computable reward functions. That’s hardly a reassuring situation. Note that if $P_{U}(\mathrm{NPS})>0$, then $P_{U}(\mathrm{PS})>0$.
 Most plainly, this seems like reasonable formal evidence that the simplicity prior has malign incentives.
 Power-seeking reward functions don’t have to be too complex.
 These power-seeking theorems give us important tools for reasoning formally about power-seeking behavior and its prevalence in important reward function distributions.^{5}
if you know that an agent is maximizing the expectation of an explicitly represented utility function, I would expect that to lead to goal-driven behavior most of the time, since the utility function must be relatively simple if it is explicitly represented, and simple utility functions seem particularly likely to lead to goal-directed behavior.
On its own, Goodhart’s law doesn’t explain why optimizing proxy goals leads to catastrophically bad outcomes, instead of just less-than-ideal outcomes.
I think that we’re now starting to have this kind of understanding. I suspect that power-seeking is why capable, goal-directed agency is so dangerous by default. If we want to consider more benign alternatives to goal-directed agency, then deeply understanding the rot at the heart of goal-directed agency is important for evaluating alternatives. This work lets us get a feel for the generic incentives of reinforcement learning at optimality.
For every reward function $R$ (no matter how benign, how aligned with human interests, no matter how power-averse), either $R$ or its permuted variant $ϕ⋅R$ seeks power in the given situation (intuitive-power, since the agent keeps its options open, and also formal-power, according to my proofs).
If I let myself be a bit more colorful, every reward function has lots of “evil” power-seeking variants (do note that the step from “power-seeking” to “misaligned power-seeking” requires more work). If we imagine ourselves as only knowing the orbit of the agent’s objective, then the situation looks a bit like this:
Of course, this isn’t how reward specification works: we are probably far more likely to specify certain orbit elements than others. However, the formal theory is now beginning to explain why alignment is so hard by default, and why failure might be catastrophic!
The structure of the environment often ensures that there are (potentially combinatorially) many more ways to misspecify the objective so that it seeks power, than there are ways to specify goals without power-seeking incentives.
I’m optimistic that these symmetry arguments will help us better understand a range of different tendencies. The common thread seems like: For every “way” a thing could not happen / not be a good idea—there are many more “ways” in which it could happen / be a good idea.
 convergent evolution
 flight has independently evolved several times, suggesting that flight is adaptive in response to a wide range of conditions.
Wikipedia: In his 1989 book Wonderful Life, Stephen Jay Gould argued that if one could “rewind the tape of life [and] the same conditions were encountered again, evolution could take a very different course.” Simon Conway Morris disputes this conclusion, arguing that convergence is a dominant force in evolution, and given that the same environmental and physical constraints are at work, life will inevitably evolve toward an “optimum” body plan, and at some point, evolution is bound to stumble upon intelligence, a trait presently identified with at least primates, corvids, and cetaceans.

 the prevalence of deceptive alignment
 given inner misalignment, there are (potentially combinatorially) many more unaligned terminal reasons to lie (and survive), and relatively few unaligned terminal reasons to tell the truth about the misalignment (and be modified).

 feature universality in deep learning
 computer vision networks reliably learn edge detectors, suggesting that this is instrumental (and highly learnable) for a wide range of labelling functions and datasets.
You have to be careful in applying these results to argue for real-world AI risk from deployed systems.

They assume the agent is following an optimal policy for a reward function
 I can relax this to $ϵ$-optimality, but $ϵ>0$ may be extremely small

They assume the environment is finite and fully observable

Not all environments have the right symmetries
 But most of the ones we think about seem to

The results don’t account for the ways in which we might practically express reward functions
 For example, often we use featurized reward functions. While most permutations of any featurized reward function will seek power in the considered situation, those permutations need not respect the featurization (and so may not even be practically expressible).

When I say “most objectives seek power in this situation”, that means in that situation; it doesn’t mean that most objectives take the power-seeking move in most situations in that environment
 The combinatorics conjectures will help prove the latter
This list of limitations has steadily been getting shorter over time.
I think that this work is beginning to formally explain why slightly misspecified reward functions will probably incentivize misaligned power-seeking. Here’s one hope I have for this line of research going forwards:
One naïve alignment approach involves specifying a good-seeming reward function, and then having an AI maximize its expected discounted return over time. For simplicity, we could imagine that the AI can just instantly compute an optimal policy.
Let’s precisely understand why this approach seems to be so hard to align, and why extinction seems to be the cost of failure. We don’t yet know how to design beneficial AI, but we largely agree that this naïve approach is broken. Let’s prove it.

^{1} This is actually a bit weaker than what I prove in the paper, but it’s easier to explain in words.

^{2} “Suggest” instead of “prove” because E.49’s preconditions may not always be met, depending on the details of the dynamics. I think this is probably unimportant, but that’s for future work. Also, the argument may barely not apply to this gridworld, but if you could move the vase around without destroying it, I think it goes through fine.

^{3} There are reward functions for which it’s optimal to seek power and not to seek power; for example, constant reward functions make everything optimal, and they’re certainly computable. Therefore, $\mathrm{NPS}∪\mathrm{PS}$ is a strict subset of the whole set of computable reward functions.

^{4} If you think about the permutation as a “way reward could be misspecified”, then that’s troubling. It seems plausible that this is often (but not always) a reasonable way to think about the action of the $ϕ$ permutation.

^{5} If I had to guess, this result is probably not the best available bound, nor the most important corollary of the power-seeking theorems. But I’m still excited by it (insofar as it’s appropriate to be “excited” by slight Bayesian evidence of doom).