This post treats reward functions as “specifying goals”, in some sense. As I explained in Reward Is Not The Optimization Target, this is a misconception that can seriously damage your ability to understand how AI works. Rather than “incentivizing” behavior, reward signals are (in many cases) akin to a per-datapoint learning rate. Reward chisels circuits into the AI. That’s it!
In 2008, Steve Omohundro’s foundational paper The Basic AI Drives conjectured that superintelligent goal-directed AIs might be incentivized to gain significant amounts of power in order to better achieve their goals. Omohundro’s conjecture bears out in toy models, and the supporting philosophical arguments are intuitive. In 2019, the conjecture was even debated by well-known AI researchers.
Power-seeking behavior has been heuristically understood as an anticipated risk, but not as a formal phenomenon with a well-understood cause. The goal of this post (and the accompanying paper, Optimal Policies Tend to Seek Power) is to change that.
It’s 2008, the ancient wild west of AI alignment. A few people have started thinking about questions like “if we gave an AI a utility function over world states, and it actually maximized that utility... what would it do?”
In particular, you might notice that wildly different utility functions seem to encourage similar strategies.
|  | Resist shutdown? | Gain computational resources? | Prevent modification of utility function? |
| --- | --- | --- | --- |
| Paperclip utility | ✓ | ✓ | ✓ |
| Blue-webcam-pixel utility | ✓ | ✓ | ✓ |
| People-look-happy utility | ✓ | ✓ | ✓ |
These strategies are unrelated to terminal preferences: the above utility functions do not award utility to e.g. resource gain in and of itself. Instead, these strategies are instrumental: they help the agent optimize its terminal utility. In particular, a wide range of utility functions incentivize these instrumental strategies. These strategies seem to be convergently instrumental.
But why?
I’m going to informally explain a formal theory which makes significant progress in answering this question. I don’t want this post to be Optimal Policies Tend to Seek Power with cuter illustrations, so please refer to the paper for the math. You can read the two concurrently.
We can formalize questions like “do ‘most’ utility maximizers resist shutdown?” as “Given some prior beliefs about the agent’s utility function, knowledge of the environment, and the fact that the agent acts optimally, with what probability do we expect it to be optimal to avoid shutdown?”.
The table’s convergently instrumental strategies are about maintaining, gaining, and exercising power over the future, in some sense. Therefore, this post will help answer:
 What does it mean for an agent to “seek power”?
 In what situations should we expect seeking power to be more probable under optimality than not seeking power?
Clarification: This post won’t tell you when you should seek power for your own goals. This post illustrates a regularity in optimal action across different goals one might pursue.
Prior work: Formalizing Convergent Instrumental Goals suggests that the vast majority of utility functions incentivize the agent to exert a lot of control over the future, assuming that these utility functions depend on “resources.” This is a big assumption: what are “resources”, and why must the AI’s utility function depend on them? We drop this assumption, assuming only unstructured reward functions over a finite Markov decision process (MDP), and show from first principles how power-seeking can often be optimal.
My theorems apply to finite MDPs; for the unfamiliar, I’ll illustrate with Pac-Man.
 Full observability
 You can see everything that’s going on; this information is packaged in the state $s$. In Pac-Man, the state is the game screen.
 Markov transition function
 The next state depends only on the choice of action a and the current state s. It doesn’t matter how we got into a situation.
 Discounted reward
 Future rewards are geometrically discounted by some discount rate $γ ∈ [0,1]$.
 At discount rate 1/2, this means that reward in one turn is half as important as immediate reward, reward in two turns is a quarter as important, and so on.
 We’ll colloquially say that agents “care a lot about the future” when $γ$ is “sufficiently” close to 1. (I’ll sometimes use quotations to flag welldefined formal concepts that I won’t unpack in this post.)
 The score in Pac-Man is the undiscounted sum of rewards-to-date.
When playing the game, the agent has to choose an action at each state. This decisionmaking function is called a policy; a policy is optimal (for a reward function $R$ and discount rate $γ$) when it always makes decisions which maximize discounted reward. This maximal quantity is called the optimal value for reward function $R$ at state $s$ and discount rate $γ$.^{1}
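Optimal value can be computed with standard value iteration. Below is a minimal sketch for a deterministic finite MDP; the two-state toy example at the bottom is hypothetical, chosen only to illustrate the Bellman recursion, and is not an environment from the paper.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-10):
    """Compute optimal state values for a deterministic finite MDP.

    T[s] lists the states reachable from s in one step; R[s] is the
    reward at state s; gamma is the discount rate in [0, 1).
    """
    V = np.zeros(len(R))
    while True:
        # Bellman optimality backup: V(s) = R(s) + gamma * max over successors
        V_new = np.array([R[s] + gamma * max(V[s2] for s2 in T[s])
                          for s in range(len(R))])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Hypothetical 2-state example: from state 0 you may stay or move to
# state 1; state 1 is absorbing and is the only rewarding state.
T = [[0, 1], [1]]
R = [0.0, 1.0]
V = value_iteration(T, R, gamma=0.5)
# V[1] = 1/(1 - 0.5) = 2, and V[0] = 0 + 0.5 * V[1] = 1.
```

For $γ < 1$ the backup is a contraction, so the iteration converges to the unique optimal value function regardless of initialization.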
By the end of this post, we’ll be able to answer questions like “with respect to a ‘neutral’ distribution over reward functions, do optimal policies have a high probability of avoiding ghosts?”.^{2}
When people say “power” in everyday speech, I think they’re often referring to one’s ability to achieve goals in general. This accords with a major philosophical school of thought on the meaning of “power”:
Sattarov, Power and Technology, p. 13: “On the dispositional view, power is regarded as a capacity, ability, or potential of a person or entity to bring about relevant social, political, or moral outcomes.”
As a definition, one’s ability to achieve goals in general seems philosophically reasonable: if you have a lot of money, you can make more things happen and you have more power. If you have social clout, you can spend that in various ways to better tailor the future to various ends. All else being equal, losing a limb decreases your power, and dying means you can’t control much at all.
This definition explains some of our intuitions about what things count as “resources.” For example, our current position in the state space means that having money allows us to exert more control over the future. However, possessing green scraps of paper would not be as helpful if one were living alone near Alpha Centauri. In a sense, resource acquisition can naturally be viewed as taking steps to increase one’s power.
Exercise: Spend a minute considering specific examples. Does this definition reasonably match your intuition?
To formalize this notion of power, let’s look at an example. Imagine a simple MDP with three choices: eat candy, eat a chocolate bar, or hug a friend.
The power of a state is how well agents can generally do by starting from that state. I’ll use power to refer to my formalization, while “power” (in quotes) refers to the intuitive concept. Importantly, we’re considering power from behind a “veil of ignorance” about the reward function. We’re averaging the best we can do for a lot of different individual goals.
We formalize the ability to achieve goals in general as the average optimal value at a state, with respect to some distribution $D$ over reward functions which we might give an agent. For simplicity, we’ll think about the maximum-entropy distribution where each state is uniformly randomly assigned a reward between 0 and 1.
Each reward function has an optimal trajectory. If chocolate has maximal reward, then the optimal trajectory is start $→$ chocolate $→$ chocolate….
From start, an optimal agent expects to average 3/4 reward per timestep for reward functions drawn from this uniform distribution $D_{unif}$. This is because you have three choices, each of which has reward between 0 and 1. The expected maximum of $n$ draws from $unif(0,1)$ is $\frac{n}{n+1}$; you have three draws here, so you expect to be able to get 3/4 reward. Some reward functions do worse than this, and some do better; but on average, they get 3/4 reward. You can test this out for yourself.
If you have no choices, you expect to average 1/2 reward: sometimes the future is great, sometimes it’s not. Conversely, the more things you can choose between, the closer the power gets to 1.
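The $\frac{n}{n+1}$ claim is easy to test empirically. A quick Monte Carlo sketch:

```python
import random

random.seed(0)

def expected_max(n, samples=200_000):
    """Monte Carlo estimate of E[max of n iid draws from unif(0, 1)]."""
    return sum(max(random.random() for _ in range(n))
               for _ in range(samples)) / samples

# The expected maximum of n draws is n/(n+1).
one_choice = expected_max(1)     # ≈ 1/2: no choices, you take what you get
three_choices = expected_max(3)  # ≈ 3/4: three choices, pick the best
```

With no choices you average 1/2, and adding choices pushes the average toward 1, matching the intuition in the text.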
Let’s slightly expand this game with a state called wait (which has the same uniform reward distribution as the other three).
When the agent barely cares at all about the future, it myopically chooses either candy or wait, depending on which provides more reward. After all, rewards beyond the next time step are geometrically discounted into thin air when the discount rate is close to 0. At start, the agent averages 2/3 optimal reward. This is because the optimal reward is the maximum of the candy and wait rewards, and the expected maximum of $n$ draws from $unif(0,1)$ is $\frac{n}{n+1}$.
However, when the agent cares a lot about the future, most of its reward is coming from which terminal state it ends up in: candy, chocolate, or hug. So, for each reward function, the agent chooses a trajectory which ends up in the best spot, and thus averages 3/4 reward each timestep. When $γ=1$, the average optimal reward is therefore 3/4. In this way, the agent’s power increases with the discount rate, since it incorporates the greater future control over where the agent ends up.
Written as a function, we have power$_{D}$(state, discount rate), which essentially returns the average optimal value for reward functions drawn from our distribution $D$, normalizing so the output is between 0 and 1. As we’ve discussed, this quantity often changes with the discount rate: as the future becomes more or less important, the agent has more or less power, depending on how much control it has over the relevant parts of that future.
By waiting, the agent seems to seek “control over the future” compared to obtaining candy. At wait, the agent still has a choice, while at candy, the agent is stuck. We can prove that for all $0 ≤ γ ≤ 1$, $power_{D_{unif}}(wait, γ) ≥ power_{D_{unif}}(candy, γ)$.
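In the $γ → 1$ limit, the normalized power at a state is roughly the expected best terminal reward still reachable from it. Here is a Monte Carlo sketch of the wait-versus-candy comparison, assuming (as in the illustration) that wait keeps the chocolate and hug terminals open while candy is absorbing:

```python
import random

random.seed(0)
SAMPLES = 200_000

# γ → 1 limit: power at a state ≈ expected best reachable terminal reward.
# At wait, the agent can still pick the better of chocolate and hug;
# at candy, it is stuck with whatever candy's reward happens to be.
power_wait = sum(max(random.random(), random.random())
                 for _ in range(SAMPLES)) / SAMPLES
power_candy = sum(random.random() for _ in range(SAMPLES)) / SAMPLES

assert power_wait > power_candy  # ≈ 2/3 versus ≈ 1/2
```

This only checks the far-sighted limit; the theorem covers all $0 ≤ γ ≤ 1$.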
Definition: power-seeking. At state $s$ and discount rate $γ$, we say that action $a$ seeks power compared to action $a’$ when the expected power after choosing $a$ is greater than the expected power after choosing $a’$.
This definition suggests several philosophical clarifications about powerseeking.
Before this definition, I thought that power-seeking was an intuitive “you know it when you see it” kind of thing. I mean, how do you answer questions like “suppose a clown steals millions of dollars from organized crime in a major city, but then he burns all of the money. Did he gain power?”.
Unclear. The question is ill-posed. Instead, we recognize that the “gain a lot of money” action was power-seeking, but the “burn the money in a big pile” part threw away a lot of power.
Suppose we’re roommates, and we can’t decide what ice cream shop to eat at today or where to move next year. We strike a deal: I choose the shop, and you decide where we live. I gain short-term power (for $γ$ close to 0), and you gain long-term power (for $γ$ close to 1).
However, when $γ$ is close to 1, state 2 has more control over terminal options and more power$_{D_{unif}}$ than state 3; accordingly, at state 1, up seeks power$_{D_{unif}}$ compared to down. Furthermore, stay is maximally power$_{D_{unif}}$-seeking for these $γ$, since the agent maintains access to all six terminal states.
We already know that power-seeking isn’t binary, but there are policies which choose a maximally power-seeking move at every state. In the above example, a maximally power-seeking agent would stay at state 1. However, this seems rather improbable: when you care a lot about the future, there are so many terminal states to choose from—why would staying put be optimal?
Analogously: consumers don’t just gain money forever and ever, never spending a dime more than necessary. Instead, they gain money in order to spend it. Agents don’t perpetually gain or preserve their power: they usually end up using it to realize high-performing trajectories.
So, we can’t expect a result like “agents always tend to gain or preserve their power.” Instead, we want theorems which tell us: in certain kinds of situations, given a choice between more and less power, what will “most” agents do?
We return to our favorite example. In the waiting game, let’s think about how optimal action tends to change as we start caring about the future more. Consider the states reachable in one turn:
The agent can be in two states. If the agent doesn’t care about the future, with what probability is it optimal to choose candy instead of wait?
It’s 50/50: since $D_{unif}$ randomly chooses a number between 0 and 1 for each state, both states have an equal chance of being optimal. Neither action is convergently instrumental / more probable under optimality.
Now consider the states reachable in two turns:
When the future matters a lot, 2/3 of reward functions have an optimal policy which waits, because two of the three terminal states are only reachable by waiting.
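We can check the 2/3 figure directly: for a far-sighted agent, waiting is optimal exactly when the better of {chocolate, hug} beats candy, and under iid uniform rewards candy is the best of the three only 1/3 of the time. A quick simulation:

```python
import random

random.seed(0)
SAMPLES = 200_000

# Waiting is optimal iff max(R(chocolate), R(hug)) > R(candy).
wait_optimal = sum(max(random.random(), random.random()) > random.random()
                   for _ in range(SAMPLES))
p_wait = wait_optimal / SAMPLES
print(p_wait)  # ≈ 2/3
```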
Definition: Action optimality probability. At discount rate $γ$, action $a$ is more probable under optimality than action $a’$ at state $s$ when
$P_{R∼D}(a \text{ is optimal at } s, γ) > P_{R∼D}(a’ \text{ is optimal at } s, γ).$
Let’s take “most agents do X” to mean “X has relatively large optimality probability.”
I think optimality probability formalizes the intuition behind the instrumental convergence thesis: with respect to our beliefs about what reward function an agent is optimizing, we may expect some actions to have a greater probability of being optimal than other actions.
Generally, my theorems assume that reward is independently and identically distributed (iid) across states, because otherwise you could have silly situations like “only candy ever has reward available, and so it’s more probable under optimality to eat candy.” We don’t expect reward to be iid for realistic tasks, but that’s OK: this is basic theory about how to begin formally reasoning about instrumental convergence and power-seeking. (Also, I think that grasping the math to a sufficient degree sharpens your thinking about the non-iid case.)
Note, 7/21/21: As explained in Environmental Structure Can Cause Instrumental Convergence, the theorems no longer require the iid assumption. This post refers to v6 of Optimal Policies Tend To Seek Power, available on arXiv.
In this environment, waiting is both power-seeking and more probable under optimality. The convergently instrumental strategies we originally noticed were also power-seeking and, seemingly, more probable under optimality. Must seeking power be more probable under optimality than not seeking power?
Nope.
Here’s a counterexample environment:
However, any reasonable notion of “power” must consider having no future choices (at state 2) to be less powerful than having one future choice (at state 3). For more detail, see Section 6 and Appendix B.3 of v6 of the paper.
If you’re curious, this happens because this quadratic reward distribution has negative skew. When computing the optimality probability of the up trajectory, we’re checking whether it maximizes discounted return. Therefore, the probability that up is optimal is
$P_{R∼D}\big(R(2) ≥ \max\big((1−γ)R(3) + (1−γ)γR(4) + γ^2 R(5),\ (1−γ)R(3) + (1−γ)γR(4) + γ^2 R(6)\big)\big).$
Weighted averages of iid draws from a left-skew distribution will look more Gaussian and therefore have fewer large outliers than the left-skew distribution does. Thus, going right will have a lower optimality probability.
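The outlier claim is easy to sanity-check numerically. Taking density $2x$ on $[0,1]$ as a stand-in negatively-skewed distribution (the paper’s quadratic distribution differs in details), averaging two iid draws visibly thins the upper tail:

```python
import random

random.seed(0)
SAMPLES = 200_000
THRESHOLD = 0.95  # what counts as a "large outlier"

def skewed():
    """One draw from a negatively-skewed distribution on [0, 1] (density 2x)."""
    return random.random() ** 0.5

p_single = sum(skewed() > THRESHOLD for _ in range(SAMPLES)) / SAMPLES
p_average = sum((skewed() + skewed()) / 2 > THRESHOLD
                for _ in range(SAMPLES)) / SAMPLES

# Averaging two draws looks "more Gaussian": fewer large outliers.
assert p_single > p_average
```

A single draw exceeds 0.95 with probability $1 − 0.95^2 ≈ 0.0975$, while the average of two draws exceeds it far less often.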
Bummer. However, we can prove sufficient conditions under which seeking power is more probable under optimality.
Let’s focus on an environment with the same rules as Tic-Tac-Toe, but considering the uniform distribution over reward functions. The agent (playing O) keeps experiencing the final state over and over when the game’s done. We bake a fixed opponent policy into the dynamics: when you choose a move, the game automatically replies. Let’s look at part of the game tree.
Whenever we make a move that ends the game, we can’t go anywhere else—we have to stay put. Since each terminal state has the same chance of being optimal, a move which doesn’t end the game is more probable under optimality than a move which ends the game.
Starting on the left, all but one move leads to ending the game, but the second-to-last move allows us to keep choosing between five more final outcomes. If you care a lot about the future, then the first green move has a 50% chance of being optimal, while each alternative action is only optimal for 10% of goals. So we see a kind of “power preservation” arising, even in Tic-Tac-Toe.
Remember how, as the agent cares more about the future, more of its power comes from its ability to wait, while also waiting becomes more probable under optimality?
The same thing happens in Tic-Tac-Toe as the agent cares more about the future.
As the agent cares more about the future, it makes a bigger and bigger difference to control what happens during later steps. Also, as the agent cares more about the future, moves which prolong the game gain optimality probability. When the agent cares enough about the future, these game-prolonging moves are both power-seeking and more probable under optimality.
Theorem summary: “Terminal option” preservation. When $γ$ is sufficiently close to 1, if two actions allow access to two disjoint sets of “terminal options”, and action $a$ allows access to “strictly more terminal options” than does $a’$, then $a$ is strictly more probable under optimality and strictly power-seeking compared to $a’$.
(This is a special case of the relevant theorems, which don’t require this kind of disjointness.)
In the wait MDP, this is why waiting is more probable under optimality and power-seeking when you care enough about the future. The full theorems are nice because they’re broadly applicable. They give you bounds on how probable under optimality one action is: if action $a$ is the only way you can access many terminal states, while $a’$ only allows access to one terminal state, then when $γ ≈ 1$, $a$ has many times greater optimality probability than $a’$. For example:
In AI: A Modern Approach (3e), the agent receives reward for reaching state $3$. The optimal policy for this reward function avoids state $2$, and you might think it’s convergently instrumental to avoid state $2$. However, a skeptic might provide a reward function for which navigating to state $2$ is optimal, and then argue that “instrumental convergence” is subjective and that there is no reasonable basis for concluding that state $2$ is generally avoided.
We can do better. When the agent cares a lot about the future, optimal policies avoid state $2$ iff the reward function doesn’t give state $2$ the most reward. State $2$ only has a $\frac{1}{11}$ chance of having the most reward. If we complicate the MDP with additional terminal states, this probability further approaches 0.
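The 1/11 figure follows from symmetry: with iid uniform rewards over the gridworld’s 11 states, each state, including the one playing the role of shutdown, is equally likely to have the most reward. A quick check:

```python
import random

random.seed(0)
SAMPLES = 200_000

best_is_shutdown = 0
for _ in range(SAMPLES):
    rewards = [random.random() for _ in range(11)]  # iid uniform over 11 states
    best_is_shutdown += rewards[0] == max(rewards)  # index 0 plays "shutdown"

frac = best_is_shutdown / SAMPLES
print(frac)  # ≈ 1/11 ≈ 0.091
```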
Taking state $2$ to represent shutdown, we see that avoiding shutdown is convergently instrumental in any MDP representing a real-world task and containing a shutdown state. Seeking power is often convergently instrumental in MDPs.
Exercise: Can you conclude that avoiding ghosts in Pac-Man is convergently instrumental for iid reward functions when the agent cares a lot about the future?
You can’t with the pseudo-theorem due to the disjointness condition: you could die now, or you could die later, so the “terminal options” aren’t disjoint. However, the real theorems do suggest this. Supposing that death induces a generic “game over” screen, touching the ghosts without a power-up traps the agent in that solitary 1-cycle.
But there are thousands of other “terminal options”; under most reasonable state reward distributions (which aren’t too positively skewed), most agents maximize average reward over time by navigating to one of the thousands of different cycles which the agent can only reach by avoiding ghosts. In contrast, most agents don’t maximize average reward by navigating to the “game over” 1-cycle. So, under e.g. the maximum-entropy uniform state reward distribution, most agents avoid the ghosts.
The results inspiring the above pseudo-theorem are easiest to apply when the “terminal option” sets are disjoint: you’re choosing to be able to reach one set, or another. One thing which the theorem says: since reward is iid, two “similar terminal options” are equally likely to be optimal a priori. If choice $A$ lets you reach more “options” than choice $B$ does, then choice $A$ yields greater power and has greater optimality probability, a priori.
This theorem’s applicability depends on what the agent can do.
But wait! What if you have a private jet that can fly anywhere in the world? Then going to the airport isn’t convergently instrumental anymore.
Generally, it’s hard to know what’s optimal for most goals. It’s easier to say that some small set of “terminal options” has low optimality probability and low power. For example, this is true of shutdown, if we represent hard shutdown as a single terminal state: a priori, it’s improbable for this terminal state to be optimal among all possible terminal states.
Sometimes, one course of action gives you “strictly more options” than another. Consider another MDP with iid reward:
The right blue gem subgraph contains a “copy” of the upper red gem subgraph. From this, we can conclude that going right to the blue gems seeks power and is more probable under optimality for all discount rates between 0 and 1!
Theorem summary: “Transient options”. If actions $a$ and $a’$ let you access disjoint parts of the state space, and $a’$ enables “trajectories” which are “similar” to a subset of the “trajectories” allowed by $a$, then $a$ seeks more power and is more probable under optimality than $a’$ for all $0 ≤ γ ≤ 1$.
This result is extremely powerful because it doesn’t care about the discount rate, but the similarity condition may be hard to satisfy.
These two theorems give us a formally correct framework for reasoning about generic optimal behavior, even if we aren’t able to compute any individual optimal policy! They reduce questions of power-seeking to checking graphical conditions.
Even though my results apply to stochastic MDPs of any finite size, we illustrated using known toy environments. However, this MDP “model” is rarely explicitly specified. Even so, ignorance of the model does not imply that the model disobeys these theorems. Instead of claiming that a specific model accurately represents the task of interest, I think it makes more sense to argue that no reasonable model could fail to exhibit convergent instrumentality and power-seeking. For example, if deactivation is represented by a single state, no reasonable model of the MDP could have most agents agreeing to be deactivated.
In realworld settings, it seems unlikely a priori that the agent’s optimal trajectories run through the relatively smaller part of future in which it cooperates with humans. These results translate that hunch into mathematics.
AI alignment research often feels slippery. We’re trying hard to become less confused about basic questions, like:
 What are “agents”?
 Do people even have “values”, and should we try to get the AI to learn them?
 What does it mean to be “corrigible”, or “deceptive”?
 What are our machine learning models even doing?
We have to do philosophical work while in a state of significant confusion and ignorance about the nature of intelligence and alignment.
In this case, we’d noticed that slight reward function misspecification seems to lead to doom, but we didn’t really know why. Intuitively, it’s pretty obvious that most agents don’t have deactivation as their dream outcome, but we couldn’t actually point to any formal explanations, and we certainly couldn’t make precise predictions.
On its own, Goodhart’s law doesn’t explain why optimizing proxy goals leads to catastrophically bad outcomes, instead of just less-than-ideal outcomes.
I think that we’re now starting to have this kind of understanding. I suspect that power-seeking is why capable, goal-directed agency is so dangerous by default. If we want to consider more benign alternatives to goal-directed agency, then deeply understanding the rot at the heart of goal-directed agency is important for evaluating alternatives. This work lets us get a feel for the generic incentives of reinforcement learning at optimality.
power might be important for reasoning about the strategy-stealing assumption (and I think it might be similar to what Paul Christiano means by “flexible influence over the future”). Evan Hubinger has already noted the utility of the distribution of attainable utility shifts for thinking about value-neutrality in this context (and power is another facet of the same phenomenon). If you want to think about whether, when, and why mesa-optimizers might try to seize power, this theory seems like a valuable tool.
Optimality probability might be relevant for thinking about myopic agency, as the work formally describes how optimal action tends to change with the discount factor.
And, of course, we’re going to use this understanding of power to design an impact measure.
There’s a lot of work I think would be exciting, most of which I suspect will support our current beliefs about power-seeking incentives:
 These results assume you can see all of the world at once.
 These results assume the environment is finite.
 These results don’t say anything about noniid reward.
 These results don’t prove that power-seeking is bad for other agents in the environment.
 These results don’t prove that power-seeking is hard to disincentivize.
 Learned policies are rarely optimal.
That said, I think there’s still an important lesson here. Imagine you have good formal reasons to suspect that typing random strings will usually blow up your computer and kill you. Would you then say, “I’m not planning to type random strings” and proceed to enter your thesis into a word processor? No. You wouldn’t type anything, not until you really, really understand what makes the computer blow up sometimes.
Speaking to the broader debate taking place in the AI research community, I think a productive stance will involve investigating and understanding these results in more detail, getting curious about unexpected phenomena, and seeing how the numbers crunch out in reasonable models.
Optimal Policies Tend to Seek Power: In the context of MDPs, we formalized a reasonable notion of power and showed conditions under which optimal policies tend to seek it. We believe that our results suggest that in general, reward functions are best optimized by seeking power. We caution that in realistic tasks, learned policies are rarely optimal—our results do not mathematically prove that hypothetical superintelligent RL agents will seek power. We hope that this work and its formalisms will foster thoughtful, serious, and rigorous discussion of this possibility.
Acknowledgements: This work was made possible by the Center for Human-Compatible AI, the Berkeley Existential Risk Initiative, and the Long-Term Future Fund.
Logan Smith (elriggs) spent an enormous amount of time writing Mathematica code to compute power and measure in arbitrary toy MDPs, saving me from computing many quintuple integrations by hand. I thank Rohin Shah for his detailed feedback and brainstorming over the summer of 2019, and I thank Andrew Critch for significantly improving this work through his detailed critiques. Last but not least, thanks to:
 Zack M. Davis, Chase Denecke, William Ellsworth, Vahid Ghadakchi, Ofer Givoli, Evan Hubinger, Neale Ratzlaff, Jess Riedel, Duncan Sabien, Davide Zagami, and TheMajor for feedback on version 1 of this post.
 Alex Appel (diffractor), Emma Fickel, Vanessa Kosoy, Steve Omohundro, Neale Ratzlaff, and Mark Xu for reading / giving feedback on version 2 of this post.

Throughout Reframing Impact, we’ve been considering an agent’s attainable utility: their ability to get what they want (their on-policy value, in RL terminology). Optimal value is a kind of “idealized” attainable utility: the agent’s attainable utility were they to act optimally. ⤴

Even though instrumental convergence was discovered when thinking about the real world, similar self-preservation strategies turn out to be convergently instrumental in e.g. Pac-Man. ⤴