This post treats reward functions as “specifying goals”, in some sense. As I explained in Reward Is Not The Optimization Target, this is a misconception that can seriously damage your ability to understand how AI works. Rather than “incentivizing” behavior, reward signals are (in many cases) akin to a per-datapoint learning rate. Reward chisels circuits into the AI. That’s it!
Environmental Structure Can Cause Instrumental Convergence explains how power-seeking incentives can arise because there are simply many more ways for power-seeking to be optimal, than for it not to be optimal. Colloquially, there are lots of ways for “get money and take over the world” to be part of an optimal policy, but relatively few ways for “die immediately” to be optimal. (And here, each “way something can be optimal” is a reward function which makes that thing optimal.)
But how strong is this effect, quantitatively?
I previously speculated that we should be able to get quantitative lower bounds on how many objectives incentivize power-seeking actions:
Definition. At state $s$, most reward functions incentivize action $a$ over action $a'$ when for all reward functions $R$, at least half of the orbit agrees that $a$ has at least as much action value as $a'$ does at state $s$.
...
What does “most reward functions” mean quantitatively: is it just at least half of each orbit? Or, are there situations where we can guarantee that at least three-quarters of each orbit incentivizes power-seeking? I think we should be able to prove that as the environment gets more complex, there are combinatorially more permutations which enforce these similarities, and so the orbits should skew harder and harder towards power-incentivization.
About a week later, I had my answer:
Scaling law for instrumental convergence (informal)
If policy set $Π_{A}$ lets you do “$n$ times as many things” as policy set $Π_{B}$ lets you do, then for every reward function, A is optimal over B for at least $\frac{n}{n+1}$ of its permuted variants (i.e. orbit elements).
For example, $Π_{A}$ might contain the policies where you stay alive, and $Π_{B}$ may be the other policies: the set of policies where you enter one of several death states.
Conjecture which I think I see how to prove
For almost all reward functions, A is strictly optimal over B for at least $\frac{n}{n+1}$ of its permuted variants.
Basically, when you could apply the previous results but “multiple times”,^{1} you can get lower bounds on how often the larger set of things is optimal:
Roughly, the theorem says: if the set 1 of options can be embedded 3 times into another set 2 of options (where the images are disjoint), then at least $\frac{3}{3+1} = \frac{3}{4}$ of all variations on all reward functions agree that set 2 is optimal.
And in way larger environments—like the real world, where there are trillions and trillions of things you can do if you stay alive, and not much you can do otherwise—nearly all orbit elements will make survival optimal.
I see this theory as beginning to link the richness of the agent’s environment, with the difficulty of aligning that agent: for optimal policies, instrumental convergence strengthens proportionally to the ratio of $\frac{\text{control if you survive}}{\text{control if you die}}$.
The proofs are currently in an Overleaf. But here’s one intuition, using the `candy`, `chocolate`, and `reward` example environment.
Consider any reward function which says `candy` is strictly optimal. Then `candy` is strictly optimal over both `chocolate` and `hug`.
We have two permutations: one switching the reward for `candy` and `chocolate`, and one switching reward for `candy` and `hug`. Each permutation produces a different orbit element (a different reward function variant). The permuted variants both agree that `Wait!` is strictly optimal.
So there are at least twice as many orbit elements for which `Wait!` is strictly optimal over `candy` as there are orbit elements for which `candy` is strictly optimal over `Wait!`. Either one of `Start`’s child states (`candy`/`Wait!`) is strictly optimal, or they’re both optimal. If they’re both optimal, `Wait!` is optimal. Otherwise, `Wait!` makes up at least 2/3 of the orbit elements for which strict optimality holds.
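The counting argument above can be checked by brute force. The following is a minimal sketch, not the proof from the Overleaf: it assumes the agent's value for a child of `Start` is simply the best terminal reward reachable through that child, and the names `REACHABLE`, `orbit`, `optimal_children`, and `wait_fraction` (as well as the example reward values) are hypothetical choices made for illustration.

```python
from itertools import permutations

# Terminal states of the example environment described above.
# candy is terminal; Wait! leads to either chocolate or hug.
TERMINALS = ("candy", "chocolate", "hug")
REACHABLE = {"candy": ("candy",), "Wait!": ("chocolate", "hug")}


def orbit(reward):
    """All variants obtained by permuting the rewards among terminal states."""
    values = [reward[s] for s in TERMINALS]
    return [dict(zip(TERMINALS, p)) for p in permutations(values)]


def optimal_children(reward):
    """Children of Start whose best reachable terminal reward is maximal."""
    best = {c: max(reward[s] for s in ts) for c, ts in REACHABLE.items()}
    top = max(best.values())
    return {c for c, v in best.items() if v == top}


def wait_fraction(reward):
    """Count the orbit elements for which Wait! is optimal at Start."""
    variants = orbit(reward)
    wins = sum("Wait!" in optimal_children(r) for r in variants)
    return wins, len(variants)


# Hypothetical reward values under which candy is strictly optimal.
example = {"candy": 1.0, "chocolate": 0.3, "hug": 0.5}
wins, total = wait_fraction(example)
print(f"Wait! is optimal for {wins} of {total} orbit elements")
```

With three distinct reward values, the orbit has six elements, and `candy` holds the largest reward in only two of them, so `Wait!` is optimal in the remaining four: the 2/3 from the argument above.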
Conjecture: Fractional scaling law for instrumental convergence (informal)
If staying alive lets you do $n$ “things” and dying lets you do $m≤n$ “things”, then for every reward function, staying alive is optimal for at least $\frac{n}{n+m}$ of its orbit elements.
I’m reasonably confident this is true, but I haven’t worked through the combinatorics yet. This would slightly strengthen the existing lower bounds in certain situations. For example, suppose dying gives you 2 choices of terminal state, but living gives you 51 choices. The current result only lets you prove that at least $\frac{50}{50+2} = \frac{25}{26}$ of the orbit incentivizes survival. The fractional lower bound would slightly improve this to $\frac{51}{51+2} = \frac{51}{53}$.
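To make the comparison concrete, here is a small sketch of the two bounds using exact fractions. The function names `integer_bound` and `fractional_bound` are hypothetical, and reading the current result as "the $m$ death options embed $\lfloor n/m \rfloor$ disjoint times into the $n$ survival options" is my interpretation of the 51-versus-2 example, not a statement of the theorem itself.

```python
from fractions import Fraction


def integer_bound(n, m):
    """Existing bound, assuming m options embed k = n // m disjoint times into n."""
    k = n // m
    return Fraction(k, k + 1)


def fractional_bound(n, m):
    """Conjectured bound n / (n + m); the post does not prove this."""
    return Fraction(n, n + m)


# The example from the text: living gives 51 choices, dying gives 2.
print(integer_bound(51, 2))     # 25/26
print(fractional_bound(51, 2))  # 51/53
```

The gap is small here (about 0.9615 versus 0.9623), which matches the claim that the fractional version would only slightly strengthen the existing bound.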
In certain ways, the results are indifferent to e.g. increased precision in agent sensors: it doesn’t matter if dying gives you 1 option and living gives you $n$ options, or if dying gives you 2 options and living gives you $2n$ options.
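As a quick check of this invariance, here is the arithmetic under the fractional bound conjectured above (so this rests on an unproven conjecture). With 1 option if you die and $n$ if you live, the bound is $\frac{n}{n+1}$; doubling both counts leaves it unchanged:

$$\frac{2n}{2n+2} = \frac{n}{n+1}.$$

The existing integer version agrees: one option embeds $n$ times into $n$ options, and two options embed $n$ times into $2n$ options, so both cases give $\frac{n}{n+1}$.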
Similarly, you can do the inverse operations to simplify subgraphs in a way that respects the theorems:
This is the start of a theory on what state abstractions “respect” the theorems, although there’s still a lot I don’t understand there. (I’ve barely thought about it so far.)
Last time, in addition to the “how do combinatorics work?” question I posed, I wrote several qualifications:
1. They assume the agent is following an optimal policy for a reward function
   - I can relax this to $ϵ$-optimality, but $ϵ>0$ may be extremely small
2. They assume the environment is finite and fully observable
3. Not all environments have the right symmetries
   - But most ones we think about seem to
4. The results don’t account for the ways in which we might practically express reward functions
   - For example, often we use featurized reward functions. While most permutations of any featurized reward function will seek power in the considered situation, those permutations need not respect the featurization (and so may not even be practically expressible).
5. When I say “most objectives seek power in this situation”, that means in that situation; it doesn’t mean that most objectives take the power-seeking move in most situations in that environment
   - The combinatorics conjectures will help prove the latter
Let’s take care of that last one. I was actually being too cautious, since the existing results already show us how to reason across multiple situations. The reason is simple: suppose we use my results to prove that when the agent maximizes average per-timestep reward, it’s strictly optimal for at least 99.99% of objective variants to stay alive. This is because the death states are strictly suboptimal for these variants. For all of these variants, no matter the situation the agent finds itself in, it’ll be optimal to try to avoid the strictly suboptimal death states.
This doesn’t mean that these variants always incentivize moves which are formally power-seeking, but it does mean that we can sometimes prove what optimal policies tend to do across a range of situations.
So now we find ourselves with a slimmer list of qualifications:
1. They assume the agent is following an optimal policy for a reward function
   - I can relax this to $ϵ$-optimality, but $ϵ>0$ may be extremely small
2. They assume the environment is finite and fully observable
3. Not all environments have the right symmetries
   - But most ones we think about seem to
4. The results don’t account for the ways in which we might practically express reward functions
   - For example, state-action versus state-based reward functions (this particular case doesn’t seem too bad; I was able to sketch out some nice results rather quickly, since you can convert state-action MDPs into state-based reward MDPs and then apply my results).
It turns out to be surprisingly easy to do away with (2). We’ll get to that next time.
For (3), environments which “almost” have the right symmetries should also “almost” obey the theorems. To give a quick, non-legible sketch of my reasoning:
For the uniform distribution over reward functions on the unit hypercube ($[0,1]^{|S|}$), optimality probability should be Lipschitz continuous on the available state visit distributions (in some appropriate sense). Then if the theorems are “almost” obeyed, instrumentally convergent actions still should have extremely high probability, and so most of the orbits still have to agree.
So I don’t currently view (3) as a huge deal. I’ll probably talk more about that another time.
This should bring us to interfacing with (1) (“how smart is the agent? How does it think, and what options will it tend to choose?”; this seems hard) and (4) (“for what kinds of reward specification procedures are there way more ways to incentivize power-seeking, than there are ways to not incentivize power-seeking?”; this seems more tractable).
This scaling law deconfuses me about why it seems so hard to specify nontrivial real-world objectives which don’t have incorrigible shutdown-avoidance incentives when maximized.
Thanks to Connor Leahy, Rohin Shah, Adam Shimi, and John Wentworth for feedback on this post.
alex@turntrout.com

^{1} I’m using scare quotes regularly because there aren’t short English explanations for the exact technical conditions. But this post is written so that the high-level takeaways should be right.