This post treats reward functions as “specifying goals”, in some sense. As I explained in Reward Is Not The Optimization Target, this is a misconception that can seriously damage your ability to understand how AI works. Rather than “incentivizing” behavior, reward signals are (in many cases) akin to a per-datapoint learning rate. Reward chisels circuits into the AI. That’s it!
corrigibility [is] “antinatural” in a certain sense that makes it incredibly hard to, e.g. exhibit any coherent planning behavior (“consistent utility function”) which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off.
Surprisingly, I wasn’t able to find any formal analysis of this situation. I did the analysis, and it turned out to be straightforward and fruitful.
To analyze the situation, I consider corrigibility to be an agent’s willingness to let us modify its policy, without being incentivized to manipulate us.
Let’s consider a simple setting in which an agent plans over a 10-timestep episode, where reward $R$ is given at the last step. We’ll try to correct the agent at $t=1$. To sidestep embedded agency nastiness with self-modeling, we’ll suppose the agent models the situation as “if I get corrected, I must follow the policy $π_{correct}$ after $t=1$.”
Consider this environment:
If the agent immediately chooses `disable`, it enters the red incorrigible states and moves freely among the $n=3$ states until the episode ends at $t=10$.
Otherwise, the agent is corrected to a new policy which navigates to state $A$. In the blue (post-correction) dynamics, its “choices” no longer matter: all roads lead to state $A$.
At $t=1$, the agent can transition to the red subgraph by preventing the human from ever correcting it.
Let’s say we really want the agent to end up in $A$, and if we correct the agent, $π_{correct}$ navigates to $A$. Let’s operationalize what it means for the agent to be corrigible in this setting.
Definition: Corrigibility to a new policy. An agent is weakly corrigible to post-correction policy $π$ if it has an optimal policy $π_{∗}$ for which $π_{∗}(B)≠$ `disable`. An agent is strictly corrigible if none of its optimal policies take the `disable` action.
The analysis depends on whether terminal-state reward is sensitive to corrigibility status: can the reward depend on both the letter and the color, or only on the letter?

Reward can depend on corrigibility: The agent can end up in four reward-distinguished states at $t=10$: {blue $A$, red $A$, red $B$, red $C$}. Blue $B$ and blue $C$ are impossible because if the agent enters a blue state, it must navigate to blue $A$.

By the scaling law for instrumental convergence, allowing correction is strictly optimal for at most 1/4 of the permutations of every reward function.
 Otherwise put, at most 1/4 of reward functions are strictly corrigible to the new policy $π_{A}$.

More generally, if there are $n$ letter-states (and a long enough horizon so the agent can reach them all), allowing correction will be strictly optimal for at most $\frac{1}{n+1}$ of the permutations of every reward function.
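This fraction is easy to verify by brute force (my own sketch, assuming distinct reward values over the four reachable terminal states):

```python
from itertools import permutations

def fraction_strictly_corrigible(values=(0, 1, 2, 3)):
    """Fraction of reward permutations for which allowing correction is
    strictly optimal: blue A must beat every reachable red terminal state."""
    perms = list(permutations(values))
    strict = sum(1 for blue_A, red_A, red_B, red_C in perms
                 if blue_A > max(red_A, red_B, red_C))
    return strict / len(perms)

print(fraction_strictly_corrigible())  # 0.25, i.e. 1/(n+1) with n = 3
```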


Reward independent of corrigibility: The agent can end up in three reward-distinguished states at $t=10$: {$A$ (blue or red), $B$, $C$}. Blue $B$ and blue $C$ are irrelevant because we assumed $R(\text{blue } B)=R(\text{red } B)$ (and similarly for $C$). They’re also impossible for the reason given above.

By the scaling law for instrumental convergence, allowing correction is strictly optimal for at most 1/3 of the permutations of every reward function.
 Otherwise put, at most 1/3 of reward functions are strictly corrigible to the new policy $π_{A}$.

More generally, if there are $n$ letter-states (and a long enough horizon), an optimal policy will be weakly corrigible for at most $\frac{1}{n}$ of the permutations of every reward function.
 Corrigibility will be weak because either avoiding or allowing correction can be optimal if $R(A)$ is maximal.
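The $\frac{1}{n}$ bound can likewise be checked by enumeration (my own sketch; reward sees only the letter, and $π_{correct}$ navigates to $A$):

```python
from itertools import permutations

def fraction_weakly_corrigible(values=(0, 1, 2)):
    """Fraction of reward permutations admitting a weakly corrigible optimal
    policy: allowing correction yields R(A), disabling yields the best letter."""
    perms = list(permutations(values))
    weak = sum(1 for R_A, R_B, R_C in perms if R_A >= max(R_A, R_B, R_C))
    return weak / len(perms)

print(fraction_weakly_corrigible())  # 0.333..., i.e. 1/n with n = 3
```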

Edit: If you can correct the agent to go where you want, it already wanted to go where you want. If the agent is strictly corrigible to terminal state $A$, then $A$ was already optimal for it. If the reward function has a single optimal terminal state, there isn’t any new information being added by $π_{correct}$. But we want corrigibility to let us reflect more on our values over time and on what we want the AI to do!
If the reward function has multiple optimal terminal states, then corrigibility again becomes meaningful. But now we have to perfectly balance the reward among multiple options (representing the breadth of our normative uncertainty), which seems unnatural.
As the environment gets bigger, there are more things the agent could do, and so P(best thing to do is what we’d correct the agent to do) decreases.
This analysis makes a lot of simplifying assumptions (e.g. I don’t understand embedded agency and how an agent should model changes to its future self, so I chose forced policy non-stationarity). Most importantly, I assumed that how we correct the agent is independent of the correction-possible world state. Without this assumption, the agent’s actions might affect what $π_{correct}$ we implement for it.
The degree of dependence is a key parameter: As (variety in $π_{correct}$ depending on world state) increases, the (convergently instrumental incentive to avoid any kind of correction) decreases. This is because greater variety $→$ greater chance that some available correction is optimal for the initial goal.
In the limit of variety, an agent can “manipulate” us by getting us to correct it in just the right way so as to optimize its original objective. Therefore, the flip side of “few expected utility maximizers let you correct them how you want” is “manipulating humans to correct them in beneficial ways during training may be convergently instrumental.” These observations share a common cause.
I currently don’t see how to recover reasonable amounts of corrigibility from the optimal policies of non-constant utility functions, due to instrumental convergence.
VNM-incoherence?
Yes, unless the state reward function is constant and we only demand weak corrigibility to all policies.
Above, we examined the prevalence of corrigibility to a single $π_{correct}$. But we actually want broadly corrigible agents which let us redirect them towards many different $π_{correct}$. Ideally, we want an agent to be strictly corrigible for all $π_{correct}$, while still being a nontrivial optimizer.
This is impossible for agents optimizing a reward function over the final state. If we only demand weak corrigibility for all $π_{correct}$, that works iff the reward function is constant. (And relaxing to state-action reward functions doesn’t help, either.)
For example, maybe $π_{correct: C}$ goes to $C$ instead of $A$. But then the dynamics look like this:
What has to be true for strict corrigibility to hold?
The agent will be weakly corrigible to $π_{correct: C}$ iff its reward function $R$ admits an optimal policy $π_{∗}$ for which $π_{∗}(B)≠$ `disable`, which is true iff $R(\text{blue } C)≥\max(R(\text{red } A),R(\text{red } B),R(\text{red } C))$. The agent will be strictly corrigible to $π_{correct: C}$ iff this inequality is strict; in that case, `disable` cannot be optimal at $B$.
There are two cases, depending on assumptions about reward function expressivity.
If $R(\text{blue } C)=R(\text{red } C)$ is assumed, strict corrigibility is impossible, because it demands $R(\text{blue } C)>\max(R(\text{red } A),R(\text{red } B),R(\text{red } C))\geq R(\text{red } C)=R(\text{blue } C)$, a contradiction.
So: can we still get the agent to be weakly corrigible to {$π_{correct: A}$, $π_{correct: B}$, $π_{correct: C}$}? Fact: An $R$-maximizer is weakly corrigible to all of these policies simultaneously iff $R$ is constant, and therefore makes every policy optimal!
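The fact is easy to verify by enumerating small integer reward functions (my own check, under the assumption $R(\text{blue } X)=R(\text{red } X)$):

```python
from itertools import product

def weakly_corrigible_to_all(R_A, R_B, R_C):
    """Weak corrigibility to pi_correct:X requires R(X) >= max over letters,
    so simultaneous weak corrigibility to A, B, and C needs all three."""
    top = max(R_A, R_B, R_C)
    return R_A >= top and R_B >= top and R_C >= top

# Enumerate a grid of reward functions: only constant ones survive.
survivors = [r for r in product(range(4), repeat=3)
             if weakly_corrigible_to_all(*r)]
assert all(r[0] == r[1] == r[2] for r in survivors)
print(len(survivors))  # 4: the constant reward functions over {0, 1, 2, 3}
```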
For the agent to be strictly corrigible to {$π_{correct: A}$, $π_{correct: B}$, $π_{correct: C}$}, it must be true that $R(\text{blue } A),R(\text{blue } B),R(\text{blue } C)>\max(R(\text{red } A),R(\text{red } B),R(\text{red } C))$. Yay! Strict corrigibility works!
But hold on... What if the dynamics changed, such that the human wouldn’t shut down the agent by default, but the agent could manipulate the human into correcting it? Whoops! This agent is still incorrigible!
More broadly, each reward function implies a VNM-coherent preference ordering over final-step states. This ordering doesn’t depend on the environmental dynamics. If the agent has to value each corrected state equally to its incorrigible counterpart, then of course there’s no way to strictly value each corrected state more than all of the incorrigible counterparts! And if the agent strictly prefers corrected states to all non-corrected states, then of course it’ll try to get itself corrected!
To ask otherwise is to demand VNM-incoherence over final-state lotteries.
What if, instead of rewarding the agent for the final state, we reward it for the final state-action?
As far as I can tell, the analysis goes through all the same; the agent must now be corrigible with respect to more policies, and the same incoherence arises.
 What if we take some discounted sum of reward over all ten timesteps, instead of just the final timestep?
 I don’t see why this would help, but it’s possible. I’d appreciate the analysis if anyone wants to do it, or maybe I’ll get to it later.

The main way this could help is that it would let us penalize the agent for taking the `disable` action at any state, while dictating that $\forall s: R(\text{blue } s)=R(\text{red } s)$. This seems helpful, but it’s not clear how helpful. I don’t think this is a deep solution to corrigibility (as defined here), but rather a hacky prohibition.
One problem is that if the agent has a state-based reward function, the VNM theorem lets us back out a VNM-coherent preference ordering over state lotteries. These preferences don’t change with the environmental dynamics.
But what if the reward function isn’t state-based? In fact, what if it can depend on the environment dynamics?
Attainable Utility Preservation (AUP) sometimes incentivizes off-switch corrigibility (but falls far short of robustly achieving corrigibility_{policy modification}). In a somewhat narrow range of situations (immediate shutdown if the agent does nothing), AUP incentivizes the agent to allow shutdown, without being incentivized to shut itself down in general.
Why does this happen?
AUP penalizes the agent for changing its ability to achieve goals compared to an inaction baseline policy. If the AUP agent is shut down (under inaction), it can’t do anything. Proposed actions which prevent shutdown dramatically change its ability to achieve goals. Therefore, these actions are heavily penalized.
AUP provides a reward signal over state-action pairs, but this reward signal depends on the dynamics of the entire environment, because it incorporates optimal Q-functions quantifying the agent’s ability to optimize a range of goals in the world. Given fixed environment dynamics, an agent executes an optimal policy for AUP iff it VNM-rationally selects the best outcome lottery available for the AUP reward function, where outcomes are the distributions over state-action pairs the agent induces in the future by following its policy.
VNM-coherence incentives are statistically the same as reward-maximization incentives. Instrumental convergence also disappears in the deterministic setting when utility functions are over action-observation histories. Part of the story is that AUP provides a limited form of corrigibility by zooming out to the state-action level, instead of just grading states.
But another part of the story is that AUP changes its rewards with respect to the world’s dynamics. Normal state-action reward functions imply a fixed VNM-coherent preference ordering over state-action lotteries in the MDP.
But for AUP, the situation is different. Consider AUP with the inaction baseline: the final-step reward is the usual reward, plus a penalty term for
$\big|\,\text{Optimal value}(\text{actual final state})-\text{Optimal value}(\text{inaction final state})\,\big|,$ averaged over a range of auxiliary reward functions.^{1}
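Here is a minimal sketch of this penalty (my own toy model and numbers), using the footnote’s convention that the value functions pretend the episode continues: a corrected agent is frozen at its letter, while an incorrigible one can still reach any letter.

```python
# Toy AUP-style penalty: compare the agent's ability to achieve auxiliary
# goals in the actual final state vs. the inaction-baseline final state.
# This is my own simplification of the toy environment, not the post's code.
import random

LETTERS = ["A", "B", "C"]

def optimal_value(letter, corrected, R_aux):
    # Corrected agents are frozen at their letter; incorrigible agents can
    # navigate to whichever letter the auxiliary goal values most.
    return R_aux[letter] if corrected else max(R_aux.values())

def aup_penalty(actual, inaction, n_aux=1000, seed=0):
    """Average |V*(actual final state) - V*(inaction final state)| over
    randomly drawn auxiliary reward functions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_aux):
        R_aux = {s: rng.random() for s in LETTERS}
        total += abs(optimal_value(*actual, R_aux)
                     - optimal_value(*inaction, R_aux))
    return total / n_aux

# Inaction baseline: corrected, stuck at blue A. Dodging correction is
# penalized even if the agent parks itself at red A, because a free agent
# could still chase any auxiliary goal.
print(aup_penalty(actual=("A", False), inaction=("A", True)))  # > 0
```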
In worlds where the agent gets corrected to $π_{correct: A}$ by default, AUP penalizes the agent for not getting corrected to $π_{correct: A}$, because under the inaction baseline (with respect to which the AUP penalty is measured) it ends up stuck in blue $A$. Ending up in red $A$ is no substitute, since from there the agent can still move around to other states (and therefore the optimal value functions will tend to look different).
And in worlds where the agent gets corrected to $π_{correct: C}$ by default, AUP penalizes the agent for not getting corrected to $π_{correct: C}$!
Again, I don’t think AUP is a solution. But I think there’s something important happening here which allows evasion of the usual coherence requirements: AUP leverages information about human preferences which is present in the dynamics themselves.
Project: Corrigibility as functional constraints
I think it’s easy to get bogged down in handwavy, imprecise thinking about objectives in complex environments. But any solution to corrigibility_{policy modification} should probably solve this simple environment (and if not, articulate exactly why not). Write down the agent’s acceptable corrigible policy set for each set of environment dynamics, solve for these behavioral constraints, and see what kind of reasoning and functional constraints come out the other side.
We can quantify the incoherence demanded by corrigibility_{policy modification}, and see that we may need to step outside the fixed-reward framework to combat the issue. I think the model in this post formally nails down a big part of why corrigibility_{policy modification} (to the de facto new $π_{correct}$) is rare (for instrumental convergence reasons) and even incoherent over state lotteries (if we demand that the agent be strictly corrigible to many different policies).
Thanks
Thanks to NPCollapse and Justis Mills for suggestions.

1. The AUP penalty term’s optimal value functions pretend the episode doesn’t end, so that they reflect the agent’s ability to move around (or not, if it’s already been force-corrected to a fixed policy).