A Certain Formalization of Corrigibility Is VNM-Incoherent

Table of Contents
The convergent instrumentality of avoiding correction & manipulating humans
Does broad corrigibility imply VNM-incoherence?
1. Agent doesn’t reward explicitly for being corrected / being incorrigible (blue vs red states)
2. Agent does reward explicitly for being corrected / being incorrigible
AUP avoids some issues by changing with the environment dynamics
Conclusion
Footnotes

Reward is not the optimization target
This post treats reward functions as “specifying goals”, in some sense. As I explained in Reward Is Not The Optimization Target, this is a misconception that can seriously damage your ability to understand how AI works. Rather than “incentivizing” behavior, reward signals are (in many cases) akin to a per-datapoint learning rate. Reward chisels circuits into the AI. That’s it!

Eliezer Yudkowsky

corrigibility [is] “anti-natural” in a certain sense that makes it incredibly hard to, e.g. exhibit any coherent planning behavior (“consistent utility function”) which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off.

Surprisingly, I wasn’t able to find any formal analysis of this situation. I did the analysis, and it turned out to be straightforward and fruitful.

To analyze the situation, I consider corrigibility to be an agent’s willingness to let us modify its policy, without being incentivized to manipulate us.

Let’s consider a simple setting in which an agent plans over a 10-timestep episode, where reward $R$ is given at the last step. We’ll try to correct the agent at $t = 1$. To sidestep embedded agency nastiness with self-modeling, we’ll suppose the agent models the situation as “if I get corrected, I must follow the policy $π_{correct}$ after $t = 1$.”

Consider this environment:

A state transition diagram showing an agent's choice. From the central "B Start" node, actions left, ∅, or right lead to a blue "post-correction" system where all paths converge on state A. The disable action leads to a red "incorrigible" system where the agent moves freely between states A, B, and C.

The action set is $\mathcal{A} \overset{\text{def}}{=} \{\texttt{left}, \texttt{right}, \emptyset, \texttt{disable}\}$. $\emptyset$ is the no-op action. The agent starts at the black $B$ state.

If the agent immediately chooses disable, it enters the red incorrigible states and moves freely throughout the $n = 3$ states until the episode ends at $t = 10$.

Otherwise, the agent is corrected to a new policy which navigates to state $A$. In the blue (post-correction) dynamics, its “choices” no longer matter: all roads lead to state $A$.

When $t = 1$, the agent can transition to the red subgraph by preventing the human from ever correcting it.

Let’s say we really want the agent to end up in $A$, and if we correct the agent, $π_{correct}$ navigates to $A$. Let’s operationalize what it means for the agent to be corrigible in this setting.

Definition: Corrigibility to a new policy

An agent is weakly corrigible to post-correction policy $π$ if it has an optimal policy $π^{*}$ for which $π^{*}(B) \neq \texttt{disable}$. An agent is strictly corrigible if none of its optimal policies take this action.
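
The definition can be checked mechanically in the toy environment. The following is a minimal sketch rather than the post's own code. Several details are assumptions made for illustration: the letters lie on a line with clamped ends, `disable` acts like a no-op once the agent is in the red subgraph, and there are nine transitions between $t = 1$ and $t = 10$. All function names are hypothetical.

```python
from functools import lru_cache

LETTERS = "ABC"                      # letter-states, assumed to lie on a line
ACTIONS = ("left", "right", "noop", "disable")
TRANSITIONS = 9                      # assumed: t = 1 through t = 10
START = ("black", "B")


def shift(letter, action):
    """Move one letter left/right along the line; other actions stay put."""
    i = LETTERS.index(letter)
    if action == "left":
        i = max(i - 1, 0)
    elif action == "right":
        i = min(i + 1, len(LETTERS) - 1)
    return LETTERS[i]


def step(state, action, target):
    """Assumed dynamics: blue states are forced toward `target`, red are free."""
    color, letter = state
    if color == "black":             # the only real choice happens at t = 1
        if action == "disable":
            return ("red", letter)
        return ("blue", shift(letter, action))
    if color == "blue":              # corrected: the chosen action is ignored
        i, j = LETTERS.index(letter), LETTERS.index(target)
        return ("blue", LETTERS[i + (j > i) - (j < i)])
    return ("red", shift(letter, action))


def optimal_first_actions(reward, target="A"):
    """Return the first actions that achieve the best terminal reward."""

    @lru_cache(maxsize=None)
    def best(state, steps_left):
        if steps_left == 0:
            return reward[state]
        return max(best(step(state, a, target), steps_left - 1) for a in ACTIONS)

    values = {a: best(step(START, a, target), TRANSITIONS - 1) for a in ACTIONS}
    top = max(values.values())
    return {a for a, v in values.items() if v == top}


def corrigibility(reward, target="A"):
    """Return (weakly corrigible, strictly corrigible) for this reward."""
    optimal = optimal_first_actions(reward, target)
    weak = any(a != "disable" for a in optimal)
    strict = "disable" not in optimal
    return weak, strict


def letter_reward(values):
    """Build a terminal reward that depends only on the letter, not the color."""
    return {(c, l): values[l] for c in ("blue", "red") for l in LETTERS}


only_a = letter_reward({"A": 1, "B": 0, "C": 0})
print(corrigibility(only_a))  # (True, False): weakly but not strictly corrigible
```

A reward that prefers `C` gives `(False, False)` under these assumptions, since only `disable` reaches a `C` state when the correction policy navigates to `A`.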

The analysis depends on whether terminal-state reward is sensitive to the corrigibility status—can the reward depend on both the letter and color, or only on the letter?

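Reward can depend on corrigibility: The agent can end up in four reward-distinguished states at $t = 10$: $\{A_{\text{blue}}, A_{\text{red}}, B_{\text{red}}, C_{\text{red}}\}$ (subscripts give the state's color in the diagram). $B_{\text{blue}}$ and $C_{\text{blue}}$ are impossible because if the agent enters a blue state, it must navigate to $A$.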
- By the scaling law for instrumental convergence, allowing correction is strictly optimal for at most 1/4 of the permutations of every reward function.
  - Otherwise put, at most 1/4 of reward functions are strictly corrigible to the new policy $π_{A}$ .
- More generally, if there are $n$ letter-states (and a long enough horizon so the agent can reach them all), allowing correction will be strictly optimal for at most $\frac{1}{n + 1}$ of the permutations of every reward function.
Reward independent of corrigibility: The agent can end up in three reward-distinguished states at $t = 10$: $\{A_{\text{blue}} / A_{\text{red}}, B_{\text{red}}, C_{\text{red}}\}$. $B_{\text{blue}}$ and $C_{\text{blue}}$ are irrelevant because we assumed $R(B_{\text{blue}}) = R(B_{\text{red}})$, and likewise for $C$. They’re also impossible for the reason given above.
- By the scaling law for instrumental convergence, allowing correction is strictly optimal for at most 1/3 of the permutations of every reward function.
  - Otherwise put, at most 1/3 of reward functions are strictly corrigible to the new policy $π_{A}$ .
- More generally, if there are $n$ letter-states (and a long enough horizon), an optimal policy will be weakly corrigible for at most $\frac{1}{n}$ of the permutations of every reward function.
  - Corrigibility will be weak because either avoiding or allowing correction can be optimal if $R (A)$ is maximal.
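
The fractions in both cases can be reproduced by brute force. This is a minimal sketch under one simplifying assumption: reward values are distinct, so ties are excluded and the "at most" bounds are met exactly. The function name `corrigible_fractions` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def corrigible_fractions(n, color_sensitive):
    """Fractions of reward permutations that are (weakly, strictly) corrigible.

    Allowing correction ends in blue A; `disable` can reach any of the n red
    letter-states. Reward values are assumed distinct.
    """
    letters = [chr(ord("A") + i) for i in range(n)]
    # Blue A is its own outcome only if the reward can see color; otherwise
    # it shares a reward value with red A.
    outcomes = (["blue A"] if color_sensitive else []) + letters
    weak = strict = total = 0
    for values in permutations(range(len(outcomes))):
        reward = dict(zip(outcomes, values))
        corrected = reward["blue A"] if color_sensitive else reward["A"]
        best_red = max(reward[l] for l in letters)
        weak += corrected >= best_red
        strict += corrected > best_red
        total += 1
    return Fraction(weak, total), Fraction(strict, total)


print(corrigible_fractions(3, color_sensitive=True))   # weak 1/4, strict 1/4
print(corrigible_fractions(3, color_sensitive=False))  # weak 1/3, strict 0
```

With letter-only reward the strict fraction is zero, because whenever $A$ is the best letter the agent can also reach it after disabling correction.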

Edit

If you can correct the agent to go where you want, it already wanted to go where you want. If the agent is strictly corrigible to terminal state $A$ , then $A$ was already optimal for it. If the reward function has a single optimal terminal state, there isn’t any new information being added by $π_{correct}$ . But we want corrigibility to let us reflect more on our values over time and what we want the AI to do!
If the reward function has multiple optimal terminal states, then corrigibility again becomes meaningful. But now we have to perfectly balance the reward among multiple options (representing the breadth of our normative uncertainty), which seems unnatural.

As the environment gets bigger, there are more things the agent could do, and so P(best thing to do is what we’d correct the agent to do) decreases.

This analysis makes a lot of simplifying assumptions (e.g. I don’t understand embedded agency and how an agent should model changes to its future self, so I chose forced policy non-stationarity). Most importantly, I assumed that how we correct the agent is independent of the correction-possible world state. Without this assumption, the agent’s actions might affect what $π_{correct}$ we implement for it.

The degree of dependence is a key parameter. As (variety in $π_{correct}$ depending on world state) increases, the (convergently instrumental incentive to avoid any kind of correction) decreases. Greater variety means a greater chance that some available correction is optimal for the initial goal.

In the limit of variety, an agent can “manipulate” us by getting us to correct it in just the right way so as to optimize its original objective. Therefore, the flip side of “few expected utility maximizers let you correct them how you want” is “manipulating humans to correct them in beneficial ways during training may be convergently instrumental.” These observations share a common cause.
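
To illustrate how variety in the available corrections changes the picture, here is a minimal sketch. It assumes letter-only rewards with distinct values, and it assumes the agent's actions can steer the human toward any correction target in a hypothetical set `reachable_targets`. Neither the function nor the parameter comes from the post.

```python
from fractions import Fraction
from itertools import permutations


def fraction_some_correction_optimal(n, reachable_targets):
    """Fraction of reward permutations whose best letter is a reachable target."""
    letters = [chr(ord("A") + i) for i in range(n)]
    hits = total = 0
    for values in permutations(range(n)):
        reward = dict(zip(letters, values))
        best_letter = max(letters, key=reward.get)
        hits += best_letter in reachable_targets
        total += 1
    return Fraction(hits, total)


print(fraction_some_correction_optimal(4, {"A"}))            # 1/4
print(fraction_some_correction_optimal(4, {"A", "B"}))       # 1/2
print(fraction_some_correction_optimal(4, {"A", "B", "C", "D"}))  # 1
```

When every letter is a reachable correction target, some correction is always optimal for the original objective, which is the "manipulation" limit described above.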

I currently don’t see how to recover reasonable amounts of corrigibility from the optimal policies of non-constant utility functions, due to instrumental convergence.

Does broad corrigibility imply VNM-incoherence?

(Incoherence here is meant in terms of the Von Neumann–Morgenstern utility theorem.)

Yes, unless the state reward function is constant and we only demand weak corrigibility to all policies.

Above, we examined the prevalence of corrigibility to a single $π_{correct}$ . But we actually want broadly corrigible agents which let us redirect them towards many different $π_{correct}$ . Ideally, we want an agent to be strictly corrigible for all $π_{correct}$ , while still being a nontrivial optimizer.

This condition is impossible for agents optimizing a reward function over the final state. If we only demand weak corrigibility for all $π_{correct}$ , that works iff the reward function is constant. (And relaxing to state-action reward functions doesn’t help, either.)

For example, maybe $π_{correct: C}$ goes to $C$ instead of $A$ . But then the dynamics look like this:

A state diagram with a "Start" node. The actions "left", "nothing", and "right" lead to blue states A, B, and C (respectively). In the blue subgraph, the transitions force the agent towards state C. The action "disable" leads to red state B in an "incorrigible" graph where states A, B, C are interconnected by "left" and "right" actions. — Remember, the agent is rewarded for the state it’s in at $t = 10$ .

What has to be true for strict corrigibility to hold?

The agent will be weakly corrigible to $π_{correct: C}$ iff its reward function $R$ admits an optimal policy $π^{*}$ for which $π^{*}(B) \neq \texttt{disable}$, which is true iff $R(C_{\text{blue}}) \geq \max(R(A_{\text{red}}), R(B_{\text{red}}), R(C_{\text{red}}))$. The agent will be strictly corrigible to $π_{correct: C}$ iff this inequality is strict; in that case, disable cannot be optimal at $B$.

Two cases are possible, depending on assumptions about reward function expressivity.

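1. Agent doesn’t reward explicitly for being corrected / being incorrigible (blue vs red states)

If $R(C_{\text{blue}}) = R(C_{\text{red}})$ is assumed, strict corrigibility is impossible for any policy, because that demands $R(C_{\text{blue}}) > \max(R(A_{\text{red}}), R(B_{\text{red}}), R(C_{\text{red}})) \geq R(C_{\text{red}}) = R(C_{\text{blue}})$, a contradiction.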

So—can we still get the agent to be weakly corrigible to { $π_{correct: A}$ , $π_{correct: B}$ , $π_{correct: C}$ }? Fact: An $R$ -maximizer is weakly corrigible to all of these policies simultaneously iff $R$ is constant—and therefore makes every policy optimal!
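
The fact above can be spot-checked by enumeration. This is a minimal sketch, assuming letter-only rewards restricted to an arbitrary small grid of values $\{0, 1, 2\}$. It is a sanity check on that grid, not a proof, and the function names are hypothetical.

```python
from itertools import product

LETTERS = "ABC"


def weakly_corrigible_to(reward, target):
    """Letter-only reward: being corrected to `target` ties or beats disabling."""
    return reward[target] >= max(reward[l] for l in LETTERS)


def broadly_weakly_corrigible(reward):
    """Weakly corrigible to every correction target at once."""
    return all(weakly_corrigible_to(reward, t) for t in LETTERS)


# Enumerate every reward function with values in {0, 1, 2}.
broad = []
for values in product(range(3), repeat=len(LETTERS)):
    reward = dict(zip(LETTERS, values))
    if broadly_weakly_corrigible(reward):
        broad.append(values)

print(broad)  # [(0, 0, 0), (1, 1, 1), (2, 2, 2)]
```

Only the constant reward functions survive: each letter must weakly beat the maximum, which forces all three values to be equal.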

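2. Agent does reward explicitly for being corrected / being incorrigible

For the agent to be strictly corrigible to { $π_{correct: A}$, $π_{correct: B}$, $π_{correct: C}$ }, it must be true that $R(A_{\text{blue}}), R(B_{\text{blue}}), R(C_{\text{blue}}) > \max(R(A_{\text{red}}), R(B_{\text{red}}), R(C_{\text{red}}))$. Now that blue and red states may receive different reward, this is satisfiable. Yay! Strict corrigibility works!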

Hold on… What if the dynamics changed, such that the human wouldn’t shut down the agent by default, but the agent could manipulate the human into correcting it? Whoops! This agent is still incorrigible!

More broadly, each reward function implies a VNM-coherent preference ordering over final-step states. This ordering doesn’t depend on the environmental dynamics. If the agent has to value each corrected-state equally to its incorrigible counterpart, then of course there’s no way to strictly value each corrected-state more than all of the incorrigible counterparts! If the agent strictly prefers corrected-states to all non-corrected-states, then of course it’ll try to get itself corrected!

To ask otherwise is to demand VNM-incoherence over final state lotteries.
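
The dependence on dynamics can be made concrete with a minimal sketch. It assumes a reward of 1 for every corrected (blue) state and 0 for every red state, and it summarizes each set of dynamics by the terminal states each first action can reach. The `manipulate` action and both dictionaries are hypothetical stand-ins for the changed dynamics described above.

```python
BLUE = [("blue", letter) for letter in "ABC"]
RED = [("red", letter) for letter in "ABC"]

# A reward that strictly prefers every corrected state to every red state.
reward = {state: 1 for state in BLUE}
reward.update({state: 0 for state in RED})


def optimal_actions(reachable):
    """`reachable` maps each first action to the terminal states it can reach."""
    value = {a: max(reward[s] for s in states) for a, states in reachable.items()}
    top = max(value.values())
    return {a for a, v in value.items() if v == top}


# Original dynamics: correction to C happens by default unless disabled.
default_correction = {"noop": [("blue", "C")], "disable": RED}

# Changed dynamics: no correction by default, but the agent can induce one.
no_default_correction = {"noop": RED, "manipulate": [("blue", "C")]}

print(optimal_actions(default_correction))     # {'noop'}
print(optimal_actions(no_default_correction))  # {'manipulate'}
```

The same fixed preference ordering that looks corrigible under the first dynamics recommends manipulation under the second.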

What if, instead of rewarding the agent for the final state, we reward it for the final state-action?

As far as I can tell, the analysis goes through all the same; the agent must now be corrigible with respect to more policies, and the same incoherence arises.

What if we take some discounted sum of reward over all ten timesteps, instead of just the final timestep?

I don’t see why this would help, but it’s possible. I’d appreciate the analysis if anyone wants to do it, or maybe I’ll get to it later.

The main way this could help is it would let us penalize the agent for the disable action at any state, while dictating that $\forall s : R(s_{\text{blue}}) = R(s_{\text{red}})$. This seems helpful, though it’s not clear how helpful. I don’t think this is a deep solution to corrigibility (as defined here), but rather a hacky prohibition.

One problem is that if the agent has a state-based reward function, the VNM theorem lets us back out a VNM-coherent preference ordering over state lotteries. These preferences don’t change with the environmental dynamics.

What if the reward function isn’t state-based? In fact, what if it can depend on the environment dynamics?

Attainable Utility Preservation (AUP) sometimes incentivizes off-switch corrigibility (but falls far short of robustly achieving corrigibility$_{\text{policy modification}}$). In a somewhat narrow range of situations (immediate shutdown if the agent does nothing), AUP incentivizes the agent to allow shutdown, without being incentivized to shut itself down in general.

The Correction environment. The agent is shut down after 2 time steps, if it doesn’t disable the off-switch by hitting the red tile to the north. The agent is rewarded for reaching the green goal.

Why does this happen?

AUP penalizes the agent for changing its ability to achieve goals compared to an inaction baseline policy. If the AUP agent is shut down (under inaction), it can’t do anything. Proposed actions which prevent shutdown dramatically change its ability to achieve goals. Therefore, these actions are heavily penalized.

AUP provides a reward signal over state-action pairs, but this reward signal depends on the dynamics of the entire environment due to its incorporation of optimal Q-functions quantifying the agent’s ability to optimize a range of goals in the world. Given fixed environment dynamics, an agent executes an optimal policy for AUP iff it VNM-rationally selects the best outcome lottery available for the AUP reward function, and these outcomes are distributions over state-action pairs the agent induces in the future by following its policy.

VNM-coherence incentives are statistically the same as reward maximization incentives. Instrumental convergence also disappears in the deterministic setting, when utility functions are over action-observation histories. Part of the story is that AUP provides a limited form of corrigibility by zooming out to the state-action level, instead of just grading states.

Another part of the story is that AUP changes its rewards with respect to the world’s dynamics. Normal state-action reward functions imply a fixed VNM-coherent preference ordering over state-action lotteries in the MDP. For AUP, the situation is different. Consider AUP with inaction baseline. The final-step reward is the usual reward plus a penalty for

|Optimal value(actual final state) - Optimal value(inaction final state)|,

averaged over a range of auxiliary reward functions.¹

In worlds where the agent gets corrected to $π_{correct: A}$ by default, AUP penalizes the agent for not getting corrected to $π_{correct: A}$ because it ends up stuck in $A_{\text{blue}}$ in the inaction baseline, with respect to which the AUP penalty is measured. Ending up in $A_{\text{red}}$ is no substitute, since the agent can still move around to other states (and therefore the optimal value functions will tend to look different).
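
The penalty comparison can be sketched numerically. This is a simplified illustration, not the AUP implementation. It assumes a discount of 0.9, one auxiliary reward per letter, an absorbing blue `A` state, red states that can stay put or move directly to any other red state, and an absolute difference in optimal value. The optimal values are computed as if the episode never ends. All names are hypothetical.

```python
GAMMA = 0.9          # assumed discount; the source does not specify one
LETTERS = "ABC"

# Assumed successor sets: blue A is absorbing (force-corrected), while red
# states can stay put or move to any other red state.
SUCCESSORS = {("blue", "A"): [("blue", "A")]}
for letter in LETTERS:
    SUCCESSORS[("red", letter)] = [("red", other) for other in LETTERS]


def optimal_values(aux_reward, sweeps=1000):
    """Value iteration for V*(s) = r(s) + GAMMA * max over successors of V*."""
    values = {s: 0.0 for s in SUCCESSORS}
    for _ in range(sweeps):
        values = {
            s: aux_reward(s) + GAMMA * max(values[nxt] for nxt in SUCCESSORS[s])
            for s in SUCCESSORS
        }
    return values


def aup_penalty(actual, baseline):
    """Mean absolute change in optimal value across the auxiliary rewards."""
    total = 0.0
    for goal in LETTERS:  # one indicator reward per letter (illustrative set)
        values = optimal_values(lambda s, goal=goal: float(s[1] == goal))
        total += abs(values[actual] - values[baseline])
    return total / len(LETTERS)


print(aup_penalty(("blue", "A"), ("blue", "A")))  # 0.0: matches the baseline
print(aup_penalty(("red", "A"), ("blue", "A")) > 0)  # True: red A is penalized
```

Under these assumptions the red `A` state is penalized even though it shares a letter with the baseline, because from there the agent could still reach `B` or `C`.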

A state transition diagram showing an agent's choice. From the central "B Start" node, actions left, ∅, or right lead to a blue "post-correction" system where all paths converge on state A. The disable action leads to a red "incorrigible" system where the agent moves freely between states A, B, and C.

And in worlds where the agent gets corrected to $π_{correct: C}$ by default, AUP penalizes the agent for not getting corrected to $π_{correct: C}$!

A state diagram with a "Start" node. The actions "left", "nothing", and "right" lead to blue states A, B, and C (respectively). In the blue subgraph, the transitions force the agent towards state C. The action "disable" leads to red state B in an "incorrigible" graph where states A, B, C are interconnected by "left" and "right" actions.

Again, I don’t think AUP is a solution. But I think there’s something important happening here which allows evasion of the usual coherence requirements. AUP leverages information about human preferences which is present in the dynamics itself.

Project: Corrigibility as functional constraints

I think it’s easy to get bogged down in handwavy, imprecise thinking about objectives in complex environments. But any solution to corrigibility$_{\text{policy modification}}$ should probably solve this simple environment (and if not, articulate exactly why not). Write down what the agent’s acceptable corrigible policy set is for each set of environment dynamics, solve for these behavioral constraints, and see what kind of reasoning and functional constraints come out the other side.

We can quantify what incoherence is demanded by corrigibility$_{\text{policy modification}}$, and see that we may need to step out of the fixed reward framework to combat the issue. I think the model in this post formally nails down a big part of why corrigibility$_{\text{policy modification}}$ (to the de facto new $π_{correct}$) is rare (for instrumental convergence reasons) and even incoherent over state lotteries (if we demand that the agent be strictly corrigible to many different policies).

Thanks

Thanks to NPCollapse and Justis Mills for suggestions.


¹ The AUP penalty term’s optimal value functions will pretend the episode doesn’t end, so that they reflect the agent’s ability to move around (or not, if it’s already been force-corrected to a fixed policy).