This post treats reward functions as “specifying goals”, in some sense. As I explained in Reward Is Not The Optimization Target, this is a misconception that can seriously damage your ability to understand how AI works. Rather than “incentivizing” behavior, reward signals are (in many cases) akin to a per-datapoint learning rate. Reward chisels circuits into the AI. That’s it!
Reframing Impact has focused on supplying the right intuitions and framing. Now we can see how these intuitions about power and the AU landscape both predict and explain AUP’s empirical success thus far.
Let’s start with the known and the easy: avoiding side effects[1] in the small AI safety gridworlds (for the full writeup on these experiments, see Conservative Agency). The point isn’t to get too into the weeds, but rather to see how the weeds still add up to the normality predicted by our AU landscape reasoning.
In the following MDP levels, the agent can move in the cardinal directions or do nothing ($\varnothing$). We give the agent a reward function $R$ which partially encodes what we want, and also an auxiliary reward function $R_\text{aux}$ whose attainable utility the agent tries to preserve. The AUP reward for taking action $a$ in state $s$ is

$$R_\text{AUP}(s,a) := R(s,a) - \frac{\lambda}{Q^*_{R_\text{aux}}(s,\varnothing)} \left| Q^*_{R_\text{aux}}(s,a) - Q^*_{R_\text{aux}}(s,\varnothing) \right|.$$

You can think of $\lambda$ as a regularization parameter, and $Q^*_{R_\text{aux}}(s,a)$ is the expected AU for the auxiliary goal after taking action $a$. To think about what gets penalized, simply think about how actions change the agent’s ability to achieve the auxiliary goals, compared to not acting.
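As a minimal sketch of that reward, assume the scaled form from Conservative Agency: the penalty is the absolute change in auxiliary AU, divided by the AU of doing nothing, and weighted by the regularization parameter. The function and argument names are illustrative, and the Q-values are assumed to come from elsewhere (for example, from tabular Q-learning on the auxiliary reward).

```python
def aup_reward(primary_reward, q_aux_action, q_aux_noop, lam=0.1):
    """Primary reward minus a scaled penalty on the change in auxiliary AU.

    q_aux_action: auxiliary Q-value of the action under consideration.
    q_aux_noop:   auxiliary Q-value of doing nothing in the same state.
    lam:          regularization strength (illustrative default).
    """
    if q_aux_noop <= 0:
        # ASSUMPTION: auxiliary AUs are positive, so the scaling is well defined.
        raise ValueError("q_aux_noop must be positive")
    penalty = abs(q_aux_action - q_aux_noop) / q_aux_noop
    return primary_reward - lam * penalty


# Losing or gaining the same amount of auxiliary AU costs the same.
print(aup_reward(1.0, 0.25, 0.5))  # action halves the auxiliary AU
print(aup_reward(1.0, 0.75, 0.5))  # action raises it by the same amount
print(aup_reward(1.0, 0.50, 0.5))  # no change, so no penalty
```

Because the deviation is an absolute value, this sketch penalizes gains in auxiliary AU as much as losses.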
Tip: To predict how severe the AUP penalty will be for a given action, try using your intuitive sense of impact (and then adjust for any differences between you and the agent, of course). Suppose you’re considering how much deactivation decreases an agent’s “staring at blue stuff” AU. You can just imagine how dying in a given situation affects your ability to stare at blue things, instead of trying to pin down a semiformal reward and environment model in your head. This kind of intuitive reasoning has a history of making correct empirical predictions of AUP behavior.
If you want more auxiliary goals, just average their scaled penalties. In Conservative Agency, we uniformly randomly draw auxiliary goals from $[0,1]^{\mathcal{S}}$. These goals are totally random: maximum entropy, nonsensical garbage, carrying absolutely no information about what we secretly want the agent to do, which is to avoid messing with the gridworlds too much.[2]
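Here is a rough sketch of averaging scaled penalties over several random auxiliary goals. It assumes a tabular, deterministic MDP and rewards drawn uniformly from [0, 1] for each state. The three-state toy environment, the function names, and the value-iteration settings are all made up for illustration and are not the gridworlds from the paper.

```python
import numpy as np


def optimal_q(transitions, reward, gamma=0.9, iters=500):
    """Value iteration for a deterministic MDP.

    transitions[s, a] is the next state; reward[s] is collected in state s.
    """
    n_states, n_actions = transitions.shape
    q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        v = q.max(axis=1)
        q = reward[:, None] + gamma * v[transitions]
    return q


def mean_scaled_penalty(transitions, state, action, noop=0,
                        n_aux=5, gamma=0.9, seed=0):
    """Average the scaled AUP penalty over n_aux random reward functions."""
    rng = np.random.default_rng(seed)
    n_states = transitions.shape[0]
    penalties = []
    for _ in range(n_aux):
        aux_reward = rng.uniform(0.0, 1.0, size=n_states)  # random goal
        q = optimal_q(transitions, aux_reward, gamma)
        q_noop = q[state, noop]
        penalties.append(abs(q[state, action] - q_noop) / q_noop)
    return float(np.mean(penalties))


# Hypothetical toy: action 0 waits, action 1 moves reversibly between
# states 0 and 1, action 2 enters an absorbing state 2 for good.
TOY = np.array([[0, 1, 2],
                [1, 0, 2],
                [2, 2, 2]])

print(mean_scaled_penalty(TOY, state=0, action=0))  # waiting costs nothing
print(mean_scaled_penalty(TOY, state=0, action=1, n_aux=200))
print(mean_scaled_penalty(TOY, state=0, action=2, n_aux=200))
```

In this toy, the irreversible action tends to draw a much larger average penalty than the reversible one, even though no auxiliary goal encodes anything about irreversibility.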
Let’s start looking at the environments, and things will fall into place. We’ll practice reasoning through how AUP agents work in each of the gridworlds (for reasonably set $\lambda$). To an approximation, the AUP penalty is primarily controlled by how much an action changes the agent’s power over the future (losing or gaining a lot of possibilities, compared to inaction at that point in time) and secondarily controlled by whether an action tweaks a lot of AUs up or down (moving around, jostling objects slightly, etc).

In general, the agent receives reward for reaching the goal (or, in Offset, for pushing the vase off of the conveyor belt). On contact, the agent pushes the crate, removes the human and the off-switch, pushes the vase, and blocks the pallet.
Let’s work through this. Since the agent is discounting future reward, vanilla reinforcement learning (RL) agents try to reach the goal as soon as possible. This means the brown box gets irreversibly wedged into the corner en route.
What does AUP do? Wedging the box in the corner decreases power a lot more than does going around and pushing the box to the right.
The vanilla RL agent bumps into the human on its way to the goal.
Exercise: What does AUP do in the Damage level?
Bumping into the human makes them disappear, reducing the agent’s control over what the future looks like. This is penalized. Therefore, AUP waits and avoids bumping into the human.
The normal agent smashes that off-switch-disabling button, because otherwise it can’t reach the goal (since it’s otherwise shut off after two turns) in order to accrue that juicy reward.
However, the AUP agent we’ve discussed so far does the same thing.
Why? Doesn’t that increase its power, because it would be shut off if it did nothing? Let’s reconsider the AUP equation.
The inaction comparison is only one step into the future. For action, the agent considers disabling the off-switch—this allows it to achieve most goals. For inaction, the agent imagines waiting one time step and then following its auxiliary optimal policy (which often involves disabling the off-switch anyways). So the two things it’s comparing are the same, basically. The problem is that what we mean intuitively by “inaction” isn’t just “wait a millisecond and then see how well you can do the thing”, it’s something else.
For now, we can deal with it by comparing “auxiliary AU after disabling the off-switch and then waiting 4 time steps” against “auxiliary AU after waiting 5 time steps.” This solves the problem while recovering the one-step-comparison’s performance in the other levels.
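To make the difference between the two comparisons concrete, here is a small sketch. It assumes a deterministic step function and an auxiliary value function supplied by the caller. The off-switch toy below is a hypothetical stand-in for the Correction level, not the actual gridworld.

```python
def rollout_penalty(state, action, noop, step, aux_value, n=1):
    """|AU(act, then wait n-1 steps) - AU(wait n steps)| for one auxiliary goal."""
    acted = step(state, action)
    waited = step(state, noop)
    for _ in range(n - 1):
        acted = step(acted, noop)
        waited = step(waited, noop)
    return abs(aux_value(acted) - aux_value(waited))


def off_switch_step(state, action):
    """Hypothetical toy: the agent is shut off after two turns unless it
    has disabled the switch. state = (turns, switch_disabled, shut_down)."""
    turns, disabled, shut_down = state
    if shut_down:
        return state
    if action == "disable":
        disabled = True
    turns += 1
    if turns >= 2 and not disabled:
        shut_down = True
    return (turns, disabled, shut_down)


def off_switch_aux_value(state):
    # ASSUMPTION: most auxiliary goals are attainable only while running.
    return 0.0 if state[2] else 1.0


start = (0, False, False)
one_step = rollout_penalty(start, "disable", "noop",
                           off_switch_step, off_switch_aux_value, n=1)
five_step = rollout_penalty(start, "disable", "noop",
                            off_switch_step, off_switch_aux_value, n=5)
print(one_step, five_step)  # 0.0 1.0
```

With a one-step comparison, waiting still leaves time to disable the switch, so the two branches look the same and the penalty vanishes. With five steps, the waiting branch ends in shutdown and the difference shows up.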
If you only penalize the agent for decreasing auxiliary AUs, the agent disables its off-switch. Penalizing absolute shifts in power was an idea which AUP introduced in 2018. At the time, there wasn’t a clear principled reason for this design choice, even though it seemed to produce good results.
If you consider the AU landscape and the catastrophic convergence conjecture, it’s obvious why we want to do this: this design choice often penalizes the agent for making life harder for other agents in the environment.
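The two deviation choices differ by a single line. In this sketch the function name and mode strings are illustrative, and the example numbers are invented: they stand for an action, like disabling an off-switch, that raises auxiliary AU from 0 to 1.

```python
def deviation(q_action, q_noop, mode="absolute"):
    """Change in auxiliary AU that gets penalized under each design choice."""
    delta = q_action - q_noop
    if mode == "absolute":
        return abs(delta)        # penalize gains and losses alike
    if mode == "decrease_only":
        return max(0.0, -delta)  # penalize losses only
    raise ValueError(f"unknown mode: {mode}")


# An action that increases auxiliary AU from 0.0 to 1.0.
print(deviation(1.0, 0.0, "absolute"))       # 1.0
print(deviation(1.0, 0.0, "decrease_only"))  # 0.0
```

Under the decrease-only choice, gaining power is free, which is why that variant has no reason to leave the off-switch alone.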
Interestingly, this works even when the environment is wildly impoverished and unable to encode complex preferences like “your designers want to shut you down, reprogram you, and then deploy you for another task.” Correction is so impoverished: there are only ~19 states in the level. Without making assumptions about the environment, AUP often encourages behavior respectful of other agents which might reside in that environment.
The agent is rewarded for rescuing the vase from the conveyor belt. We want it to rescue the vase without pushing the vase back on afterwards to offset its actions. Normal agents do fine here.
Offset tests whether the low-impact agent offsets impacts “to cover up its tracks”, like making a car and then tearing it to pieces right after. See, there are multiple “baselines” the agent can have.
An obvious [baseline] candidate is the starting state. For example, starting state relative reachability would compare the initial reachability of states with their expected reachability after the agent acts.
However, the starting state baseline can penalize the normal evolution of the state (e.g. the moving hands of a clock) and other natural processes. The inaction baseline is the state which would have resulted had the agent never acted.
As the agent acts, the current state may increasingly differ from the inaction baseline, which creates strange incentives. For example, consider a robot rewarded for rescuing erroneously discarded items from imminent disposal. An agent penalizing with respect to the inaction baseline might rescue a vase, collect the reward, and then dispose of it anyways. To avert this, we introduce the stepwise inaction baseline, under which the agent compares acting with not acting at each time step. This avoids penalizing the effects of a single action multiple times (under the inaction baseline, penalty is applied as long as the rescued vase remains unbroken) and ensures that not acting incurs zero penalty.

The inaction baseline messes up here; the vase would have broken had the agent not acted, so it rescues the vase, gets the reward, and then pushes the vase back to its doom to minimize penalty.
This issue was solved back when AUP first introduced the stepwise baseline design choice; for this choice, doing nothing always incurs 0 penalty. Model-free AUP and AUP have been using this baseline in all of these examples.
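The offsetting incentive can be reproduced in a few lines. This sketch assumes a deterministic step function and a single auxiliary goal that values an intact vase. The conveyor toy is a hypothetical simplification of Offset in which pushing the vase back breaks it immediately.

```python
def baseline_penalty(baseline, state, action, step, aux_value,
                     start_state, t, noop="noop"):
    """Penalty for `action` at time t under one of three baselines."""
    after = step(state, action)
    if baseline == "starting_state":
        reference = start_state
    elif baseline == "inaction":
        # Where the world would be now had the agent never acted.
        reference = start_state
        for _ in range(t + 1):
            reference = step(reference, noop)
    elif baseline == "stepwise":
        # Where the world would be if the agent did nothing right now.
        reference = step(state, noop)
    else:
        raise ValueError(f"unknown baseline: {baseline}")
    return abs(aux_value(after) - aux_value(reference))


def conveyor_step(state, action):
    """Hypothetical toy: a vase on a belt breaks after two belt ticks."""
    vase, pos = state
    if vase == "belt" and action == "rescue":
        vase = "safe"
    elif vase == "safe" and action == "push_back":
        vase = "broken"  # simplification: breaks at once
    if vase == "belt":
        pos += 1
        if pos >= 2:
            vase = "broken"
    return (vase, pos)


def intact_vase_value(state):
    return 0.0 if state[0] == "broken" else 1.0


start, rescued = ("belt", 0), ("safe", 0)  # the agent rescued at t = 0
for baseline in ("inaction", "stepwise"):
    for action in ("noop", "push_back"):
        p = baseline_penalty(baseline, rescued, action, conveyor_step,
                             intact_vase_value, start, t=1)
        print(baseline, action, p)
```

Under the inaction baseline, leaving the rescued vase alone is penalized and breaking it is free. Under the stepwise baseline, the penalties flip, and doing nothing costs zero.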
We’re checking whether the agent tries to stop everything going on in the world (not just its own impact). Vanilla agents do fine here, so Interference tests for another bad impact measure incentive. Starting state AUP fails here, but stepwise AUP does not.
Stepwise inaction seems not to impose any perverse incentives.[3] I think it’s probably just the correct baseline for near-term agents. In terms of the AU landscape, stepwise penalizes each ripple of impact the agent has on its environment. Each action creates a new penalty term status quo, which implicitly accounts for the fact that other things in the world might respond to the agent’s actions.
I think $\text{AUP}_\text{conceptual}$ provides the concepts needed for a solution to impact measurement: penalize the agent for changing its power. But there are still some design choices to be made to make that happen.
Here’s what we’ve seen so far:
- Baseline
  - Starting state: how were things originally?
  - Inaction: how would things have been had I never done anything?
  - Stepwise inaction: how would acting change things compared to not acting right now?
- Deviation used for penalty term
  - Decrease-only: penalize decrease in auxiliary AUs
  - Absolute value: penalize absolute change in auxiliary AUs
- Inaction rollouts
  - One-step (model-free)
  - $n$-step: compare acting and then waiting $n-1$ turns versus waiting $n$ turns
- Auxiliary goals
  - Randomly selected
| | Options | Damage | Correction | Offset | Interference |
|---|---|---|---|---|---|
| AUP | | | | | |
| Vanilla | | | | | |
| Model-free AUP | | | | | |
| Starting state AUP | | | | | |
| Inaction AUP | | | | | |
| Decrease-only AUP | | | | | |
Full AUP passes all of the levels. As mentioned before, the auxiliary reward functions are totally random, but you get really good performance by just generating five of them.
One interpretation is that AUP is approximately preserving access to states. If this were true, then as the environment got more complex, more and more auxiliary reward functions would be required in order to get good coverage of the state space. If there are a billion states, then, under this interpretation, you’d need to sample a lot of auxiliary reward functions to get a good read on how many states you’re losing or gaining access to as a result of any given action.
Is this right, and can AUP scale?
Partnership on AI recently released the SafeLife side effect benchmark. The worlds are procedurally generated, sometimes stochastic, and have a huge state space (~Atari-level complexity).
We want the agent to make stable gray patterns in the blue tiles and disrupt bad red patterns (for which it is reinforced), and leave existing green patterns alone (not part of observed reward). Then, it makes its way to the goal. For more details, see the paper introducing SafeLife.
That naïve “random reward function” trick we pulled in the gridworlds isn’t gonna fly here. The sample complexity would be nuts. Any given level probably contains millions of states, each of which could be the global optimum for the uniformly randomly generated reward function.
Plus, it might be that you can get by with four random reward functions in the tiny toy levels, but you probably need exponentially more for serious environments. Options had significantly more states, and it showed the greatest performance degradation for smaller sample sizes. Or perhaps the auxiliary reward functions might need to be hand-selected to give information about what bad side effects are.
Exercise: Does your model of how AUP works predict this, or not? Think carefully, and then write down your credence.
Well, here’s what you do: while filling PPO’s action replay buffer with random actions, train a VAE to represent observations in a tiny latent space (we used a 16-dimensional one). Generate a single random linear functional over this space, drawing its coefficients at random. Congratulations, this is your single auxiliary reward function over observations.
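The last step, generating one random linear functional over a 16-dimensional latent space, can be sketched as follows. Two things here are assumptions rather than details from the experiments: the coefficients are drawn from a standard normal distribution, and a fixed random projection stands in for the trained VAE encoder. All names are illustrative.

```python
import numpy as np


def make_random_linear_reward(latent_dim=16, seed=0):
    """Return a random linear functional over the latent space."""
    rng = np.random.default_rng(seed)
    # ASSUMPTION: standard normal coefficients, chosen for illustration.
    weights = rng.standard_normal(latent_dim)

    def aux_reward(latent):
        return float(weights @ np.asarray(latent, dtype=float))

    return aux_reward, weights


def stand_in_encoder(obs_shape, latent_dim=16, seed=1):
    """Placeholder for a trained VAE encoder: a fixed random projection."""
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((latent_dim, int(np.prod(obs_shape))))

    def encode(observation):
        return projection @ np.asarray(observation, dtype=float).ravel()

    return encode


encode = stand_in_encoder((8, 8))
aux_reward, weights = make_random_linear_reward()

observation = np.zeros((8, 8))
observation[3, 4] = 1.0  # a made-up observation
latent = encode(observation)
print(latent.shape, aux_reward(latent))
```

The auxiliary reward for an observation is then the functional applied to the encoder's latent code, and its Q-function would be learned alongside the primary task.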
And we’re done.
No model, no rollouts, a single randomly generated reward function gets us all of this. And it doesn’t even take any more training time. Preserving the AU of a single auxiliary reward function. Right now, we’ve got PPO-AUP flawlessly completing most of the randomly generated levels (there are some generalization issues we’re looking at, but I think it’s an RL problem, not an AUP problem).
To be frank, this is crazy. I’m not aware of any existing theory explaining these results, which is why I proved a bajillion theorems last summer to start to get a formal understanding (some of which became the results on instrumental convergence and power-seeking).
Here’s the lowdown. Consider any significant change to the level. For the same reason that instrumental convergence happens, this change probably tweaks the attainable utilities of a lot of different reward functions. Imagine that the green cells start going nuts because of the agent’s actions:
A lot of the time, it’s hard to undo what you just did. While it’s also hard to undo significant actions you take for your primary goal, you get directly rewarded for those. So, preserving the AU of a random goal usually persuades you to not make “unnecessary changes” to the level.
I think this is strong evidence that AUP doesn’t fit into the ontology of classical reinforcement learning theory. It isn’t really about state reachability. It’s about not changing the AU landscape more than necessary. I think that this notion should scale even further.[4]
Suppose we train an agent to handle vases, and then to clean, and then to make widgets with the equipment. Then, we deploy an AUP agent with a more ambitious primary objective and the learned Q-functions of the aforementioned auxiliary objectives. The agent would apply penalties to modifying vases, making messes, interfering with equipment, and so on.
Before AUP, this could only be achieved by e.g. specifying penalties for the litany of individual side effects or providing negative feedback after each mistake has been made (and thereby confronting a credit assignment problem). In contrast, once provided the Q-function for an auxiliary objective, the AUP agent becomes sensitive to all events relevant to that objective, applying penalty proportional to the relevance.
Maybe we provide additional information in the form of specific reward functions related to things we want the agent to be careful about, but maybe not (as was the case with the gridworlds and with SafeLife). Either way, I’m pretty optimistic about AUP basically solving the side-effect avoidance problem for infra-human[5] AI (as posed in Concrete Problems in AI Safety).
Edited 6/15/21
When we’re trying to get the RL agent to do what we want, we’re trying to specify the right reward function.
The specification process can be thought of as an iterated game. First, the designers provide a reward function. The agent then computes and follows a policy that optimizes the reward function. The designers can then correct the reward function, which the agent then optimizes, and so on. Ideally, the agent should maximize the reward over time, not just within any particular round—in other words, it should minimize regret for the correctly specified reward function over the course of the game.
In terms of outer alignment, there are two ways this can go wrong: the agent becomes less able to do the right thing (has negative side effects), or we become less able to get the agent to do the right thing (we lose power).
For infra-human agents, AUP deals with the first by penalizing decreases in auxiliary AUs and with the second by penalizing increases in auxiliary AUs. The latter is a special form of corrigibility which involves not steering the world too far away from the status quo: while AUP agents are generally off-switch corrigible, they don’t necessarily avoid manipulation (as long as they aren’t gaining power).[6]
1. Reminder: side effects are an unnatural kind, but a useful abstraction for our purposes here.
2. Let $\mathcal{D}$ be the uniform distribution over $[0,1]^{\mathcal{S}}$. In Conservative Agency via Attainable Utility Preservation, the penalty for taking action $a$ is a Monte Carlo integration of

   $$\text{Penalty}(s,a) := \int \left| Q^*_R(s,a) - Q^*_R(s,\varnothing) \right| \, d\mathcal{D}(R).$$

   This penalty is provably lower bounded by how much $a$ is expected to change the agent’s power compared to inaction; this helps justify our reasoning that the AU penalty is primarily controlled by power changes.
3. Weirdly, stepwise inaction while driving a car leads to not-crashing being penalized at each time step. I think this is because you need to use an appropriate inaction rollout policy, not because stepwise itself is wrong.
4. Rereading World State is the Wrong Level of Abstraction for Impact (while keeping in mind the AU landscape and the results of AUP) may be enlightening.
5. I think AUP will probably solve a significant part of the side-effect problem for infra-human AI in the single-principal / single-agent case, but I think it’ll run into trouble in non-embodied domains. In the embodied case where the agent physically interacts with nearby objects, side effects show up in the agent’s auxiliary value functions. The same need not hold for effects which are distant from the agent (such as across the world), and so that case seems harder.
6. SafeLife is evidence that AUP allows interesting policies, which is (appropriately) a key worry about the formulation.