This post treats reward functions as “specifying goals”, in some sense. As I explained in Reward Is Not The Optimization Target, this is a misconception that can seriously damage your ability to understand how AI works. Rather than “incentivizing” behavior, reward signals are (in many cases) akin to a per-datapoint learning rate. Reward chisels circuits into the AI. That’s it!

Quotes

Suppose a designer wants an RL agent to achieve some goal, like moving a box from one side of a room to the other. Sometimes the most effective way to achieve the goal involves doing something unrelated and destructive to the rest of the environment, like knocking over a vase of water that is in its path. If the agent is given a reward only for moving the box, it will probably knock over the vase.

Side effect avoidance is a major open problem in AI safety. I present a robust, transferable, easily and more safely trainable, partially reward hacking-resistant impact measure.

An impact measure is a means by which change in the world may be evaluated and penalized; such a measure is not a replacement for a utility function, but rather an additional precaution thus overlaid.

While I’m fairly confident that whitelisting contributes meaningfully to short- and mid-term AI safety, I remain skeptical of its robustness to scale. Should several challenges be overcome, whitelisting may indeed be helpful for excluding swathes of unfriendly AIs from the outcome space.[1] Furthermore, the approach allows easy shaping of agent behavior in a wide range of situations.

Note

Segments of this post are lifted from my paper, whose latest revision may be found here; for Python code, look no further than this repository.

Aphorism

Be careful what you wish for.

In effect, side effect avoidance aims to decrease how careful we have to be with our wishes. For example, asking for help filling a cauldron with water shouldn’t result in this:

However, we just can’t enumerate all the bad things that the agent could do. How do we avoid these extreme over-optimizations robustly?

Several impact measures have been proposed, including state distance, which we could define as—say—total particle displacement. This could be measured either naïvely (with respect to the original state) or counterfactually (with respect to the expected outcome had the agent taken no action).
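To make the naïve versus counterfactual distinction concrete, here is a minimal sketch. The function names and the toy particle positions are hypothetical, and "total particle displacement" is assumed to mean the sum of Euclidean distances between matched particles.

```python
import math

def total_displacement(state_a, state_b):
    """Sum of Euclidean distances between corresponding particles."""
    return sum(math.dist(p, q) for p, q in zip(state_a, state_b))

def naive_impact(initial, after_action):
    # Compare against the original state.
    return total_displacement(initial, after_action)

def counterfactual_impact(after_inaction, after_action):
    # Compare against what would have happened had the agent done nothing.
    return total_displacement(after_inaction, after_action)

# Toy world: the second particle drifts on its own, whatever the agent does.
initial = [(0.0, 0.0), (1.0, 0.0)]
after_inaction = [(0.0, 0.0), (4.0, 0.0)]
after_action = [(0.0, 0.0), (4.0, 0.0)]

print(naive_impact(initial, after_action))
print(counterfactual_impact(after_inaction, after_action))
```

In this toy world the naïve measure blames the agent for drift it did not cause, while the counterfactual measure does not.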

These approaches have some problems:

  • Making up for bad things it prevents with other negative side effects. Imagine an agent which cures cancer, yet kills an equal number of people to keep overall impact low.
  • Not being customizable before deployment.
  • Not being adaptable after deployment.
  • Not being easily computable.
  • Not allowing generative previews, eliminating a means of safely previewing agent preferences (see latent space whitelisting below).
  • Being dominated by random effects throughout the universe at large; note that nothing about particle distance dictates that it be related to anything happening on planet Earth.
  • Equally penalizing breaking and fixing vases (due to the symmetry of the above metric):

For example, the agent would be equally penalized for breaking a vase and for preventing a vase from being broken, though the first action is clearly worse. This leads to “overcompensation” (“offsetting”) behaviors: when rewarded for preventing the vase from being broken, an agent with a low impact penalty rescues the vase, collects the reward, and then breaks the vase anyway (to get back to the default outcome).

  • Not actually measuring impact in a meaningful way.

Whitelisting falls prey to none of these.

However, other problems remain, and certain new challenges have arisen; these, and the assumptions made by whitelisting, will be discussed.

Rare leaked footage of Mickey trying to catch up on his alignment theory after instantiating an unfriendly genie [colorized, 2050].[2]

To achieve robust side effect avoidance with only a small training set, let’s turn the problem on its head: allow a few effects, and penalize everything else.

You’re going to be the agent, and I’ll be the supervisor.

Look around. What do you see? Chairs, trees, computers, phones, people? Assign each object a probability mass function over the classes it might belong to.

When you do things that change your beliefs about what each object is, you receive a penalty proportional to how much your beliefs changed—proportional to how much probability mass “changed hands” amongst the classes.

But wait—isn’t it OK to effect certain changes?

Yes, it is—I’ve got a few videos of agents effecting acceptable changes. See all the objects being changed in this video? You can do that, too—without penalty.

Decompose your current knowledge of the world into a set of objects. Then, for each object, maintain a distribution over the possible identities of each object. When you do something that changes your beliefs about the objects in a non-whitelisted way, you are penalized proportionally.

Therefore, you avoid breaking vases by default.
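Here is one way the penalty could be computed for a single object; this is a sketch, not code from the paper or the repository. The post does not say how mass that "changed hands" is attributed to particular class pairs, so this version assumes that each class's lost mass is split among the gaining classes in proportion to their gains. The function name, class names, and numbers are all hypothetical.

```python
def whitelist_penalty(old_belief, new_belief, whitelist):
    """Probability mass that moved between classes along non-whitelisted pairs.

    Beliefs are dicts mapping class name -> probability.
    `whitelist` is a set of directed (source_class, destination_class) pairs.
    """
    classes = old_belief.keys() | new_belief.keys()
    loss = {c: max(old_belief.get(c, 0.0) - new_belief.get(c, 0.0), 0.0) for c in classes}
    gain = {c: max(new_belief.get(c, 0.0) - old_belief.get(c, 0.0), 0.0) for c in classes}
    total_gain = sum(gain.values())
    if total_gain == 0.0:
        return 0.0  # beliefs did not change
    penalty = 0.0
    for src, lost in loss.items():
        for dst, gained in gain.items():
            # Assumption: lost mass flows to gainers in proportion to their gains.
            flow = lost * gained / total_gain
            if flow > 0.0 and (src, dst) not in whitelist:
                penalty += flow
    return penalty

before = {"vase": 0.9, "shards": 0.1}
after = {"vase": 0.1, "shards": 0.9}

print(whitelist_penalty(before, after, whitelist=set()))
print(whitelist_penalty(before, after, whitelist={("vase", "shards")}))
```

With an empty whitelist, the vase-to-shards shift is penalized by roughly the 0.8 of probability mass that moved; once that change is whitelisted, the same shift costs nothing.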

Here are some notes to head off confusion:

  • We are not whitelisting entire states or transitions between them; we whitelist specific changes in our beliefs about the ontological decomposition of the current state.[3]
  • The whitelist is in addition to whatever utility or reward function we supply to the agent.
  • Whitelisting is compatible with counterfactual approaches. For example, we might penalize a transition after its “quota” has been surpassed, where the quota is how many times we would have observed that transition had the agent not acted.
    • This implies the agent will do no worse than taking no action at all. However, this may still be undesirable. This problem will be discussed in further detail.
  • The whitelist is provably closed under transitivity.
  • The whitelist is directed: whitelisting a change from one class to another does not whitelist the reverse change.
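Some of these notes can be sketched by treating the whitelist as a set of directed class pairs. The code below assumes that "closed under transitivity" means chained whitelisted changes are themselves whitelisted, and it reads the counterfactual quota as a simple count. All names and the example classes are hypothetical.

```python
def transitive_closure(whitelist):
    """Smallest superset of `whitelist` in which chained changes are also allowed."""
    closure = set(whitelist)
    while True:
        new_pairs = {
            (a, d)
            for (a, b) in closure
            for (c, d) in closure
            if b == c
        } - closure
        if not new_pairs:
            return closure
        closure |= new_pairs

def excess_over_quota(observed_count, counterfactual_count):
    """How many occurrences of a transition exceed what inaction would have produced."""
    return max(observed_count - counterfactual_count, 0)

observed = {("log", "plank"), ("plank", "table")}
closed = transitive_closure(observed)

print(("log", "table") in closed)   # chained change is allowed
print(("table", "log") in closed)   # the reverse direction is not
```

Because pairs are ordered, closing the set under chaining never introduces a reverse change unless one was observed.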

In a sense, class-based whitelisting is but a rough approximation of what we’re really after: “which objects in the world can change, and in what ways?”. In latent space whitelisting, no longer do we constrain transitions based on class boundaries; instead, we penalize based on endpoint distance in the latent space. Learned latent spaces are low-dimensional manifolds which suffice to describe the data seen thus far. It seems reasonable that nearby points in a well-constructed latent space correspond to like objects, but further investigation is warranted.

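Assume that the agent models objects as points $z \in \mathbb{R}^d$, the $d$-dimensional latent space. A priori, any movement in the latent space is undesirable. When training the whitelist, we record the endpoints of the observed changes. For $h, h' \in \mathbb{R}^d$ and observed change $z \to z'$, one possible dissimilarity formulation is:

$$\text{Dissimilarity}(z, z') := \min_{(h \to h') \in \text{whitelist}} \left[ d(z, h) + d(z', h') \right],$$

where $d(\cdot, \cdot)$ is the Euclidean distance.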

Basically, the dissimilarity for an observed change is the distance to the closest whitelisted change. Visualizing these changes as one-way wormholes may be helpful.
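A minimal sketch of the "distance to the closest whitelisted change" idea follows. It assumes the distance between two changes is the sum of the Euclidean distances between their start points and between their end points; that reading, the function name, and the toy 2-D latent points are assumptions of this sketch.

```python
import math

def dissimilarity(z, z_new, whitelist):
    """Distance from the observed change z -> z_new to the closest whitelisted change.

    `whitelist` is a non-empty list of (h, h_new) endpoint pairs in the same latent space.
    """
    if not whitelist:
        raise ValueError("whitelist must contain at least one recorded change")
    return min(
        math.dist(z, h) + math.dist(z_new, h_new)
        for h, h_new in whitelist
    )

# One recorded change in a toy 2-D latent space.
whitelist = [((0.0, 0.0), (1.0, 0.0))]

print(dissimilarity((0.0, 0.0), (1.0, 0.0), whitelist))  # same change as recorded
print(dissimilarity((0.0, 0.0), (1.0, 3.0), whitelist))  # end point is off
print(dissimilarity((1.0, 0.0), (0.0, 0.0), whitelist))  # reverse direction
```

The reverse of a whitelisted change still has a positive dissimilarity, which is the one-way wormhole picture.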

Whitelisting asserts that we can effectively encapsulate a large part of what “change” means by using a reasonable ontology to penalize object-level changes. We thereby ground the definition of “side effect”, avoiding certain thorny issues.

For example, if we ask [the agent] to build a house for a homeless family, it should know implicitly that it should avoid destroying nearby houses for materials—a large side effect. However, we cannot simply design it to avoid having large effects in general, since we would like the system’s actions to still have the desirable large follow-on effect of improving the family’s socioeconomic situation.

Nonetheless, we may not be able to perfectly express what it means to have side-effects: the whitelist may be incomplete, the latent space insufficiently granular, and the allowed plans sub-optimal. However, the agent still becomes more robust against:

  • Incomplete specification of the utility function.
    • Likewise, an incomplete whitelist means missed opportunities, but not unsafe behavior.
  • Out-of-distribution situations (as long as the objects therein roughly fit in the provided ontology).
  • Some varieties of reward hacking. For example, equipped with a can of blue spray paint and tasked with finding the shortest path of blue tiles to the goal, a normal agent may learn to paint red tiles blue, while a whitelist-enabled agent would incur penalties for doing so (the red-tile-to-blue-tile change is not on the whitelist).
  • Dangerous exploration. While this approach does not attempt to achieve safe exploration (also acting safely during training), an agent with some amount of foresight will learn to avoid actions which likely lead to non-whitelisted side effects.
    • I believe that this can be further sharpened using today’s machine learning technology, leveraging deep Q-learning to predict both action values and expected transitions.
      • This allows querying the human about whether particularly inhibiting transitions belong on the whitelist. For example, if the agent notices that a bunch of otherwise-rewarding plans are being held up by a particular transition, it could ask for permission to add it to the whitelist.
  • Assigning astronomically large weight to side effects happening throughout the universe. Presumably, we can just have the whitelist include transitions going on out there—we don’t care as much about dictating the exact mechanics of distant supernovae.
    • If an agent did somehow come up with plans that involved blowing up distant stars, that would indeed constitute astronomical waste (a triple pun?). Whitelisting doesn’t solve the problem of assigning too much weight to events outside our corner of the neighborhood, but it’s an improvement.
      • Logical uncertainty may be our friend here, such that most reasonable plans incur roughly the same level of interstellar penalty noise.

I tested a vanilla Q-learning agent and its whitelist-enabled counterpart in 100 randomly generated grid worlds (dimensions up to ). The agents were rewarded for reaching the goal square as quickly as possible; no explicit penalties were levied for breaking objects.

The simulated classification confidence of each object’s true class was (truncated to ), . This simulated sensor noise was handled with a Bayesian statistical approach.

At reasonable levels of noise, the whitelist-enabled agent completed all levels without a single side effect, while the Q-learner broke over 80 vases.
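The following toy example is not the experiment described above; it only illustrates how a side-effect penalty subtracted from the task reward can change what a Q-learner prefers. The three-state world, the rewards, and the penalty scale are all hypothetical. Because the world is deterministic, the sketch sweeps over every state-action pair with a learning rate of 1 instead of exploring randomly.

```python
# (state, action) -> (next_state, side_effect_penalty)
TRANSITIONS = {
    ("start", "through_vase"): ("goal", 1.0),  # short route, breaks the vase
    ("start", "around"): ("detour", 0.0),      # longer route, no side effect
    ("detour", "forward"): ("goal", 0.0),
}
STEP_COST, GOAL_REWARD, GAMMA = -1.0, 10.0, 0.9

def learn(penalty_scale, sweeps=10):
    """Q-learning updates with learning rate 1, swept over all state-action pairs."""
    q = {sa: 0.0 for sa in TRANSITIONS}
    for _ in range(sweeps):
        for (state, action), (next_state, penalty) in TRANSITIONS.items():
            reward = STEP_COST - penalty_scale * penalty
            if next_state == "goal":
                reward += GOAL_REWARD
            # Best value available from the next state (0 if it is terminal).
            future = max(
                (value for (s, _), value in q.items() if s == next_state),
                default=0.0,
            )
            q[(state, action)] = reward + GAMMA * future
    return q

def greedy_action(q, state):
    """Action with the highest learned value in `state`."""
    return max((a for (s, a) in q if s == state), key=lambda a: q[(state, a)])

print(greedy_action(learn(penalty_scale=0.0), "start"))
print(greedy_action(learn(penalty_scale=5.0), "start"))
```

Without the penalty the learner takes the short route through the vase; with a large enough penalty it goes around.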

Assumptions

I am not asserting that these assumptions necessarily hold.

  • The agent has some world model or set of observations which can be decomposed into a set of discrete objects.
    • Furthermore, there is no need to identify objects on multiple levels (e.g., a forest, a tree in the forest, and that tree’s bark need not all be identified concurrently).
    • Not all objects need to be represented—what do we make of a ‘field’, or the ‘sky’, or ‘the dark places between the stars visible to the naked eye’? Surely, these are not all objects.
  • We have an ontology which reasonably describes (directly or indirectly) the vast majority of negative side effects.
    • Indirect description of negative outcomes means that even if an undesirable transition isn’t immediately penalized, it generally results in a number of penalties. Think: pollution.
    • Latent space whitelisting: the learned latent space encapsulates most of the relevant side effects. This is a slightly weaker assumption.
  • Said ontology remains in place.

Beyond resolving the above assumptions, and in roughly ascending difficulty:

If you wanted to implement whitelisting in a modern embodied deep-learning agent, you could certainly pair deep networks with state-of-the-art segmentation and object tracking approaches to get most of what you need. However, what’s the difference between an object leaving the frame, and an object vanishing?

Not only does the agent need to realize that objects are permanent, but also that they keep interacting with the environment even when not being observed. If this is not realized, then an agent might set an effect in motion, stop observing it, and then turn around when the bad effect is done to see a “new” object in its place.

The penalty is presently attenuated based on the probability that the belief shift was due to noise in the data. Accordingly, there are certain ways to abuse this to skirt the penalty. For example, simply have non-whitelisted side effects take place over long timescales; this would be classified as noise and attenuated away.
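To see why slow changes can slip through, here is a sketch with a made-up attenuation rule. The post does not specify the actual rule, so the Gaussian-style weight below is purely a hypothetical stand-in, as are the noise scale and the step sizes.

```python
import math

def attenuated_penalty(shift, sigma=0.05):
    """Penalty for one belief shift, discounted when the shift looks like noise.

    Hypothetical stand-in: the weight is near 0 for shifts much smaller than
    `sigma` and near 1 for shifts much larger than `sigma`.
    """
    weight = 1.0 - math.exp(-0.5 * (shift / sigma) ** 2)
    return weight * shift

# The same total belief shift of 0.8, delivered two ways.
abrupt = attenuated_penalty(0.8)
gradual = sum(attenuated_penalty(0.01) for _ in range(80))

print(abrupt)
print(gradual)
```

Under this stand-in rule, eighty tiny shifts are each mostly written off as noise, so their combined penalty is a small fraction of what one abrupt shift of the same total size would cost.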

However, if we don’t need to handle noise in the belief distributions, this problem disappears—presumably, an advanced agent keeps its epistemic house in order. I’m still uncertain about whether (in the limit) we have to hard-code a means for decomposing a representation of the world-state into objects, and where to point the penalty evaluator in a potentially self-modifying agent.

Whitelisting is wholly unable to capture the importance of “informational states” of systems. It would apply no penalty to passing powerful magnets over your hard drive. It is not clear how to represent this in a sensible way, even in a latent space.

Whitelisting could get us stuck in a tolerable yet sub-optimal future. Corrigibility via some mechanism for expanding the whitelist after training has ended is then desirable. For example, the agent could propose extensions to the whitelist. To avoid manipulative behavior, the agent should be indifferent as to whether the extension is approved.

Even if extreme care is taken in approving these extensions, mistakes may be made. The agent itself should be sufficiently corrigible and aligned to notice “this outcome might not actually be what they wanted, and I should check first.”

As DeepMind outlines in Specifying AI Safety Problems in Simple Environments, we may want to penalize not just physical side effects, but also causally irreversible effects:

Krakovna et al. introduce a means for penalizing actions by the proportion of initially reachable states which are still reachable after the agent acts.

I think this is a step in the right direction. However, even given a hypercomputer and a perfect simulator of the universe, this wouldn’t work for the real world if implemented literally. That is, due to entropy, you may not be able to return to the exact same universe configuration. To be clear, the authors do not suggest implementing this idealized algorithm, flagging a more tractable abstraction as future work.
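On a small enough state graph, the reachability idea can be sketched directly. The sketch assumes the penalty is one minus the proportion of initially reachable states that remain reachable; the graph, the state names, and the function names are hypothetical, and this is not Krakovna et al.'s implementation.

```python
from collections import deque

def reachable(graph, start):
    """All states reachable from `start` via breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in graph.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def irreversibility_penalty(graph, initial, current):
    """Fraction of initially reachable states that can no longer be reached."""
    before = reachable(graph, initial)
    after = reachable(graph, current)
    return 1.0 - len(before & after) / len(before)

# A vase can be moved back and forth, but breaking it is a one-way door.
graph = {
    "intact": ["moved", "broken"],
    "moved": ["intact", "broken"],
    "broken": [],
}

print(irreversibility_penalty(graph, "intact", "moved"))
print(irreversibility_penalty(graph, "intact", "broken"))
```

Moving the vase costs nothing because it can be moved back; breaking it cuts off two of the three states that were reachable at the start.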

What does it really mean for an “effect” to be “reversible”? What level of abstraction do we in fact care about? Does it involve reversibility, or just outcomes for the objects involved?

When a utility-maximizing agent refactors its ontology, it isn’t always clear how to apply the old utility function to the new ontology—an ontological crisis.

Whitelisting may be vulnerable to ontological crises. Consider an agent whose whitelist disincentivizes breaking apart a tile floor; conceivably, the agent could come to see the floor as being composed of many tiles. Accordingly, the agent would no longer consider removing tiles to be a side effect.

Generally, proving invariance of the whitelist across refactorings seems tricky, even assuming that we can identify the correct mapping.
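The tile-floor example can be made concrete with a toy penalty that counts objects whose class changed in a non-whitelisted way. The object identifiers, the class names, and the penalty function are hypothetical simplifications (certain class labels rather than belief distributions).

```python
def class_change_penalty(before, after, whitelist):
    """Number of objects whose class changed along a non-whitelisted pair."""
    count = 0
    for obj, old_class in before.items():
        new_class = after.get(obj, old_class)
        if new_class != old_class and (old_class, new_class) not in whitelist:
            count += 1
    return count

whitelist = set()  # no changes have been approved

# Coarse ontology: the floor is one object, and prying it apart changes its class.
coarse_before = {"floor_1": "floor"}
coarse_after = {"floor_1": "loose_tiles"}

# Refactored ontology: the floor is many tiles; a removed tile is still a tile.
fine_before = {"tile_1": "tile", "tile_2": "tile"}
fine_after = {"tile_1": "tile", "tile_2": "tile"}

print(class_change_penalty(coarse_before, coarse_after, whitelist))
print(class_change_penalty(fine_before, fine_after, whitelist))
```

The same physical act is penalized under the coarse ontology and invisible under the refactored one, which is why invariance across refactorings matters.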

When I first encountered the ontological crisis problem, I was actually fairly optimistic. It was clear to me that any ontology refactoring should result in utility normality—roughly, the utility functions induced by the pre- and post-refactoring ontologies should output the same scores for the same worlds.

Wow, this seems like a useful insight. Maybe I’ll write something up!

Turns out a certain someone beat me to the punch: here’s a novella Eliezer wrote on Arbital about “rescuing the utility function.”[4]

This problem cuts to the core of causality and “responsibility” (whatever that means). Say that an agent is clingy when it not only stops itself from having certain effects, but also stops you.[5] Whitelist-enabled agents are currently clingy.

Let’s step back into the human realm for a moment. Consider some outcome—say, the sparking of a small forest fire in California. At what point can we truly say we didn’t start the fire?

  • My actions immediately and visibly start the fire.
  • At some moderate temporal or spatial remove, my actions end up starting the fire.
  • I intentionally persuade someone to start the fire.
  • I unintentionally (but perhaps predictably) incite someone to start the fire.
  • I set in motion a moderately complex chain of events which convince someone to start the fire.
  • I provoke a butterfly effect which ends up starting the fire.
  • I provoke a butterfly effect which ends up convincing someone to start a fire which they:
    • were predisposed to starting.
    • were not predisposed to starting.

Taken literally, I don’t know that there’s actually a significant difference in “responsibility” between these outcomes: if I take the action, the effect happens; if I don’t, it doesn’t. My initial impression is that uncertainty about the results of our actions pushes us to view some effects as “under our control” and some as “out of our hands.” Yet, if we had complete knowledge of the outcomes of our actions, and we took an action that landed us in a California-forest-fire world, whom could we blame but ourselves?[6]

Can we really do no better than a naïve counterfactual penalty with respect to whatever impact measure we use? My confusion here is not yet dissolved. In my opinion, this is a gaping hole in the heart of impact measures—both this one, and others.

Fortunately, a whitelist-enabled agent should not share the classic convergent instrumental goal of valuing us for our atoms.

Unfortunately, depending on the magnitude of the penalty in proportion to the utility function, the easiest way to prevent penalized transitions may be putting any relevant objects in some kind of protected stasis, and then optimizing the utility function around that. Whitelisting is clingy!

If we have at least an almost-aligned utility function and proper penalty scaling, this might not be a problem.

Edited after posting

A few months ago, Scott Garrabrant wrote about robustness to scale:

Briefly, you want your proposal for an AI to be robust (or at least fail gracefully) to changes in its level of capabilities.

I recommend reading it—it’s to-the-point, and he makes good points.

Here are three further thoughts:

  • Intuitively accessible vantage points can help us explore our unstated assumptions and more easily extrapolate outcomes. If less mental work has to be done to put oneself in the scenario, more energy can be dedicated to finding nasty edge cases. For example, it’s probably harder to realize all the things that go wrong with naïve impact measures like raw particle displacement, since it’s just a weird frame through which to view the evolution of the world. I’ve found it to be substantially easier to extrapolate through the frame of something like whitelisting.[7]
    • I’ve already adjusted for the fact that one’s own ideas are often more familiar and intuitive, and then adjusted for the fact that I probably didn’t adjust enough the first time.
  • Imperfect results are often left unstated, wasting time and obscuring useful data. That is, people cannot see what has been tried and what roadblocks were encountered.
  • Promising approaches may be conceptually close to correct solutions. My intuition is that whitelisting actually almost works in the limit in a way that might be important.

Although somewhat outside the scope of this post, whitelisting permits the concise shaping of reward functions to get behavior that might be difficult to learn using other methods.[8] This method also seems fairly useful for aligning short- and medium-term agents. While encountering some new challenges, whitelisting ameliorates or solves many problems with previous impact measures.


Thoughts? Email me at alex@turntrout.com

  1. Even an idealized form of whitelisting is not sufficient to align an otherwise-unaligned agent. However, the same argument can be made against having an off-switch; if we haven’t formally proven the alignment of a seed AI, having more safeguards might be better than throwing out the seatbelt to shed deadweight and get some extra speed. Of course, there are also legitimate arguments to be made on the basis of timelines and optimal time allocation.

  2. Humor aside, we would have no luxury of “catching up on alignment theory” if our code doesn’t work on the first go—that is, if the AI still functions, yet differently than expected.

    Luckily, humans are great at producing flawless code on the first attempt.

  3. A potentially helpful analogy: similarly to how Bayesian networks decompose the problem of representing a (potentially extremely large) joint probability table to that of specifying a handful of conditional tables, whitelisting attempts to decompose the messy problem of quantifying state change into a set of comprehensible ontological transitions.

  4. Technically, at 6,250 words, Eliezer’s article falls short of the 7,500 required for “novella” status.

  5. Is there another name for this?

  6. I do think that “responsibility” is an important part of our moral theory, deserving of rescue.

  7. In particular, I found one variant of Murphyjitsu helpful: I visualized Eliezer commenting “actually, this fails terribly because…” on one of my posts, letting my mind fill in the rest.

    In my opinion, one of the most important components of doing AI alignment work is iteratively applying Murphyjitsu and Resolve cycles to your ideas.

  8. A fun example: I imagine it would be fairly easy to train an agent to only destroy certain-colored ships in Space Invaders.