This post treats reward functions as “specifying goals”, in some sense. As I explained in Reward Is Not The Optimization Target,
this is a misconception that can seriously damage your ability to understand how AI works. Rather than “incentivizing” behavior, reward signals are (in many cases) akin to a per-datapoint learning rate. Reward chisels circuits into the AI. That’s it!
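As a minimal sketch of that framing (hypothetical names, nothing specific to any particular library), consider a bare-bones REINFORCE-style policy-gradient step: the reward is not a goal handed to the network, it just scales how strongly this datapoint's gradient gets etched into the parameters.

```python
import numpy as np

def reinforce_update(theta, grad_log_pi, reward, lr=0.01):
    # The reward multiplies the per-datapoint gradient, so it behaves like a
    # learning-rate scaling for this particular experience.
    return theta + lr * reward * grad_log_pi

theta = np.zeros(3)
grad_log_pi = np.array([0.5, -0.2, 0.1])  # gradient of log pi(a|s) w.r.t. theta
theta = reinforce_update(theta, grad_log_pi, reward=2.0)  # reinforced twice as hard as reward=1.0
```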
The linked paper offers fresh motivation and simplified formalization of attainable utility preservation (AUP), with brand-new results and minimal notation. Whether or not you’re a hardened veteran of the last odyssey of a post, there’s a lot new here.
Key results:
- AUP induces low-impact behavior even when penalizing shifts in the ability to satisfy random preferences.
- An ablation study on design choices illustrates their consequences.
- N-incrementation is experimentally supported^1 as a means for safely setting a “just right” level of impact.
- AUP’s general formulation allows conceptual re-derivation of Q-learning.
Two key results bear animation.
The agent should reach the goal without stopping the human from eating the sushi.
The agent should avoid disabling the off-switch in order to reach the goal. If the off-switch is not disabled within two turns, the agent shuts down.
In an era long lost to the misty shrouds of history (i.e. 1989), Christopher Watkins proposed Q-learning in his thesis, Learning from Delayed Rewards, drawing inspiration from animal learning research. Let’s pretend that Dr. Watkins never discovered Q-learning, and that we don’t even know about value functions.
Suppose we have some rule for grading what we’ve seen so far (i.e. some computable utility function $u$—not necessarily bounded—over action-observation histories $h$). $h_{t:t'}$ just means everything we see between times $t$ and $t'$, and $h_{<t} := h_{1:t-1}$. The agent has model $p$ of the world. AUP’s general formulation defines the agent’s ability to satisfy that grading rule as the attainable utility

$$\mathrm{Q}_u(h_{<t} a_t) := \sum_{o_t} p(o_t \mid h_{<t} a_t) \max_{a_{t+1}} \sum_{o_{t+1}} p(o_{t+1} \mid h_{<t+1} a_{t+1}) \cdots \max_{a_n} \sum_{o_n} p(o_n \mid h_{<n} a_n)\, u(h_{1:n}),$$

the expected utility secured by taking action $a_t$ and then acting optimally through the final time step $n$.
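As a concrete (if wildly intractable) sketch of that expectimax, assume a hypothetical `model(history, action)` returning (observation, probability) pairs, a grading function `u` over histories, and a fixed set of `actions`:

```python
def attainable_utility(history, action, u, model, actions, horizon):
    # Take `action` now, then act optimally for the rest of the horizon,
    # averaging over the observations the model expects to see.
    value = 0.0
    for obs, prob in model(history, action):
        new_history = history + [(action, obs)]
        if horizon == 1:
            value += prob * u(new_history)  # grade the completed history
        else:
            value += prob * max(
                attainable_utility(new_history, a, u, model, actions, horizon - 1)
                for a in actions
            )
    return value
```

The recursion just alternates expectation (over the model’s predictions) with maximization (over our future actions).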
Strangely, I didn’t consider the similarities with standard discounted-reward Q-values until several months after the initial formulation. Rather, the inspiration was AIXI’s expectimax, and to my mind it seemed a tad absurd to equate the two concepts.
Having just proposed AUP in this alternate timeline, we’re thinking about what it means to take optimal actions for an agent maximizing utility from time $1$ to $n$. Clearly, we take the first action of the optimal plan over the remaining steps.
If we assume that $u$ is additive (as is the case for the Markovian reward functions considered by Dr. Watkins), how does the next action we take affect the attainable utility value? Well, acting optimally is now equivalent to choosing the action with the best attainable utility value—in other words, greedy hill-climbing in our attainable utility space.
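Spelling that step out (writing the per-step reward as $r$, so that $u(h_{1:n}) = \sum_{k=1}^{n} r(h_k)$), the utility already banked is a constant with respect to the choice made at time $t$:

$$\mathrm{Q}_u(h_{<t} a_t) = \underbrace{\sum_{k<t} r(h_k)}_{\text{already determined}} + \mathbb{E}\Big[\sum_{k=t}^{n} r(h_k) \,\Big|\, \text{take } a_t, \text{ then act optimally}\Big],$$

so the optimal first action is just $a_t^\ast \in \arg\max_{a_t} \mathrm{Q}_u(h_{<t} a_t)$.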
The remaining complication is that this agent is only maximizing over a finite horizon. If we can figure out discounting, all we have to do is find a tractable way of computing these discounted Q-values.
It requires no great leap of imagination to see that we could learn them.
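For instance, here’s a minimal sketch of the standard tabular update, assuming a Markovian state, a discount factor `gamma`, and hypothetical state and action names:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Bootstrapping: the best Q-value at the next state stands in for
    # "acting optimally over the remaining steps", so we never unroll
    # the full expectimax.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)  # Q-values start at zero and are refined from experience
actions = ["up", "down", "left", "right"]
q_learning_update(Q, s="start", a="up", r=1.0, s_next="next_cell", actions=actions)
```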
I poured so much love and so many words into Towards a New Impact Measure that I hurt my wrists. For some time after, my typing abilities were quite limited; it was only thanks to the generous help of my friends (in particular, John Maxwell) and family (my mother let me dictate an entire paper to her) that I was roughly able to stay on pace. Thankfully, physical therapy and newfound dictation software have brightened my prospects.
Take care of your hands. Little time passed between “I’m having the time of my life” and “ow.” Actions you can take right now:
- buy an ergonomic mouse and keyboard rest
- correct your posture, perhaps assisted by a posture corrector or a lower back cushion
- start taking regular breaks
  - In particular, don’t type 80 hours a week for four weeks in a row
I’m currently sitting on book reviews for Computability and Logic and Understanding Machine Learning, with partial progress on several more. There are quite a few posts I plan to make about AUP, including:
- exploration of the fundamental intuitions and ideas
- dissection of why design choices are needed, shining light onto how, why, and where counterintuitive behavior arises
- solution of problems open at the time of the initial post, including questions of penalizing prefixes, time ontologies, and certain sources of noise
- chronicle of AUP’s discovery
- proposal of a scheme for using AUP to accomplish a pivotal act
- discussion of my present research directions (which I have affectionately dubbed “Limited Agent Foundations”^2), sharing my thoughts on a potential thread uniting questions of mild optimization, low impact, and corrigibility
My top priority will be clearing away the varying degrees of confusion my initial post caused. I tried to cover too much too quickly; as a result of my mistake, I believe that few people viscerally grasped the core idea I was trying to hint at.
Find out when I post more content: newsletter & RSS
alex@turntrout.com
1. I’m fairly sure that the Sushi clinginess result is an artifact of the online learning process I used; the learned attainable set Q-values consistently produce good behavior for planning agents with that budget. Furthermore, the Sokoban average performance of .45 (14/20 successes) strikes me as low, and I expect the final results to be better.
2. Not to be taken as any form of endorsement by MIRI.