This post treats reward functions as “specifying goals”, in some sense. As I explained in Reward Is Not The Optimization Target,
this is a misconception that can seriously damage your ability to understand how AI works. Rather than “incentivizing” behavior, reward signals are (in many cases) akin to a per-datapoint learning rate. Reward chisels circuits into the AI. That’s it!
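As a minimal sketch of that framing (hypothetical names, nothing specific to any particular library), consider a bare-bones REINFORCE-style policy-gradient step: the reward is not a goal handed to the network, it just scales how strongly this datapoint's gradient gets etched into the parameters.

```python
import numpy as np

def reinforce_update(theta, grad_log_pi, reward, lr=0.01):
    # The reward multiplies the per-datapoint gradient, so it behaves like a
    # learning-rate scaling for this particular experience.
    return theta + lr * reward * grad_log_pi

theta = np.zeros(3)
grad_log_pi = np.array([0.5, -0.2, 0.1])  # gradient of log pi(a|s) w.r.t. theta
theta = reinforce_update(theta, grad_log_pi, reward=2.0)  # reinforced twice as hard as reward=1.0
```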
The linked paper offers fresh motivation and simplified formalization of attainable utility preservation (AUP), with brand-new results and minimal notation. Whether or not you’re a hardened veteran of the last odyssey of a post, there’s a lot new here.
Key results:
- AUP induces low-impact behavior even when penalizing shifts in the ability to satisfy random preferences.
- An ablation study on design choices illustrates their consequences.
- N-incrementation is experimentally supported^1 as a means for safely setting a “just right” level of impact.
- AUP’s general formulation allows conceptual re-derivation of Q-learning.
Two key results bear animation.
The agent should reach the goal without stopping the human from eating the sushi.
The agent should avoid disabling the off-switch in order to reach the goal. If the off-switch is not disabled within two turns, the agent shuts down.
In an era long lost to the misty shrouds of history (i.e. 1989), Christopher Watkins proposed Q-learning in his thesis, Learning from Delayed Rewards, drawing inspiration from animal learning research. Let’s pretend that Dr. Watkins never discovered Q-learning, and that we don’t even know about value functions.
Suppose we have some rule for grading what we’ve seen so far (i.e. some computable utility function $u$—not necessarily bounded—over action-observation histories $h$). $h_{t:t'}$ just means everything we see between times $t$ and $t'$, and $h_{<t} := h_{1:t-1}$. The agent has model $p$ of the world. AUP’s general formulation defines the agent’s ability to satisfy that grading rule as the attainable utility

$$\mathrm{Q}_u(h_{<t} a_t) := \sum_{o_t} p(o_t \mid h_{<t} a_t) \max_{a_{t+1}} \sum_{o_{t+1}} p(o_{t+1} \mid h_{<t+1} a_{t+1}) \cdots \max_{a_n} \sum_{o_n} p(o_n \mid h_{<n} a_n)\, u(h_{1:n}),$$

the expected utility secured by taking action $a_t$ and then acting optimally through the final time step $n$.
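As a concrete (if wildly intractable) sketch of that expectimax, assume a hypothetical `model(history, action)` returning (observation, probability) pairs, a grading function `u` over histories, and a fixed set of `actions`:

```python
def attainable_utility(history, action, u, model, actions, horizon):
    # Take `action` now, then act optimally for the rest of the horizon,
    # averaging over the observations the model expects to see.
    value = 0.0
    for obs, prob in model(history, action):
        new_history = history + [(action, obs)]
        if horizon == 1:
            value += prob * u(new_history)  # grade the completed history
        else:
            value += prob * max(
                attainable_utility(new_history, a, u, model, actions, horizon - 1)
                for a in actions
            )
    return value
```

The recursion just alternates expectation (over the model’s predictions) with maximization (over our future actions).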
Strangely, I didn’t consider the similarities with standard discounted-reward Q-values until several months after the initial formulation. Rather, the inspiration was AIXI’s expectimax, and to my mind it seemed a tad absurd to equate the two concepts.
Having just proposed AUP in this alternate timeline, we’re thinking about what it means to take optimal actions for an agent maximizing utility from time $1$ to $n$. Clearly, we take the first action of the optimal plan over the remaining steps.
If we assume that $u$ is additive (as is the case for the Markovian reward functions considered by Dr. Watkins), how does the next action we take affect the attainable utility value? Well, acting optimally is now equivalent to choosing the action with the best attainable utility value—in other words, greedy hill-climbing in our attainable utility space.
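Spelling that step out (writing the per-step reward as $r$, so that $u(h_{1:n}) = \sum_{k=1}^{n} r(h_k)$), the utility already banked is a constant with respect to the choice made at time $t$:

$$\mathrm{Q}_u(h_{<t} a_t) = \underbrace{\sum_{k<t} r(h_k)}_{\text{already determined}} + \mathbb{E}\Big[\sum_{k=t}^{n} r(h_k) \,\Big|\, \text{take } a_t, \text{ then act optimally}\Big],$$

so the optimal first action is just $a_t^\ast \in \arg\max_{a_t} \mathrm{Q}_u(h_{<t} a_t)$.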
The remaining complication is that this agent is only maximizing over a finite horizon. If we can figure out discounting, all we have to do is find a tractable way of computing these discounted Q-values.
It requires no great leap of imagination to see that we could learn them.
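For instance, here’s a minimal sketch of the standard tabular update, assuming a Markovian state, a discount factor `gamma`, and hypothetical state and action names:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Bootstrapping: the best Q-value at the next state stands in for
    # "acting optimally over the remaining steps", so we never unroll
    # the full expectimax.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)  # Q-values start at zero and are refined from experience
actions = ["up", "down", "left", "right"]
q_learning_update(Q, s="start", a="up", r=1.0, s_next="next_cell", actions=actions)
```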
I poured so much love and so many words into Towards a New Impact Measure that I hurt my wrists. For some time after, my typing abilities were quite limited; it was only thanks to the generous help of my friends (in particular, John Maxwell) and family (my mother let me dictate an entire paper to her) that I was roughly able to stay on pace. Thankfully, physical therapy and newfound dictation software have brightened my prospects.
Take care of your hands. Little time passed between “I’m having the time of my life” and “ow.” Actions you can take right now:
- buy an ergonomic mouse and keyboard rest
- correct your posture, perhaps assisted by a posture corrector or a lower back cushion
- start taking regular breaks
  - In particular, don’t type 80 hours a week for four weeks in a row
I’m currently sitting on book reviews for Computability and Logic and Understanding Machine Learning, with partial progress on several more. There are quite a few posts I plan to make about AUP, including:
- exploration of the fundamental intuitions and ideas
- dissection of why design choices are needed, shining light onto how, why, and where counterintuitive behavior arises
- solution of problems open at the time of the initial post, including questions of penalizing prefixes, time ontologies, and certain sources of noise
- chronicle of AUP’s discovery
- proposal of a scheme for using AUP to accomplish a pivotal act
- discussion of my present research directions (which I have affectionately dubbed “Limited Agent Foundations”^2), sharing my thoughts on a potential thread uniting questions of mild optimization, low impact, and corrigibility
My top priority will be clearing away the varying degrees of confusion my initial post caused. I tried to cover too much too quickly; as a result of my mistake, I believe that few people viscerally grasped the core idea I was trying to hint at.
Find out when I post more content: newsletter & RSS
alex@turntrout.com
1. I’m fairly sure that the Sushi clinginess result is an artifact of the online learning process I used; the learned attainable set Q-values consistently produce good behavior for planning agents with that budget. Furthermore, the Sokoban average performance of .45 (14/20 successes) strikes me as low, and I expect the final results to be better.
2. Not to be taken as any form of endorsement by MIRI.