Acknowledgments

After ~700 hours of work over the course of ~9 months, the sequence is finally complete.
This work was made possible by the Center for Human-Compatible AI, the Berkeley Existential Risk Initiative, and the Long-Term Future Fund. Deep thanks to Rohin Shah, Abram Demski, Logan Smith, Evan Hubinger, TheMajor, Chase Denecke, Victoria Krakovna, Alper Dumanli, Cody Wild, Matthew Barnett, Daniel Blank, Sara Haxhia, Connor Flexman, Zack M. Davis, Jasmine Wang, Matthew Olson, Rob Bensinger, William Ellsworth, Davide Zagami, Ben Pace, and a million other people for giving feedback on this sequence.
I’ve made many claims in these posts. All views are my own.
| Statement | Credence |
|---|---|
| There exists a simple closed-form solution to catastrophe avoidance (in the outer alignment sense). | 25% |
| For the superhuman case, penalizing the agent for increasing its own Attainable Utility (AU) is better than penalizing the agent for increasing other AUs. | 65% |
| Some version of Attainable Utility Preservation solves side effect problems for an extremely wide class of real-world tasks and for subhuman agents. | 65% |
| The catastrophic convergence conjecture is true. That is, unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives. | 70%¹ |
| Agents trained by powerful RL algorithms on arbitrary reward signals generally try to take over the world. | 75%² |
| AUP_conceptual prevents catastrophe, assuming the catastrophic convergence conjecture. | 85% |
| Attainable Utility theory describes how people feel impacted. | 95% |
Note: The LessWrong version of this post contained probability estimates from other users.
The big art pieces (and especially the last illustration in this post) were designed to convey a specific meaning, the interpretation of which I leave to the reader.
There are a few pop culture references which I think are obvious enough to not need pointing out, and a lot of hidden smaller playfulness which doesn’t quite rise to the level of “easter egg.”
- Reframing Impact
    - The bird’s nest contains a literal easter egg.
    - The paperclip-Balrog drawing contains a Tengwar inscription which reads “one measure to bind them”, with “measure” in impact-blue and “them” in utility-pink.
    - “Towards a New Impact Measure” was the title of the post in which AUP was introduced.
- Attainable Utility Theory: Why Things Matter
    - This style of maze is from the video game Undertale.
- Seeking Power is Instrumentally Convergent in MDPs
    - To seek power, Frank is trying to get at the Infinity Gauntlet.
- The tale of Frank and the orange Pebblehoarder
    - Speaking of under-tales, a friendship has been blossoming right under our noses:
1. There seems to be a dichotomy between “catastrophe directly incentivized by the goal” and “catastrophe indirectly incentivized by the goal through power-seeking”, although Vika provides intuitions in the other direction.
2. The theorems on power-seeking only apply to optimal policies in fully observable environments, which isn’t realistic for real-world agents. However, I think they’re still informative. There are also strong intuitive arguments for power-seeking.