Table of contents
- Looking back on my alignment PhD
- The shard theory of human values
- Bruce Wayne and the cost of inaction
- Formalizing “deception” using game theory
- You should read “Harry Potter and the Methods of Rationality”
Originally, my content was hosted on LessWrong. Much of that content was meant to be consumed as part of a series—or “sequence”—of blog posts.
In early 2018, I became convinced that the AI alignment problem needed to be solved. Who, though, would solve it?
I didn’t remember much formal math or computer science, but I wanted to give my all anyway. I started reading textbooks. I started reading a lot of textbooks.
Original sequence description: You can never have enough books.
My journey through the MIRI research guide.
- Set Up for Success: Insights from “Naïve Set Theory”
- Lightness and Unease
- The Art of the Artificial: Insights from “Artificial Intelligence: A Modern Approach”
- The First Rung: Insights from “Linear Algebra Done Right”
- Internalizing Internal Double Crux
- Confounded No Longer: Insights from “All of Statistics”
- Into the Kiln: Insights from Tao’s “Analysis I”
- Swimming Upstream: A Case Study in Instrumental Rationality
- Making a Difference Tempore: Insights from “Reinforcement Learning: An Introduction”
- Turning Up the Heat: Insights from Tao’s “Analysis II”
- And My Axiom! Insights from “Computability and Logic”
- Judgment Day: Insights from “Judgment in Managerial Decision Making”
- Continuous Improvement: Insights from “Topology”
- A Kernel of Truth: Insights from “A Friendly Approach to Functional Analysis”
- Problem Relaxation as a Tactic
- Insights from Euclid’s “Elements”
- Insights from “Modern Principles of Economics”
- Do a Cost-Benefit Analysis of Your Technology Usage
- Looking Back on my Alignment PhD
Original sequence description: Why do some things seem like really big deals to us? Do most agents best achieve their goals by seeking power? How might we avert catastrophic incentives in the utility maximization framework?
Introductory post: Reframing Impact
- Value Impact
- Deducing Impact
- Attainable Utility Theory: Why Things Matter
- World State is the Wrong Abstraction for Impact
- The Gears of Impact
- Seeking Power is Often Convergently Instrumental in MDPs
- Attainable Utility Landscape: How The World Is Changed
- The Catastrophic Convergence Conjecture
- Attainable Utility Preservation: Concepts
- Attainable Utility Preservation: Empirical Results
- How Low Should Fruit Hang Before We Pick It?
- Attainable Utility Preservation: Scaling to Superhuman
- Reasons for Excitement about Impact of Impact Measure Research
- Conclusion to “Reframing Impact”
This sequence generalizes the math of Seeking Power is Often Convergently Instrumental in MDPs, following up on that post and The Catastrophic Convergence Conjecture.
Original sequence description: Instrumental convergence posits that smart goal-directed agents will tend to take certain actions (e.g. gain resources, stay alive) in order to achieve their goals. These actions seem to involve taking power from humans. Human disempowerment seems like a key part of how AI might go very, very wrong.
But where does instrumental convergence come from? When does it occur, and how strongly? And what does the math look like?
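Roughly, the central quantity is a state's POWER: the agent's normalized average optimal value over a distribution $\mathcal{D}$ of reward functions. (This is a paraphrase; see the posts for the precise statements and conditions.)

$$
\operatorname{POWER}_{\mathcal{D}}(s, \gamma) := \frac{1-\gamma}{\gamma}\, \mathbb{E}_{R \sim \mathcal{D}}\!\left[ V^{*}_{R}(s, \gamma) - R(s) \right]
$$

The results in this sequence then characterize when, for most reward functions drawn from $\mathcal{D}$, optimal (and more generally, retargetable) policies tend to navigate toward high-POWER states.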
Many posts in this sequence treat reward functions as “specifying goals”, in some sense. This is wrong, as I have argued at length. Reward signals are akin to a per-datapoint learning rate. Reward chisels circuits into the AI. That’s it!
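To make that concrete, here is a toy sketch (illustrative Python, not any particular training setup) of a REINFORCE-style update. The scalar reward enters only as a multiplier on that datapoint's gradient step, just like a per-datapoint learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy softmax policy over 2 actions in a single state.
logits = np.zeros(2)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

BASE_LR = 0.1

for step in range(1000):
    probs = softmax(logits)
    action = rng.choice(2, p=probs)
    reward = 1.0 if action == 1 else 0.0  # hypothetical reward signal

    # Gradient of log pi(action) with respect to the logits.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0

    # The reward only scales this datapoint's update -- it acts like a
    # per-datapoint learning rate, reinforcing (chiseling in) whatever
    # computation produced the sampled action.
    logits += BASE_LR * reward * grad_log_pi

print(softmax(logits))  # action 1 ends up strongly reinforced
```

Nothing in this update hands the network "a goal to pursue"; the reward just controls how strongly the computations behind the sampled action get reinforced.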
- Power as Easily Exploitable Opportunities
- Generalizing POWER to Multi-Agent Games
- MDP Models Are Determined by the Agent Architecture and the Environment
- Environmental Structure Can Cause Instrumental Convergence
- A World in Which the Alignment Problem Seems Lower-Stakes
- The More Power at Stake, the Stronger Instrumental Convergence
- Seeking Power Is Convergently Instrumental in a Broad Class of Environments
- When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives
- Satisficers Tend to Seek Power: Instrumental Convergence via Retargetability
- Instrumental Convergence for Realistic Agent Objectives
- Parametrically Retargetable Decision-Makers Tend to Seek Power
Original sequence description: My writings on different kinds of corrigibility. These thoughts build on each other and form part of my alignment worldview (circa 2021), but they are not yet woven into a coherent narrative.
- Non-Obstruction: A Simple Concept Motivating Corrigibility
- Corrigibility As Outside View
- A Certain Formalization of Corrigibility is VNM-Incoherent
- Formalizing Policy Modification Corrigibility
In early 2022, Quintin Pope and I noticed glaring problems at the heart of “classical” alignment arguments. We thought through the problem with fresh eyes and derived shard theory.
Classical arguments focus on what the goal of an AI will be. Why? I’ve never heard a good answer. Shard theory redirects our attention away from fixed, single objectives. The basic upshot of shard theory: AIs and humans are well-understood¹ as having a bunch of situationally activated goals—“shards” of desire and preference.
For example, you probably care more about people you can see. Shard theory predicts this outcome. Consider the learned decision-making circuits that bid for actions which take care of your friend Bill. These circuits were probably formed while you could see Bill (or perhaps the vast majority of your “caring about people” circuits were formed while physically around people). If you can see Bill, the situation is more “similar to the training distribution” for your “caring about Bill” shard, so that shard is especially likely to fire.
Thus, it seems OK if our AIs don’t have “perfect” shard mixtures. The stronger their “aligned shards”, the more human welfare weighs on their decision-making. We’re playing a game of inches, so let’s play to win.
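To gesture at the picture more concretely, here is a toy model (purely illustrative Python; the shards, features, and numbers are made up) where shards are contextually activated circuits whose bids get scaled by how much the current situation resembles the situations that formed them:

```python
from dataclasses import dataclass

@dataclass
class Shard:
    """A contextually activated bundle of decision-making circuits."""
    name: str
    trigger_features: set  # features of the situations that formed this shard
    bids: dict             # action -> bid strength when fully activated

    def activation(self, situation: set) -> float:
        # Crude proxy for "similarity to the shard's formative situations":
        # the fraction of trigger features present in the current situation.
        if not self.trigger_features:
            return 0.0
        return len(self.trigger_features & situation) / len(self.trigger_features)

def choose_action(shards, situation: set) -> str:
    # Each shard's bids are scaled by how strongly the situation activates it;
    # the action with the largest total bid wins.
    totals: dict[str, float] = {}
    for shard in shards:
        a = shard.activation(situation)
        for action, strength in shard.bids.items():
            totals[action] = totals.get(action, 0.0) + a * strength
    return max(totals, key=totals.get)

# Hypothetical shards: "care about Bill" formed mostly while seeing Bill.
care_about_bill = Shard("care-about-Bill", {"bill_visible"}, {"help_bill": 1.0})
snack_shard = Shard("get-snack", {"hungry"}, {"grab_snack": 0.6})

print(choose_action([care_about_bill, snack_shard], {"bill_visible", "hungry"}))  # help_bill
print(choose_action([care_about_bill, snack_shard], {"hungry"}))                  # grab_snack
```

The point of the toy model: decision-making is a weighted tug-of-war among contextual influences, not the readout of a single fixed objective.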
- Humans Provide an Untapped Wealth of Evidence About Alignment
- Human Values & Biases Are Inaccessible to the Genome
- General Alignment Properties
- Evolution Is a Bad Analogy for AGI: Inner Alignment
- Reward Is Not the Optimization Target
- The Shard Theory of Human Values
- Understanding and Avoiding Value Drift
- A Shot at the Diamond-Alignment Problem
- Don’t Design Agents Which Exploit Adversarial Inputs
- Don’t Align Agents to Evaluations of Plans
- Alignment Allows “Nonrobust” Decision-Influences and Doesn’t Require Robust Grading
- Inner and Outer Alignment Decompose One Hard Problem Into Two Extremely Hard Problems
My work with my MATS 3.0 scholars, Ulisse Mini and Peli Grietzer!
Original sequence description: Mechanistic interpretability on a pretrained policy network from “Goal Misgeneralization in Deep Reinforcement Learning”.
- Predictions for Shard Theory Mechanistic Interpretability Results
- Understanding and Controlling a Maze-Solving Policy Network
- Maze-Solving Agents: Add a Top-Right Vector, Make the Agent Go to the Top-Right
- Behavioral Statistics for a Maze-Solving Agent
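The “top-right vector” post modifies the policy’s forward passes by adding a vector to its internal activations. Here is a minimal, generic sketch of that flavor of activation addition (hypothetical network, layer choice, and vector construction; not the actual maze-policy code): record activations in two contrasting situations, take their difference, and add it back in at that layer on later forward passes.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in policy network; the real maze policy is a conv net.
policy = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 4),
)
target_layer = policy[2]  # layer whose activations we record and steer

captured = {}

def capture_hook(module, inputs, output):
    captured["act"] = output.detach()

obs_a = torch.randn(1, 16)  # stand-in for a situation eliciting the desired behavior
obs_b = torch.randn(1, 16)  # stand-in for a contrasting situation

with torch.no_grad():
    handle = target_layer.register_forward_hook(capture_hook)
    policy(obs_a)
    act_a = captured["act"]
    policy(obs_b)
    act_b = captured["act"]
    handle.remove()

steering_vector = act_a - act_b  # the "add this direction" vector

def steering_hook(module, inputs, output):
    # Add the steering vector to this layer's activations on every forward pass.
    return output + steering_vector

with torch.no_grad():
    handle = target_layer.register_forward_hook(steering_hook)
    steered_logits = policy(torch.randn(1, 16))
    handle.remove()

print(steered_logits)
```

Whether a vector built this crudely actually redirects behavior is an empirical question; see the posts above for what we found.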
Team Shard’s later experimental work provided strong evidence that AI policies are not just well-understood as having shards, but in fact mechanistically learn a blend of multiple situationally activated goals. ⤴