17 items with this tag.
Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake
Paper: Understanding and Controlling a Maze-Solving Policy Network
Steering GPT-2-XL by Adding an Activation Vector
Behavioural Statistics for a Maze-Solving Agent
Definitive Confirmation of Shard Theory
Understanding and Controlling a Maze-Solving Policy Network
Predictions for Shard Theory Mechanistic Interpretability Results
Positive Values Seem More Robust and Lasting than Prohibitions
Inner and Outer Alignment Decompose One Hard Problem Into Two Extremely Hard Problems
Alignment Allows “Non-Robust” Decision-Influences and Doesn’t Require Robust Grading
A Shot at the Diamond-Alignment Problem
Understanding and Avoiding Value Drift
The Shard Theory of Human Values
General Alignment Properties
Reward Is Not the Optimization Target
Humans Provide an Untapped Wealth of Evidence About Alignment
Human Values & Biases Are Inaccessible to the Genome