The Pond

Tag: shard theory

18 items with this tag.

12/23/2025
Apply for Alignment Mentorship From TurnTrout and Alex Cloud
10/30/2024
Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake
10/13/2023
Paper: Understanding and Controlling a Maze-Solving Policy Network
5/13/2023
Steering GPT-2-XL by Adding an Activation Vector
4/20/2023
Behavioural Statistics for a Maze-Solving Agent
4/1/2023
Definitive Confirmation of Shard Theory
3/11/2023
Understanding and Controlling a Maze-Solving Policy Network
3/1/2023
Predictions for Shard Theory Mechanistic Interpretability Results
12/17/2022
Positive Values Seem More Robust and Lasting than Prohibitions
12/2/2022
Inner and Outer Alignment Decompose One Hard Problem Into Two Extremely Hard Problems
11/29/2022
Alignment Allows “Non-Robust” Decision-Influences and Doesn’t Require Robust Grading
10/6/2022
A Shot at the Diamond-Alignment Problem
- shard theory
- AI
9/9/2022
Understanding and Avoiding Value Drift
9/4/2022
The Shard Theory of Human Values
8/8/2022
General Alignment Properties
- shard theory
- AI
7/25/2022
Reward Is Not the Optimization Target
7/14/2022
Humans Provide an Untapped Wealth of Evidence About Alignment
7/7/2022
Human Values & Biases Are Inaccessible to the Genome