1/30/2025 | Steering Gemini Using BIDPO Vectors | activation engineering, AI
1/23/2025 | Insights From “The Manga Guide to Physiology” | understanding the world, summaries
1/15/2025 | Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses | critique, AI
12/17/2024 | Breaking Free with Dr. Stone | personal, fiction
12/16/2024 | Creating Interpretable Latent Spaces with Gradient Routing | mats program, AI
12/5/2024 | Gradient Routing: Masking Gradients to Localize Computation in Neural Networks | mats program, AI
12/4/2024 | Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models | activation engineering, mats program, AI
12/4/2024 | Testing Site Features | website
10/31/2024 | Mistaken Claims I’ve Made | personal, website
10/31/2024 | The Design of This Website | website
10/30/2024 | Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake | instrumental convergence, shard theory, AI
10/30/2024 | Site Launch: Come Relax by The Pond! | personal, website
10/27/2024 | About Me | personal
10/27/2024 | My Research | AI
10/27/2024 | Posts & Sequences | website
10/27/2024 | I’m that “Other Fish in the Sea” | personal
10/27/2024 | The Pond | website
10/27/2024 | Can Transformers Act on Information Beyond an Effective Layer Horizon? | understanding the world, AI
7/15/2024 | I Found >800 Orthogonal “Write Code” Steering Vectors | activation engineering, mats program, AI
4/30/2024 | Mechanistically Eliciting Latent Behaviors in Language Models | understanding the world, activation engineering, mats program, AI
3/5/2024 | Many Arguments for AI X-Risk Are Wrong | critique, AI
2/10/2024 | Dreams of AI Alignment: The Danger of Suggestive Names | rationality, critique, AI
1/2/2024 | Steering Llama-2 with Contrastive Activation Additions | activation engineering, corrigibility, mats program, AI
10/16/2023 | How Should TurnTrout Handle His DeepMind Equity Situation? | practical
10/13/2023 | Paper: Understanding and Controlling a Maze-Solving Policy Network | activation engineering, shard theory, AI
9/9/2023 | AI Presidents Discuss AI Alignment Agendas | humor, AI
9/6/2023 | ActAdd: Steering Language Models without Optimization | activation engineering, AI
7/24/2023 | Open Problems in Activation Engineering | activation engineering, AI
6/20/2023 | Ban Development of Unpredictable Powerful Models? | AI
6/19/2023 | Mode Collapse in RL May Be Fueled by the Update Equation | reinforcement learning, AI
6/2/2023 | Think Carefully Before Calling RL Policies “Agents” | reinforcement learning, AI
5/13/2023 | Steering GPT-2-XL by Adding an Activation Vector | activation engineering, shard theory, mats program, AI
5/7/2023 | Residual Stream Norms Grow Exponentially over the Forward Pass | AI
4/20/2023 | Behavioural Statistics for a Maze-Solving Agent | mats program, shard theory, AI
4/1/2023 | Definitive Confirmation of Shard Theory | shard theory, humor, AI
3/31/2023 | Maze-Solving Agents: Add a Top-Right Vector, Make the Agent Go to the Top-Right | activation engineering, AI
3/11/2023 | Understanding and Controlling a Maze-Solving Policy Network | activation engineering, mats program, shard theory, AI
3/1/2023 | Predictions for Shard Theory Mechanistic Interpretability Results | mats program, shard theory, rationality, AI
2/18/2023 | Parametrically Retargetable Decision-Makers Tend to Seek Power | instrumental convergence, AI
1/24/2023 | Some of My Disagreements with List of Lethalities | AI
12/17/2022 | Positive Values Seem More Robust and Lasting than Prohibitions | shard theory, human values, AI
12/2/2022 | Inner and Outer Alignment Decompose One Hard Problem Into Two Extremely Hard Problems | shard theory, critique, AI
11/29/2022 | Alignment Allows “Non-Robust” Decision-Influences and Doesn’t Require Robust Grading | shard theory, human values, AI
11/26/2022 | Don’t Align Agents to Evaluations of Plans | critique, AI
11/18/2022 | Don’t Design Agents Which Exploit Adversarial Inputs | critique, AI
11/8/2022 | People Care About Each Other Even Though They Have Imperfect Motivational Pointers? | corrigibility, AI
10/6/2022 | A Shot at the Diamond-Alignment Problem | shard theory, AI
10/2/2022 | Four Usages of “Loss” In AI | AI
9/30/2022 | Bruce Wayne and the Cost of Inaction | rationality, fiction
9/9/2022 | Understanding and Avoiding Value Drift | human values, shard theory, rationality, AI
9/4/2022 | The Shard Theory of Human Values | understanding the world, shard theory, human values, rationality, AI
8/11/2022 | Seriously, What Goes Wrong with “Reward the Agent when It Makes You Smile”? | AI
8/8/2022 | General Alignment Properties | shard theory, AI
7/25/2022 | Reward Is Not the Optimization Target | reinforcement learning, shard theory, AI
7/14/2022 | Humans Provide an Untapped Wealth of Evidence About Alignment | shard theory, human values, AI
7/7/2022 | Human Values & Biases Are Inaccessible to the Genome | understanding the world, shard theory, human values
6/30/2022 | Looking Back on My Alignment PhD | growth stories, rationality, personal, AI
4/10/2022 | Emotionally Confronting Doom | rationality, practical, community
3/27/2022 | Do a Cost-Benefit Analysis of Your Technology Usage | practical
2/22/2022 | ELK Proposal: Thinking Via A Human Imitator | AI
1/22/2022 | Instrumental Convergence For Realistic Agent Objectives | instrumental convergence, AI
12/3/2021 | Formalizing Policy-Modification Corrigibility | corrigibility, AI
11/20/2021 | A Certain Formalization of Corrigibility Is VNM-Incoherent | instrumental convergence, corrigibility, AI
11/18/2021 | Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability | instrumental convergence, AI
11/2/2021 | You Should Read “Harry Potter and the Methods of Rationality” | rationality, personal, fiction
9/22/2021 | Insights From Modern Principles of Economics | understanding the world
8/9/2021 | When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives | instrumental convergence, rationality, AI
8/8/2021 | Seeking Power Is Convergently Instrumental in a Broad Class of Environments | instrumental convergence, AI
7/11/2021 | The More Power At Stake, The Stronger Instrumental Convergence Gets For Optimal Policies | instrumental convergence, AI
7/8/2021 | A World in Which the Alignment Problem Seems Lower-Stakes | instrumental convergence, AI
6/22/2021 | Environmental Structure Can Cause Instrumental Convergence | instrumental convergence, AI
6/16/2021 | How Can We Quantify Player Alignment in 2×2 Normal-Form Games? | game theory, AI
6/8/2021 | Conservative Agency with Multiple Stakeholders | impact regularization, AI
6/8/2021 | Game-Theoretic Alignment in Terms of Attainable Utility | game theory, AI
5/26/2021 | MDP Models Are Determined by the Agent Architecture and the Environmental Dynamics | instrumental convergence, AI
3/22/2021 | Generalizing POWER to Multi-Agent Games | instrumental convergence, understanding the world, AI
1/23/2021 | Lessons I’ve Learned From Self-Teaching | scholarship and learning, rationality, practical
1/12/2021 | Review of “Debate on Instrumental Convergence Between LeCun, Russell, Bengio, Zador, and More” | instrumental convergence, AI
1/6/2021 | Review of “But Exactly How Complex and Fragile?” | AI
12/30/2020 | Collider Bias as a Cognitive Blindspot? | rationality
12/23/2020 | 2019 Review Rewrite: Seeking Power Is Often Robustly Instrumental in MDPs | instrumental convergence, AI
12/12/2020 | Avoiding Side Effects in Complex Environments | impact regularization, AI
11/21/2020 | Non-Obstruction: A Simple Concept Motivating Corrigibility | corrigibility, AI
10/2/2020 | Math That Clicks: Look for Two-Way Correspondences | understanding the world, rationality
8/1/2020 | Power as Easily Exploitable Opportunities | instrumental convergence, AI
7/23/2020 | GPT-3 Gems | AI
7/20/2020 | To What Extent Is GPT-3 Capable of Reasoning? | AI
7/12/2020 | Formalizing “Defection” Using Game Theory | game theory, rationality, AI
5/8/2020 | Corrigibility as Outside View | corrigibility, AI
5/4/2020 | Insights From Euclid’s “Elements” | scholarship and learning, understanding the world
4/22/2020 | Problem Relaxation as a Tactic | rationality, AI
4/4/2020 | A Kernel of Truth: Insights From “A Friendly Approach to Functional Analysis” | scholarship and learning, understanding the world
3/25/2020 | ODE to Joy: Insights From “A First Course in Ordinary Differential Equations” | scholarship and learning, understanding the world
2/28/2020 | Conclusion to “Reframing Impact” | impact regularization, AI
2/27/2020 | Attainable Utility Preservation: Scaling to Superhuman | impact regularization, AI
2/27/2020 | Reasons for Excitement About Impact of Impact Measure Research | impact regularization, AI
2/25/2020 | Choosing the Strength of the Impact Penalty Term | impact regularization, AI
2/22/2020 | Attainable Utility Preservation: Empirical Results | impact regularization, AI
2/22/2020 | Continuous Improvement: Insights From “Topology” | summaries
2/17/2020 | Attainable Utility Preservation: Concepts | impact regularization, AI
2/14/2020 | The Catastrophic Convergence Conjecture | instrumental convergence, impact regularization, AI
2/10/2020 | Attainable Utility Landscape: How The World Is Changed | understanding the world, impact regularization
1/10/2020 | On Being Robust | rationality, personal
12/29/2019 | Judgment Day: Insights From “Judgment in Managerial Decision Making” | scholarship and learning, understanding the world
12/22/2019 | Can Fear of the Dark Bias Us More Generally? | understanding the world
12/5/2019 | Seeking Power Is Often Convergently Instrumental in MDPs | instrumental convergence, AI
11/19/2019 | How I Do Research | scholarship and learning, rationality
10/10/2019 | Thoughts on “Human-Compatible” | AI
10/7/2019 | The Gears of Impact | understanding the world, impact regularization
10/1/2019 | World State Is the Wrong Abstraction for Impact | understanding the world, impact regularization, AI
9/27/2019 | Attainable Utility Theory: Why Things Matter | understanding the world, impact regularization
9/24/2019 | Deducing Impact | understanding the world, impact regularization
9/23/2019 | Value Impact | understanding the world, impact regularization
9/20/2019 | Reframing Impact | impact regularization, AI
9/13/2019 | What You See Isn’t Always What You Want | reinforcement learning, AI
4/10/2019 | Best Reasons for Pessimism About Impact of Impact Measures? | impact regularization
1/16/2019 | And My Axiom! Insights From “Computability and Logic” | summaries
12/28/2018 | Penalizing Impact via Attainable Utility Preservation | impact regularization, AI
9/18/2018 | Towards a New Impact Measure | impact regularization
9/2/2018 | Impact Measure Desiderata | impact regularization
8/24/2018 | Turning Up the Heat: Insights From Tao’s “Analysis II” | summaries
7/29/2018 | I Want to Take Off the Coat | rationality, personal
7/5/2018 | Making a Difference Tempore: Insights From “Reinforcement Learning: An Introduction” | reinforcement learning, summaries
6/30/2018 | Overcoming Clinginess in Impact Measures | impact regularization
6/16/2018 | Worrying About the Vase: Whitelisting | impact regularization, AI
6/3/2018 | Swimming Upstream: A Case Study in Instrumental Rationality | growth stories, practical, personal
6/1/2018 | Into the Kiln: Insights From Tao’s “Analysis I” | understanding the world, summaries
5/3/2018 | Confounded No Longer: Insights From “All of Statistics” | scholarship and learning, summaries
4/30/2018 | Internalizing Internal Double Crux | rationality, practical, personal
4/22/2018 | The First Rung: Insights From “Linear Algebra Done Right” | scholarship and learning
4/3/2018 | Unyielding Yoda Timers: Taking the Hammertime Final Exam | rationality
3/28/2018 | Open-Category Classification | AI
3/25/2018 | The Art of the Artificial: Insights From “Artificial Intelligence: A Modern Approach” | scholarship and learning, AI
3/21/2018 | Lightness and Unease | personal
3/7/2018 | How to Dissolve It | rationality, practical
2/28/2018 | Set Up for Success: Insights From “Naïve Set Theory” | scholarship and learning
2/26/2018 | Walkthrough of “Formalizing Convergent Instrumental Goals” | instrumental convergence, AI
1/24/2018 | Interpersonal Approaches for X-Risk Education | community