97 items with this tag.

- 12/23/2025: Recontextualization Mitigates Specification Gaming Without Modifying the Specification (specification gaming, mats program, AI)
- 12/23/2025: Apply for Alignment Mentorship From TurnTrout and Alex Cloud (mats program, shard theory, community, AI)
- 12/18/2025: 2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target (reinforcement learning, specification gaming, AI)
- 11/22/2025: Output Supervision Can Obfuscate the CoT (reinforcement learning, mats program, AI)
- 11/6/2025: Consistency Training Helps Stop Sycophancy and Jailbreaks (activation engineering, deepmind, AI)
- 7/23/2025: We Built a Tool to Protect Your Dataset From Simple Scrapers (open source, AI)
- 6/29/2025: A Simple Explanation of AGI Risk (talk notes, grinnell, AI)
- 6/13/2025: Distillation Robustifies Unlearning (mats program, AI)
- 3/1/2025: Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models (AI)
- 1/30/2025: Steering Gemini Using BIDPO Vectors (activation engineering, deepmind, AI)
- 1/15/2025: Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses (critique, deepmind, AI)
- 12/16/2024: Creating Interpretable Latent Spaces with Gradient Routing (mats program, AI)
- 12/5/2024: Gradient Routing: Masking Gradients to Localize Computation in Neural Networks (mats program, AI)
- 12/4/2024: Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models (activation engineering, mats program, AI)
- 10/30/2024: Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake (instrumental convergence, shard theory, AI)
- 10/27/2024: My Research (AI)
- 10/27/2024: Can Transformers Act on Information Beyond an Effective Layer Horizon? (understanding the world, AI)
- 7/15/2024: I Found >800 Orthogonal “Write Code” Steering Vectors (activation engineering, mats program, AI)
- 4/30/2024: Mechanistically Eliciting Latent Behaviors in Language Models (understanding the world, activation engineering, mats program, AI)
- 3/5/2024: Many Arguments for AI X-Risk Are Wrong (critique, AI)
- 2/10/2024: Dreams of AI Alignment: The Danger of Suggestive Names (rationality, critique, AI)
- 1/19/2024: Don’t Use the “Shoggoth” Meme to Portray LLMs (critique, AI)
- 1/2/2024: Steering Llama-2 with Contrastive Activation Additions (activation engineering, corrigibility, mats program, AI)
- 10/13/2023: Paper: Understanding and Controlling a Maze-Solving Policy Network (activation engineering, shard theory, AI)
- 9/9/2023: AI Presidents Discuss AI Alignment Agendas (humor, AI)
- 9/6/2023: ActAdd: Steering Language Models without Optimization (activation engineering, AI)
- 7/24/2023: Open Problems in Activation Engineering (activation engineering, AI)
- 6/20/2023: Ban Development of Unpredictable Powerful Models? (AI)
- 6/19/2023: Mode Collapse in RL May Be Fueled by the Update Equation (reinforcement learning, AI)
- 6/2/2023: Think Carefully Before Calling RL Policies “Agents” (reinforcement learning, AI)
- 5/13/2023: Steering GPT-2-XL by Adding an Activation Vector (activation engineering, shard theory, mats program, AI)
- 5/7/2023: Residual Stream Norms Grow Exponentially over the Forward Pass (AI)
- 4/20/2023: Behavioural Statistics for a Maze-Solving Agent (mats program, shard theory, AI)
- 4/1/2023: Definitive Confirmation of Shard Theory (shard theory, humor, AI)
- 3/31/2023: Maze-Solving Agents: Add a Top-Right Vector, Make the Agent Go to the Top-Right (activation engineering, AI)
- 3/11/2023: Understanding and Controlling a Maze-Solving Policy Network (activation engineering, mats program, shard theory, AI)
- 3/1/2023: Predictions for Shard Theory Mechanistic Interpretability Results (mats program, shard theory, rationality, AI)
- 2/18/2023: Parametrically Retargetable Decision-Makers Tend to Seek Power (instrumental convergence, AI)
- 1/24/2023: Some of My Disagreements with List of Lethalities (AI)
- 12/17/2022: Positive Values Seem More Robust and Lasting than Prohibitions (shard theory, human values, AI)
- 12/2/2022: Inner and Outer Alignment Decompose One Hard Problem Into Two Extremely Hard Problems (shard theory, critique, AI)
- 11/29/2022: Alignment Allows “Non-Robust” Decision-Influences and Doesn’t Require Robust Grading (shard theory, human values, AI)
- 11/26/2022: Don’t Align Agents to Evaluations of Plans (critique, AI)
- 11/18/2022: Don’t Design Agents Which Exploit Adversarial Inputs (critique, AI)
- 11/8/2022: People Care About Each Other Even Though They Have Imperfect Motivational Pointers? (corrigibility, AI)
- 10/6/2022: A Shot at the Diamond-Alignment Problem (shard theory, AI)
- 10/2/2022: Four Usages of “Loss” In AI (AI)
- 9/9/2022: Understanding and Avoiding Value Drift (human values, shard theory, rationality, AI)
- 9/4/2022: The Shard Theory of Human Values (understanding the world, shard theory, human values, rationality, AI)
- 8/11/2022: Seriously, What Goes Wrong with “Reward the Agent when It Makes You Smile”? (AI)
- 8/8/2022: General Alignment Properties (shard theory, AI)
- 7/25/2022: Reward Is Not the Optimization Target (reinforcement learning, shard theory, AI)
- 7/14/2022: Humans Provide an Untapped Wealth of Evidence About Alignment (shard theory, human values, AI)
- 6/30/2022: Looking Back on My Alignment PhD (growth stories, rationality, personal, AI)
- 2/22/2022: ELK Proposal: Thinking Via A Human Imitator (AI)
- 1/22/2022: Instrumental Convergence For Realistic Agent Objectives (instrumental convergence, AI)
- 12/3/2021: Formalizing Policy-Modification Corrigibility (corrigibility, AI)
- 11/20/2021: A Certain Formalization of Corrigibility Is VNM-Incoherent (instrumental convergence, corrigibility, AI)
- 11/18/2021: Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability (instrumental convergence, AI)
- 8/9/2021: When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives (instrumental convergence, rationality, AI)
- 8/8/2021: Seeking Power Is Convergently Instrumental in a Broad Class of Environments (instrumental convergence, AI)
- 7/11/2021: The More Power At Stake, The Stronger Instrumental Convergence Gets For Optimal Policies (instrumental convergence, AI)
- 7/8/2021: A World in Which the Alignment Problem Seems Lower-Stakes (instrumental convergence, AI)
- 6/22/2021: Environmental Structure Can Cause Instrumental Convergence (instrumental convergence, AI)
- 6/16/2021: How Can We Quantify Player Alignment in 2×2 Normal-Form Games? (game theory, AI)
- 6/8/2021: Conservative Agency with Multiple Stakeholders (impact regularization, talk notes, AI)
- 6/8/2021: Game-Theoretic Alignment in Terms of Attainable Utility (game theory, AI)
- 5/26/2021: MDP Models Are Determined by the Agent Architecture and the Environmental Dynamics (instrumental convergence, AI)
- 3/22/2021: Generalizing POWER to Multi-Agent Games (instrumental convergence, understanding the world, AI)
- 1/12/2021: Review of “Debate on Instrumental Convergence Between LeCun, Russell, Bengio, Zador, and More” (instrumental convergence, AI)
- 1/6/2021: Review of “But Exactly How Complex and Fragile?” (AI)
- 12/23/2020: 2019 Review Rewrite: Seeking Power Is Often Robustly Instrumental in MDPs (instrumental convergence, AI)
- 12/12/2020: Avoiding Side Effects in Complex Environments (impact regularization, AI)
- 11/21/2020: Non-Obstruction: A Simple Concept Motivating Corrigibility (corrigibility, AI)
- 8/1/2020: Power as Easily Exploitable Opportunities (instrumental convergence, talk notes, AI)
- 7/23/2020: GPT-3 Gems (AI)
- 7/20/2020: To What Extent Is GPT-3 Capable of Reasoning? (AI)
- 7/12/2020: Formalizing “Defection” Using Game Theory (game theory, rationality, AI)
- 5/8/2020: Corrigibility as Outside View (corrigibility, AI)
- 4/22/2020: Problem Relaxation as a Tactic (rationality, AI)
- 2/28/2020: Conclusion to “Reframing Impact” (impact regularization, AI)
- 2/27/2020: Attainable Utility Preservation: Scaling to Superhuman (impact regularization, AI)
- 2/27/2020: Reasons for Excitement About Impact of Impact Measure Research (impact regularization, AI)
- 2/25/2020: Choosing the Strength of the Impact Penalty Term (impact regularization, AI)
- 2/22/2020: Attainable Utility Preservation: Empirical Results (impact regularization, AI)
- 2/17/2020: Attainable Utility Preservation: Concepts (impact regularization, AI)
- 2/14/2020: The Catastrophic Convergence Conjecture (instrumental convergence, impact regularization, AI)
- 12/5/2019: Seeking Power Is Often Convergently Instrumental in MDPs (instrumental convergence, AI)
- 10/10/2019: Thoughts on “Human-Compatible” (AI)
- 10/1/2019: World State Is the Wrong Abstraction for Impact (understanding the world, impact regularization, AI)
- 9/20/2019: Reframing Impact (impact regularization, AI)
- 9/13/2019: What You See Isn’t Always What You Want (reinforcement learning, AI)
- 12/28/2018: Penalizing Impact via Attainable Utility Preservation (impact regularization, AI)
- 6/16/2018: Worrying About the Vase: Whitelisting (impact regularization, AI)
- 3/28/2018: Open-Category Classification (AI)
- 3/25/2018: The Art of the Artificial: Insights From “Artificial Intelligence: A Modern Approach” (scholarship and learning, AI)
- 2/26/2018: Walkthrough of “Formalizing Convergent Instrumental Goals” (instrumental convergence, AI)