84 items with this tag.
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models
Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake
My Research
Can Transformers Act on Information Beyond an Effective Layer Horizon?
I Found >800 Orthogonal “Write Code” Steering Vectors
Mechanistically Eliciting Latent Behaviors in Language Models
Many Arguments for AI X-Risk Are Wrong
Dreams of AI Alignment: The Danger of Suggestive Names
Steering Llama-2 with Contrastive Activation Additions
Paper: Understanding and Controlling a Maze-Solving Policy Network
AI Presidents Discuss AI Alignment Agendas
ActAdd: Steering Language Models without Optimization
Open Problems in Activation Engineering
Ban Development of Unpredictable Powerful Models?
Mode Collapse in RL May Be Fueled by the Update Equation
Think Carefully Before Calling RL Policies “Agents”
Steering GPT-2-XL by Adding an Activation Vector
Residual Stream Norms Grow Exponentially over the Forward Pass
Behavioural Statistics for a Maze-Solving Agent
Definitive Confirmation of Shard Theory
Maze-Solving Agents: Add a Top-Right Vector, Make the Agent Go to the Top-Right
Understanding and Controlling a Maze-Solving Policy Network
Predictions for Shard Theory Mechanistic Interpretability Results
Parametrically Retargetable Decision-Makers Tend to Seek Power
Some of My Disagreements with List of Lethalities
Positive Values Seem More Robust and Lasting than Prohibitions
Inner and Outer Alignment Decompose One Hard Problem Into Two Extremely Hard Problems
Alignment Allows “Non-Robust” Decision-Influences and Doesn’t Require Robust Grading
Don’t Align Agents to Evaluations of Plans
Don’t Design Agents Which Exploit Adversarial Inputs
People Care About Each Other Even Though They Have Imperfect Motivational Pointers?
A Shot at the Diamond-Alignment Problem
Four Usages of “Loss” In AI
Understanding and Avoiding Value Drift
The Shard Theory of Human Values
Seriously, What Goes Wrong with “Reward the Agent when It Makes You Smile”?
General Alignment Properties
Reward Is Not the Optimization Target
Humans Provide an Untapped Wealth of Evidence About Alignment
Looking Back on My Alignment PhD
ELK Proposal: Thinking Via A Human Imitator
Instrumental Convergence For Realistic Agent Objectives
Formalizing Policy-Modification Corrigibility
A Certain Formalization of Corrigibility Is VNM-Incoherent
Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability
When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives
Seeking Power Is Convergently Instrumental in a Broad Class of Environments
The More Power At Stake, The Stronger Instrumental Convergence Gets For Optimal Policies
A World in Which the Alignment Problem Seems Lower-Stakes
Environmental Structure Can Cause Instrumental Convergence
Open Problem: How Can We Quantify Player Alignment in 2×2 Normal-Form Games?
Conservative Agency with Multiple Stakeholders
Game-Theoretic Alignment in Terms of Attainable Utility
MDP Models Are Determined by the Agent Architecture and the Environmental Dynamics
Generalizing POWER to Multi-Agent Games
Review of “Debate on Instrumental Convergence Between LeCun, Russell, Bengio, Zador, and More”
Review of “But Exactly How Complex and Fragile?”
2019 Review Rewrite: Seeking Power Is Often Robustly Instrumental in MDPs
Avoiding Side Effects in Complex Environments
Non-Obstruction: A Simple Concept Motivating Corrigibility
Power as Easily Exploitable Opportunities
GPT-3 Gems
To What Extent Is GPT-3 Capable of Reasoning?
Formalizing “Defection” Using Game Theory
Corrigibility as Outside View
Problem Relaxation as a Tactic
Conclusion to “Reframing Impact”
Attainable Utility Preservation: Scaling to Superhuman
Reasons for Excitement About Impact of Impact Measure Research
Choosing the Strength of the Impact Penalty Term
Attainable Utility Preservation: Empirical Results
Attainable Utility Preservation: Concepts
The Catastrophic Convergence Conjecture
Seeking Power Is Often Convergently Instrumental in MDPs
Thoughts on “Human-Compatible”
World State Is the Wrong Abstraction for Impact
Reframing Impact
What You See Isn’t Always What You Want
Penalizing Impact via Attainable Utility Preservation
Worrying About the Vase: Whitelisting
Open-Category Classification
The Art of the Artificial: Insights From “Artificial Intelligence: A Modern Approach”
Walkthrough of “Formalizing Convergent Instrumental Goals”