84 items with this tag.
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models
Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake
My Research
Can Transformers Act on Information Beyond an Effective Layer Horizon?
I Found >800 Orthogonal “Write Code” Steering Vectors
Mechanistically Eliciting Latent Behaviors in Language Models
Many Arguments for AI X-Risk Are Wrong
Dreams of AI Alignment: The Danger of Suggestive Names
Steering Llama-2 with Contrastive Activation Additions
Paper: Understanding and Controlling a Maze-Solving Policy Network
AI Presidents Discuss AI Alignment Agendas
ActAdd: Steering Language Models without Optimization
Open Problems in Activation Engineering
Ban Development of Unpredictable Powerful Models?
Mode Collapse in RL May Be Fueled by the Update Equation
Think Carefully Before Calling RL Policies “Agents”
Steering GPT-2-XL by Adding an Activation Vector
Residual Stream Norms Grow Exponentially over the Forward Pass
Behavioural Statistics for a Maze-Solving Agent
Definitive Confirmation of Shard Theory
Maze-Solving Agents: Add a Top-Right Vector, Make the Agent Go to the Top-Right
Understanding and Controlling a Maze-Solving Policy Network
Predictions for Shard Theory Mechanistic Interpretability Results
Parametrically Retargetable Decision-Makers Tend to Seek Power
Some of My Disagreements with List of Lethalities
Positive Values Seem More Robust and Lasting than Prohibitions
Inner and Outer Alignment Decompose One Hard Problem Into Two Extremely Hard Problems
Alignment Allows “Non-Robust” Decision-Influences and Doesn’t Require Robust Grading
Don’t Align Agents to Evaluations of Plans
Don’t Design Agents Which Exploit Adversarial Inputs
People Care About Each Other Even Though They Have Imperfect Motivational Pointers?
A Shot at the Diamond-Alignment Problem
Four Usages of “Loss” In AI
Understanding and Avoiding Value Drift
The Shard Theory of Human Values
Seriously, What Goes Wrong with “Reward the Agent when It Makes You Smile”?
General Alignment Properties
Reward Is Not the Optimization Target
Humans Provide an Untapped Wealth of Evidence About Alignment
Looking Back on My Alignment PhD
ELK Proposal: Thinking Via A Human Imitator
Instrumental Convergence For Realistic Agent Objectives
Formalizing Policy-Modification Corrigibility
A Certain Formalization of Corrigibility Is VNM-Incoherent
Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability
When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives
Seeking Power Is Convergently Instrumental in a Broad Class of Environments
The More Power At Stake, The Stronger Instrumental Convergence Gets For Optimal Policies
A World in Which the Alignment Problem Seems Lower-Stakes
Environmental Structure Can Cause Instrumental Convergence
Open Problem: How Can We Quantify Player Alignment in 2×2 Normal-Form Games?
Conservative Agency with Multiple Stakeholders
Game-Theoretic Alignment in Terms of Attainable Utility
MDP Models Are Determined by the Agent Architecture and the Environmental Dynamics
Generalizing POWER to Multi-Agent Games
Review of “Debate on Instrumental Convergence Between LeCun, Russell, Bengio, Zador, and More”
Review of “But Exactly How Complex and Fragile?”
2019 Review Rewrite: Seeking Power Is Often Robustly Instrumental in MDPs
Avoiding Side Effects in Complex Environments
Non-Obstruction: A Simple Concept Motivating Corrigibility
Power as Easily Exploitable Opportunities
GPT-3 Gems
To What Extent Is GPT-3 Capable of Reasoning?
Formalizing “Defection” Using Game Theory
Corrigibility as Outside View
Problem Relaxation as a Tactic
Conclusion to “Reframing Impact”
Attainable Utility Preservation: Scaling to Superhuman
Reasons for Excitement About Impact of Impact Measure Research
Choosing the Strength of the Impact Penalty Term
Attainable Utility Preservation: Empirical Results
Attainable Utility Preservation: Concepts
The Catastrophic Convergence Conjecture
Seeking Power Is Often Convergently Instrumental in MDPs
Thoughts on “Human-Compatible”
World State Is the Wrong Abstraction for Impact
Reframing Impact
What You See Isn’t Always What You Want
Penalizing Impact via Attainable Utility Preservation
Worrying About the Vase: Whitelisting
Open-Category Classification
The Art of the Artificial: Insights From “Artificial Intelligence: A Modern Approach”
Walkthrough of “Formalizing Convergent Instrumental Goals”