14 items with this tag.

12/23/2025 Recontextualization Mitigates Specification Gaming Without Modifying the Specification [specification gaming, mats program, AI]
12/23/2025 Apply for Alignment Mentorship From TurnTrout and Alex Cloud [mats program, shard theory, community, AI]
11/22/2025 Output Supervision Can Obfuscate the CoT [reinforcement learning, mats program, AI]
6/13/2025 Distillation Robustifies Unlearning [mats program, AI]
12/16/2024 Creating Interpretable Latent Spaces with Gradient Routing [mats program, AI]
12/5/2024 Gradient Routing: Masking Gradients to Localize Computation in Neural Networks [mats program, AI]
12/4/2024 Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models [activation engineering, mats program, AI]
7/15/2024 I Found >800 Orthogonal “Write Code” Steering Vectors [activation engineering, mats program, AI]
4/30/2024 Mechanistically Eliciting Latent Behaviors in Language Models [understanding the world, activation engineering, mats program, AI]
1/2/2024 Steering Llama-2 with Contrastive Activation Additions [activation engineering, corrigibility, mats program, AI]
5/13/2023 Steering GPT-2-XL by Adding an Activation Vector [activation engineering, shard theory, mats program, AI]
4/20/2023 Behavioural Statistics for a Maze-Solving Agent [mats program, shard theory, AI]
3/11/2023 Understanding and Controlling a Maze-Solving Policy Network [activation engineering, mats program, shard theory, AI]
3/1/2023 Predictions for Shard Theory Mechanistic Interpretability Results [mats program, shard theory, rationality, AI]