The Pond

Tag: mats program

16 items with this tag.

7/9/2026
How Robust Are Natural Language Autoencoders to Initialization?
5/24/2026
Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming
12/23/2025
Recontextualization Mitigates Specification Gaming Without Modifying the Specification
12/23/2025
Apply for Alignment Mentorship From TurnTrout and Alex Cloud
11/22/2025
Output Supervision Can Obfuscate the CoT
6/13/2025
Distillation Robustifies Unlearning
12/16/2024
Creating Interpretable Latent Spaces with Gradient Routing
12/5/2024
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
12/4/2024
Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models
7/15/2024
I Found >800 Orthogonal “Write Code” Steering Vectors
4/30/2024
Mechanistically Eliciting Latent Behaviors in Language Models
1/2/2024
Steering Llama-2 with Contrastive Activation Additions
5/13/2023
Steering GPT-2-XL by Adding an Activation Vector
4/20/2023
Behavioural Statistics for a Maze-Solving Agent
3/11/2023
Understanding and Controlling a Maze-Solving Policy Network
3/1/2023
Predictions for Shard Theory Mechanistic Interpretability Results