The Pond

Tag: mats program

12 items with this tag.

  • 6/13/2025

    Distillation Robustifies Unlearning

      mats program, AI

  • 1/21/2025

    Output Supervision Can Obfuscate the CoT

      reinforcement learning, mats program, AI

  • 12/16/2024

    Creating Interpretable Latent Spaces with Gradient Routing

      mats program, AI

  • 12/5/2024

    Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

      mats program, AI

  • 12/4/2024

    Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

      activation engineering, mats program, AI

  • 7/15/2024

    I Found >800 Orthogonal “Write Code” Steering Vectors

      activation engineering, mats program, AI

  • 4/30/2024

    Mechanistically Eliciting Latent Behaviors in Language Models

      understanding the world, activation engineering, mats program, AI

  • 1/2/2024

    Steering Llama-2 with Contrastive Activation Additions

      activation engineering, corrigibility, mats program, AI

  • 5/13/2023

    Steering GPT-2-XL by Adding an Activation Vector

      activation engineering, shard theory, mats program, AI

  • 4/20/2023

    Behavioural Statistics for a Maze-Solving Agent

      mats program, shard theory, AI

  • 3/11/2023

    Understanding and Controlling a Maze-Solving Policy Network

      activation engineering, mats program, shard theory, AI

  • 3/1/2023

    Predictions for Shard Theory Mechanistic Interpretability Results

      mats program, shard theory, rationality, AI