The Pond

Tag: mats program

14 items with this tag.

  • 12/23/2025

    Recontextualization Mitigates Specification Gaming Without Modifying the Specification

    • specification gaming
    • mats program
    • AI

  • 12/23/2025

    Apply for Alignment Mentorship From TurnTrout and Alex Cloud

    • mats program
    • shard theory
    • community
    • AI

  • 11/22/2025

    Output Supervision Can Obfuscate the CoT

    • reinforcement learning
    • mats program
    • AI

  • 6/13/2025

    Distillation Robustifies Unlearning

    • mats program
    • AI

  • 12/16/2024

    Creating Interpretable Latent Spaces with Gradient Routing

    • mats program
    • AI

  • 12/5/2024

    Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

    • mats program
    • AI

  • 12/4/2024

    Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

    • activation engineering
    • mats program
    • AI

  • 7/15/2024

    I Found >800 Orthogonal “Write Code” Steering Vectors

    • activation engineering
    • mats program
    • AI

  • 4/30/2024

    Mechanistically Eliciting Latent Behaviors in Language Models

    • understanding the world
    • activation engineering
    • mats program
    • AI

  • 1/2/2024

    Steering Llama-2 with Contrastive Activation Additions

    • activation engineering
    • corrigibility
    • mats program
    • AI

  • 5/13/2023

    Steering GPT-2-XL by Adding an Activation Vector

    • activation engineering
    • shard theory
    • mats program
    • AI

  • 4/20/2023

    Behavioural Statistics for a Maze-Solving Agent

    • mats program
    • shard theory
    • AI

  • 3/11/2023

    Understanding and Controlling a Maze-Solving Policy Network

    • activation engineering
    • mats program
    • shard theory
    • AI

  • 3/1/2023

    Predictions for Shard Theory Mechanistic Interpretability Results

    • mats program
    • shard theory
    • rationality
    • AI