The Pond

Tag: activation engineering

12 items with this tag.

  • 11/6/2025

    Consistency Training Helps Stop Sycophancy and Jailbreaks

    • activation engineering
    • deepmind
    • AI

  • 1/30/2025

    Steering Gemini Using BIDPO Vectors

    • activation engineering
    • deepmind
    • AI

  • 12/4/2024

    Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

    • activation engineering
    • mats program
    • AI

  • 7/15/2024

    I Found >800 Orthogonal “Write Code” Steering Vectors

    • activation engineering
    • mats program
    • AI

  • 4/30/2024

    Mechanistically Eliciting Latent Behaviors in Language Models

    • understanding the world
    • activation engineering
    • mats program
    • AI

  • 1/2/2024

    Steering Llama-2 with Contrastive Activation Additions

    • activation engineering
    • corrigibility
    • mats program
    • AI

  • 10/13/2023

    Paper: Understanding and Controlling a Maze-Solving Policy Network

    • activation engineering
    • shard theory
    • AI

  • 9/6/2023

    ActAdd: Steering Language Models without Optimization

    • activation engineering
    • AI

  • 7/24/2023

    Open Problems in Activation Engineering

    • activation engineering
    • AI

  • 5/13/2023

    Steering GPT-2-XL by Adding an Activation Vector

    • activation engineering
    • shard theory
    • mats program
    • AI

  • 3/31/2023

    Maze-Solving Agents: Add a Top-Right Vector, Make the Agent Go to the Top-Right

    • activation engineering
    • AI

  • 3/11/2023

    Understanding and Controlling a Maze-Solving Policy Network

    • activation engineering
    • mats program
    • shard theory
    • AI