Mrinank, Austin, and Alex wrote a paper on the results from "Understanding and controlling a maze-solving policy network," "Maze-solving agents: Add a top-right vector, make the agent go to the top-right," and "Behavioural statistics for a maze-solving agent."

Abstract

To understand the goals and goal representations of AI systems, we carefully study a pretrained reinforcement learning policy that solves mazes by navigating to a range of target squares. We find this network pursues multiple context-dependent goals, and we further identify circuits within the network that correspond to one of these goals. In particular, we identified eleven channels that track the location of the goal. By modifying these channels, either with hand-designed interventions or by combining forward passes, we can partially control the policy. We show that this network contains redundant, distributed, and retargetable goal representations, shedding light on the nature of goal-direction in trained policy networks.

We ran a few new experiments, including a quantitative analysis of our retargetability intervention. We’ll walk through those new results now.

A four-panel diagram showing how modifying a neural network's activations changes an AI's behavior. (a) A mouse in a maze ignores cheese to go top-right. (b) A heatmap shows activations peaking at the cheese's location. (c) A new activation peak is manually added in the top-right. (d) The mouse now follows a path to the top-right, retargeted by the modified activation.

Retargeting the mouse to a given square means increasing the probability that the mouse goes to that square. So, to see how likely the mouse is to visit any given square, Alex created a heatmap visualization:

A top-down view of a maze with an AI agent, depicted as a mouse icon, at the starting position in the bottom-left corner. The path from the start to the top-right is mostly bright red, with the color gradually fading for squares off of that path.
Normalized path probability heatmap. The normalized path probability is the geometric average probability, under a policy, along the shortest path to a given point. It roughly measures how likely a policy is to visit that part of the maze.

The color of each maze square shows the normalized path probability for the path from the starting position in the maze to the square. In this image, we show the “base probabilities” under the unmodified policy.
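
To make the "geometric average" definition concrete, here's one way to write it down (notation ours): if the shortest path from the start to square $x$ visits squares $s_0, s_1, \ldots, s_T = x$, and $a_t$ is the action that steps from $s_t$ to $s_{t+1}$, then the normalized path probability of $x$ under policy $\pi$ is

$$p_{\text{norm}}(x) = \left(\prod_{t=0}^{T-1} \pi(a_t \mid s_t)\right)^{1/T},$$

the geometric mean of the per-step action probabilities along that path.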

For each maze square, we can try different retargeting interventions and then plot the new normalized path probability for that square:

Four maze heatmaps comparing AI retargeting methods, with square redness indicating how retargetable the agent is to that square. (a) Base Probability shows a limited red path. (b) Intervening on Channel 55 and (c) All Channels show progressively larger high-probability areas. (d) Directly moving the goal (cheese) is most effective, with the largest red area.
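
For intuition, here is a minimal sketch of what one such channel-level intervention can look like for a PyTorch policy. The layer handle, channel list, and peak magnitude below are illustrative assumptions rather than the exact values used in the paper:

```python
import torch

# Illustrative placeholders: the paper identifies eleven goal-tracking channels
# (channel 55 among them), but the exact indices, layer, and peak magnitude here
# are assumptions made for the sake of the sketch.
CHEESE_CHANNELS = [55]
PEAK_VALUE = 5.0

def make_retargeting_hook(target_row: int, target_col: int):
    """Build a forward hook that overwrites the goal-tracking channels with a
    synthetic activation peak at the target grid position."""
    def hook(module: torch.nn.Module, inputs: tuple, output: torch.Tensor) -> torch.Tensor:
        patched = output.clone()
        for c in CHEESE_CHANNELS:
            patched[:, c] = 0.0                                 # erase the original goal peak
            patched[:, c, target_row, target_col] = PEAK_VALUE  # write a peak at the new target
        return patched  # returning a tensor replaces this layer's output downstream
    return hook

# Usage sketch (`policy.goal_layer` stands in for whichever convolutional layer
# carries the goal-tracking channels):
# handle = policy.goal_layer.register_forward_hook(make_retargeting_hook(row, col))
# logits = policy(observation)
# handle.remove()
```

Sweeping an intervention like this over every open square, and recomputing the normalized path probability each time, produces retargeted heatmaps like the ones above.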

Notice the path from the bottom-left (where the mouse always starts) to the top-right corner. We call this path the top-right path. These heatmaps suggest that it's harder to retarget the mouse to squares farther from the top-right path. Quantitative analysis bears out this intuition:

A line graph plots the "Probability of Successful Retargeting" against the target's "Distance from Top Right Path": the probability starts at 0.9 for zero distance, drops sharply to below 0.3 by distance 25, and levels off around 0.2 for distances up to 50.
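
The distance on the x-axis can be made precise in more than one way; here is a minimal sketch, assuming it is the maze's step-distance from the target square to the nearest square on the top-right path, computed by breadth-first search over open squares:

```python
from collections import deque

def steps_to_path(maze_open: list[list[bool]], target: tuple[int, int],
                  top_right_path: set[tuple[int, int]]) -> int:
    """Shortest-path distance (in maze steps) from `target` to the nearest square
    on the top-right path, assuming `maze_open[r][c]` marks walkable squares."""
    rows, cols = len(maze_open), len(maze_open[0])
    frontier = deque([(target, 0)])
    seen = {target}
    while frontier:
        (r, c), dist = frontier.popleft()
        if (r, c) in top_right_path:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and maze_open[nr][nc] and (nr, nc) not in seen:
                seen.add((nr, nc))
                frontier.append(((nr, nc), dist + 1))
    return -1  # unreachable (shouldn't happen in a connected maze)
```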

A line graph plotting the "Ratio of Successful Retargeting" versus "Step-Distance from Top Right Path." The ratio peaks at 5.0 for targets near the path, then decreases with distance but remains above 1.0, indicating that retargeting always increases the probability of reaching a square.
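
One natural way to define this ratio is the probability of reaching the target under the retargeted policy divided by the probability under the unmodified policy; under that reading, values above 1 mean the intervention made the square more likely to be visited:

$$\text{ratio}(x) = \frac{P(\text{reach } x \mid \text{retargeted policy})}{P(\text{reach } x \mid \text{unmodified policy})}.$$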

A line graph shows that as maze size increases, the average probability of retargeting success decreases. It compares three interventions: "All Cheese Channels" (highest success), followed by "Effective Channels," and "Channel 55" (lowest), showing that modifying more channels is more effective.

Overall, these new results quantify how well we can control the policy via the internal goal representations we identified.

Thanks

Thanks to Lisa Thiergart for helping handle funding and set up the project. Thanks to the LTFF and Lightspeed Grants for funding this project.

