Table of contents
 Impressions from trajectory videos
 Statistically informed impressions
 Procedure and detailed results
 Operationalizing intuitive maze properties
 Individual regression results: cheese-to-decision-square and cheese-to-top-right distances are predictive
 Variables are highly correlated, so we are on rocky statistical terrain
 Finding stably predictive variables with multiple regressions
 Testing redundancy between spatial and stepwise distances
 Conclusion
 Footnotes
Understanding and controlling a maze-solving policy network analyzed a maze-solving agent’s behavior. We isolated four maze properties which seemed to predict whether the mouse goes towards the cheese or towards the top-right corner:
In this post, we conduct a more thorough statistical analysis, addressing issues of multicollinearity. We show strong evidence that (2) and (3) above are real influences on the agent’s decision-making. We show weak evidence that (1) is also a real influence. As we speculated in the original post,^{1} (4) falls away as a statistical artifact.
Contributions: Peli did the stats work and drafted the post, while Alex provided feedback, expanded the visualizations, and ran additional tests for multicollinearity. Some of this work was completed in Team Shard under SERI MATS 3.0.
Watching trajectory videos from Langosco et al.’s experiment, we developed a few central intuitions about how the agent behaves. In particular, we tried predicting what the agent does at decision squares.
Some mazes are easy to predict, because the cheese is on the way to the top-right corner. There’s no decision square where the agent has to make the hard choice between the paths to the cheese and to the top-right corner:
Here are four central intuitions which we developed:
 Closeness between the mouse and the cheese makes cheese-getting more likely
 Closeness between the mouse or cheese and the top-right makes cheese-getting more likely
 The effect of closeness is smooth
 Both ‘spatial’ distances and ‘legal steps’ distances matter when computing closeness in each case
The videos we studied are hard to interpret without quantitative tools, so we regard these intuitions as theoretically motivated impressions rather than as observations. We wanted to precisify and statistically test these impressions, with an eye to their potential theoretical significance.
We suspect that the agent’s conditions for pursuing cheese generalize properties of historically reinforced cheese-directed moves in a very “soft” way. Consider that movements can be “directed” on paths towards the cheese, the top-right corner, both, or neither. In the training environment, unambiguously cheese-directed movements are towards a cheese square that is both close to the mouse’s current position and close to the top-right.^{2}
Our impression is that in the test environment, “closeness to top-right” and “closeness to cheese” each become a decision factor that encourages cheese-directed movement in proportion to “how strongly” the historical condition holds at present. In shard theory terminology, the top-right- and cheese-shards seem to activate more strongly in situations which are similar to historical reinforcement events.
A second important aspect of our impressions was that the generalization process “interprets” each historical condition in multiple ways. For example, it seemed to us that multiple kinds of distance between the decision-square and cheese may each have an effect on the agent’s decision-making.
Our revised, precisified impressions about the agent’s behavior on decision-squares are as follows:
 Legal-steps closeness between the mouse and the cheese makes cheese-getting more likely
 Low $d_{step}(decisionsquare,cheese)$ increases P(cheese acquired)
 Spatial closeness between the cheese and the top-right makes cheese-getting more likely
 Low $d_{Euclidean}(cheese,topright)$ increases P(cheese acquired)
 The effect of closeness is fairly smooth
 These distances smoothly affect P(cheese acquired), without rapid jumps or thresholding
 Spatial closeness between the mouse and the cheese makes cheese-getting slightly more likely, even after controlling for legal-steps closeness (low confidence)
After extensive but non-rigorous statistical analysis (our stats consultant tells us there are no low-overhead rigorous methods applicable to our situation), we believe that we have strong quantitative evidence in favor of versions of impressions 1) through 3), and weak quantitative evidence in favor of a version of impression 4).
Because our statistical procedure is non-rigorous, we are holding off on drawing strong conclusions from these impressions until we have a more robust, mechanistic-interpretability-informed understanding of the underlying dynamics.
One question that interests us, however, is whether these impressions point to a decision-making process that is more ‘shard-like’ than ‘utility-theoretic’ in character. When we originally studied test-run videos, we wondered whether the apparent “closeness effects” could be explained by a simple utility function with time-discounting (for example, a fixed-value cheese-goal and a fixed-value corner-goal). The evidence that at least some spatial closeness effects are irreducible to legal-steps closeness seems to rule out such simple utility functions, since only legal-steps closeness matters for time-discounting:
Our current intuition is that a predictively strong utility function needs to incorporate spatial distances in multiple complex ways.
We think the complex influence of spatial distances on the network’s decision-making might favor a ‘shard-like’ description: a description of the network’s decisions as coalitions between heuristic submodules whose voting power varies based on context. While this is still an underdeveloped hypothesis, it’s motivated by two lines of thinking.
First, we weakly suspect that the agent may be systematically^{3} dynamically inconsistent from a utility-theoretic perspective. That is, the effects of $d_{step}(mouse,cheese)$ and (potentially) $d_{Euclidean}(cheese,topright)$ might turn out to call for a behavior model where the agent’s priorities in a given maze change based on the agent’s current location.
Second, we suspect that if the agent is dynamically consistent, a shard-like description may allow for a more compact and natural statement of an otherwise very gerrymandered-sounding utility function, one that fixes the value of cheese and top-right in a maze based on a “strange” mixture of maze properties. It may be helpful to look at these properties in terms of similarities to the historical activation conditions of different submodules that favor different plans.^{4}
While we consider our evidence suggestive in these directions, it’s possible that some simple but clever utility function will turn out to be predictively successful. For example, consider our two strongly observed effects: $d_{Euclidean}(cheese,topright)$ and $d_{step}(decisionsquare,cheese)$. We might explain these effects by stipulating that:
 On each turn, the agent receives value inversely related to its distance from the top-right,
 Sharing a square with the cheese adds constant value,
 The agent doesn’t know that getting to the cheese ends the game early, and
 The agent time-discounts.
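A minimal sketch of the value such a model assigns to a trajectory (the function name, functional form, and constants here are illustrative assumptions, not the agent’s actual objective):

```python
def trajectory_value(dists_to_topright, on_cheese_flags, gamma=0.95, cheese_bonus=1.0):
    """Hypothetical discounted value of a trajectory under the stipulations above:
    each step yields reward inversely related to distance from the top-right,
    plus a constant bonus while sharing a square with the cheese. `gamma` and
    `cheese_bonus` are illustrative free parameters."""
    return sum(
        gamma**t * (1.0 / (1.0 + d) + (cheese_bonus if on_cheese else 0.0))
        for t, (d, on_cheese) in enumerate(zip(dists_to_topright, on_cheese_flags))
    )
```

Under a model of this kind, reaching the cheese still competes with approaching the top-right, because both contributions enter the same discounted sum.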
We’re somewhat skeptical that models of this kind will hold up once you crunch the numbers and look at scenario-predictions, but they deserve a fair shot.
We hope to revisit these questions rigorously when our mechanistic understanding of the network has matured.
Note: Our analysis can be run in this Colab.
Our first step in statistically evaluating our initial impressions about the network’s behavior was to operationalize the concepts featured in those impressions. Since we suspected that the training process generalizes historically significant properties in multiple simultaneous ways, we came up with multiple operationalizations of each relevant concept where possible:
 “Top-right”
  top-right maze square
  or the 5×5 square area anchored at the top-right maze square
 “Distance”
  legal-steps distance
  or Euclidean distance
 “Distance to top-right”
  cheese closeness to top-right
  or decision-square closeness to top-right
 “Distance to cheese”
  decision-square closeness to cheese
Our next step was to generate every operationalization of ‘closeness to top-right’ and ‘closeness to cheese’ that we could construct using these concepts, and run a logistic regression on each to measure its power to predict whether the agent gets the cheese.^{5}
We generated 10,000 trajectories (each in a different random seed) and screened them for levels which actually contain a decision-square. We were left with 5,239 levels meeting this criterion. We trained a regression model to predict whether the agent gets the cheese in any given seed. The baseline performance (the better of the constant predictions “always cheese” and “never cheese”) gets an accuracy of 71.4%.
We performed logistic regression on each variable mentioned above, using the set of 10,000 runs with a randomized 80% training / 20% validation split, averaged over 1,000 trials. That is, we train a regression model on a single variable and see what its validation accuracy is.
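This per-variable procedure can be sketched as follows (a sketch, not our exact Colab code; it assumes the feature and outcome are loaded as NumPy arrays, and uses scikit-learn for the regression):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def single_variable_accuracy(x, y, n_trials=1000, seed=0):
    """Mean validation accuracy of a one-variable logistic regression,
    averaged over randomized 80/20 train/validation splits.
    x: (n,) array holding one distance feature; y: (n,) binary cheese-getting."""
    rng = np.random.RandomState(seed)
    accuracies = []
    for _ in range(n_trials):
        x_tr, x_va, y_tr, y_va = train_test_split(
            x.reshape(-1, 1), y, test_size=0.2, random_state=rng.randint(2**31)
        )
        model = LogisticRegression().fit(x_tr, y_tr)
        accuracies.append(model.score(x_va, y_va))
    return float(np.mean(accuracies))
```

Averaging over many random splits smooths out the split-to-split noise in a single 20% validation set.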
Out of 11 variables, 6 variables beat the ‘no regression’ accuracy baseline of 71.4%:
| Variable | Prediction accuracy |
|---|---|
| Euclidean distance between cheese and top-right 5×5 | 0.775 |
| Euclidean distance between cheese and top-right square | 0.773 |
| Euclidean distance between cheese and decision-square | 0.761 |
| Steps between cheese and decision-square | 0.754 |
| Steps between cheese and top-right 5×5 | 0.735 |
| Steps between cheese and top-right square | 0.732 |
The remaining 5 variables were worse than nothing:
| Variable | Prediction accuracy |
|---|---|
| Cheese coordinates norm | 0.713 |
| Euclidean distance between decision-square and top-right square | 0.712 |
| Steps between decision-square and top-right square | 0.709 |
| Steps between decision-square and top-right 5×5 | 0.708 |
| Euclidean distance between decision-square and top-right 5×5 | 0.708 |
Note that in these individual regressions, all successfully predictive variables have a negative coefficient—this makes sense, since the variables measure distance and our impression was that various forms of closeness motivate cheese-getting.
As we move on to multiple regressions to try finding out which variables drive these results, we have to work carefully: our various operationalizations of ‘closeness’ in the mazes are inevitably pretty correlated.
Our stats consultant cautioned:

> I’d be [wary] about interpreting the regression coefficients of features that are correlated (see Multicollinearity). Even the sign may be misleading.
>
> It might be worth making a cross-correlation plot of the features. This won’t give you new coefficients to put faith in, but it might help you decide how much to trust the ones you have. It can also be useful to look at how unstable the coefficients are during training (or e.g. when trained on a different dataset).
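Following that advice, a minimal cross-correlation check might look like this (a sketch; the DataFrame column names stand in for our distance features):

```python
import pandas as pd

def correlated_pairs(features: pd.DataFrame, threshold: float = 0.8):
    """Return (name_a, name_b, r) for feature pairs whose absolute
    Pearson correlation exceeds `threshold`."""
    corr = features.corr()
    return [
        (a, b, corr.loc[a, b])
        for i, a in enumerate(corr.columns)
        for b in corr.columns[i + 1:]
        if abs(corr.loc[a, b]) > threshold
    ]
```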
There is indeed a strong correlation between two of our highly predictive variables:
We then computed the variance inflation factors (VIF) for the three predictive variables we end up analyzing in detail. VIF measures how collinearity increases the variance of the regression coefficients. A score exceeding 4 is considered a warning sign of multicollinearity.
| Attribute | VIF |
|---|---|
| Euclidean distance between cheese and top-right square | 1.05 |
| Steps between cheese and decision-square | 4.64 |
| Euclidean distance between cheese and decision-square | 4.66 |
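VIF can be computed directly from its definition: regress each feature on the remaining features and take $1/(1-R^2)$. A sketch (statsmodels’ `variance_inflation_factor` computes the same quantity):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif_scores(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    of X on the remaining columns. Assumes X has at least two columns."""
    vifs = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs
```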
Our statistician friend suggested that in situations like this it’s most instructive to look at which individually predictive variables affect prediction accuracy when we add/drop them in a multiple regression, watching out for sign-flips. The procedure isn’t fully rigorous, but since much of our evidence is backed by qualitative ‘maze-editing’ experiments and domain knowledge, we are relatively confident in some conclusions.
Let’s take the predictively successful variables from the individual regressions—the variables that scored better than ‘no-regression’—and perform an $L_1$-regularized multiple regression to see which variables remain predictive without sign-flipping.
| Attribute | Coefficient |
|---|---|
| Steps between cheese and top-right 5×5 | 0.003 |
| Euclidean distance between cheese and top-right 5×5 | 0.282 |
| Steps between cheese and top-right square | 1.142 |
| Euclidean distance between cheese and top-right square | 2.522 |
| Steps between cheese and decision-square | 1.200 |
| Euclidean distance between cheese and decision-square | 0.523 |
| Intercept | 1.418 |
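A sketch of how such an $L_1$-regularized fit can be produced with scikit-learn (the regularization strength `C` is an assumption; stronger regularization shrinks more coefficients to zero):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def l1_logistic_fit(X: np.ndarray, y: np.ndarray, C: float = 1.0):
    """L1-penalized logistic regression. The L1 penalty drives redundant or
    weakly predictive coefficients toward zero, which is what makes the
    'who survives without sign-flipping' check meaningful."""
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X, y)
    return model.coef_[0], model.intercept_[0]
```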
We see that three of our individually predictive variables made it through without a sign-flip:
1. Euclidean distance from cheese to top-right square
2. Legal-steps distance from decision-square to cheese
3. Euclidean distance from decision-square to cheese
Variables 1–3) line up with our best guesses about mechanisms based on informal observation and (messy) exploratory statistics, so it’s good news that the simple procedure ‘check which individually significant variables don’t sign-flip’ recovers them.
These are also the three main features which we noted in the original post. (We had noted that the fourth feature $d_{Euclidean}(decisionsquare,5x5)$ has a strange, positive regression coefficient, which we thought was probably an artifact. Our further analysis supports that initial speculation.)
We’ve repeated this particular test dozens of times and got very consistent results: individually predictive variables outside 1–3) always go near zero or sign-flip. Results also remained consistent on a second batch of 10,000 test-runs. Considering a range of regressions on a range of train/validation splits, the regression coefficient signs of 1–3) are very stable. The magnitudes^{6} of the regression coefficients fluctuate a bit across regressions and splits, but are reasonably stable.
Furthermore, we regressed upon 200 random subsets of our variables, and the cheese/decision-square distance regression coefficients never experienced a sign-flip. The cheese/top-right Euclidean distance had a few sign-flips. Other variables sign-flip much more frequently.
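This subset-stability audit can be sketched as follows (a sketch with names of our own choosing; the audited variable is always kept in the subset, and we count how often its coefficient comes out positive, given that every distance was individually negative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sign_flip_fraction(X, y, var_idx, n_subsets=200, seed=0):
    """Fit a logistic regression on random variable subsets that always
    include column `var_idx`; return the fraction of fits in which that
    variable's coefficient is positive (a sign-flip for a distance)."""
    rng = np.random.RandomState(seed)
    flips = 0
    for _ in range(n_subsets):
        mask = rng.rand(X.shape[1]) < 0.5
        mask[var_idx] = True  # always include the audited variable
        cols = np.flatnonzero(mask)
        coefs = LogisticRegression(max_iter=1000).fit(X[:, cols], y).coef_[0]
        if coefs[int(np.where(cols == var_idx)[0][0])] > 0:
            flips += 1
    return flips / n_subsets
```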
We consider this to be strong evidence against multicollinearity having distorted our original regressions.
Are variables 1–3) ‘enough’ to explain the network’s behavior? Let’s see how much predictive accuracy we retain when regressing only on 1–3).
| Attribute | Coefficient |
|---|---|
| Euclidean distance between cheese and top-right square | 1.405 |
| Steps between cheese and decision-square | 0.577 |
| Euclidean distance between cheese and decision-square | 0.516 |
| Intercept | 1.355 |
There is a 1.7% accuracy drop compared to the original multiple regression. Unfortunately, it’s hard to interpret this accuracy gap in terms of the contributions of individual variables outside 1–3). Adding practically any 4th variable to 1–3) delivers big accuracy gains which don’t additively accrue when the variables are combined, and the new variable’s sign is often flipped relative to its single-regression sign.
See for example 1–3) + ‘legal steps from cheese to top-right square’:
| Attribute | Coefficient |
|---|---|
| Steps between cheese and top-right square | 1.099 |
| Euclidean distance between cheese and top-right square | 2.181 |
| Steps between cheese and decision-square | 1.211 |
| Euclidean distance between cheese and decision-square | 0.515 |
| Intercept | 1.380 |
Or 1–3) + ‘legal steps from cheese to top-right square’ + ‘Euclidean distance from decision-square to top-right 5×5’:
| Attribute | Coefficient |
|---|---|
| Euclidean distance between decision-square and top-right 5×5 | 1.239 |
| Steps between cheese and top-right square | 0.038 |
| Euclidean distance between cheese and top-right square | 2.652 |
| Steps between cheese and decision-square | 0.911 |
| Euclidean distance between cheese and decision-square | 0.419 |
| Intercept | 1.389 |
Our instinct is therefore to avoid interpreting variables like ‘Euclidean distance from decision-square to top-right 5×5’ or ‘legal steps distance from cheese to top-right square.’ Additional experimentation shows that these variables are only predictive in settings where they sign-flip relative to their single-regression coefficients, that their predictive powers don’t stack, and that their statistical effects do not correspond to any intuitive mechanism.
Let’s get back to our claimed predictive variables:
1. Euclidean distance from cheese to top-right square
2. Legal-steps distance from decision-square to cheese
3. Euclidean distance from decision-square to cheese
How sure should we be that variables 1–3) each track a real and distinct causal mechanism?
For variables 1) and 2), we have extensive though non-rigorous experience making manual maze-edits that decrease/increase cheese-getting by changing the relevant distance with minimal logical side-effects. For example, increasing the number of legal steps from decision-square to cheese while keeping all Euclidean distances the same reliably reduces the probability that the agent moves in the cheese direction:^{7}
Our experience making similar maze-edits for variable 3) has been mixed and limited, as such edits are harder to produce. Still, the results of edits that manipulate 3) are often suggestive (if hard to interpret).
Keeping these qualitative impressions in mind, let’s test variables 1–3) for statistical redundancy by dropping variables and seeing how that impacts accuracy.
| Regression variables | Accuracy |
|---|---|
| $d_{Euclidean}(cheese,topright)$, $d_{step}(cheese,decisionsquare)$, $d_{Euclidean}(cheese,decisionsquare)$ | 82.4% |
| $d_{step}(cheese,decisionsquare)$, $d_{Euclidean}(cheese,decisionsquare)$ | 75.9% |
| $d_{Euclidean}(cheese,topright)$, $d_{Euclidean}(cheese,decisionsquare)$ | 81.9% |
| $d_{Euclidean}(cheese,topright)$, $d_{step}(cheese,decisionsquare)$ | 81.7% |
| $d_{Euclidean}(cheese,topright)$ | 77.3% |
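Comparisons of this drop-one kind can be generated with a simple loop (a sketch with illustrative names, not our exact code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def drop_one_accuracies(X, y, names, seed=0):
    """Validation accuracy of the full model and of each model with one
    variable dropped: a simple probe for statistical redundancy."""
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=seed)
    results = {"all": LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_va, y_va)}
    for j, name in enumerate(names):
        keep = [k for k in range(X.shape[1]) if k != j]
        model = LogisticRegression(max_iter=1000).fit(X_tr[:, keep], y_tr)
        results[f"without {name}"] = model.score(X_va[:, keep], y_va)
    return results
```

If dropping a variable barely moves the validation accuracy, the remaining variables carry (almost) the same information.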
Considering our qualitative and statistical results together, we are confident that $d_{step}(cheese,decisionsquare)$ tracks a real decision influence.
We weakly believe that $d_{Euclidean}(cheese,decisionsquare)$ tracks an additional real decision influence. More evidence for this is that removing either of the cheese/decision-square distances causes a comparable accuracy drop. And we’re already confident that $d_{step}(cheese,decisionsquare)$ tracks a real decision influence!
Our biggest source of doubt about $d_{Euclidean}(cheese,decisionsquare)$ is that when running the regression on another independent batch of 10,000 test-runs, we found no loss at all when dropping this variable from 1–3). This was surprising, since we were otherwise able to reproduce all our qualitative results (e.g. rankings of variables’ predictive strength, sign-flipping patterns) across sample batches.^{8}
Our statistics refine, support, and stress-test our impressions about the network’s behavior. This behavior seems more easily describable using a shard theory frame than a utility frame. We think our statistical results are not artifacts of multicollinearity, but hold up quite well.^{9}
However, the statistics are not fully rigorous, and this post’s analysis involved freeform, domain-specific reasoning. That said, we are overall very confident that the agent is influenced by $d_{Euclidean}(cheese,topright)$ and by $d_{step}(cheese,decisionsquare)$. We have weak but suggestive evidence for additional influence from $d_{Euclidean}(cheese,decisionsquare)$.

[Regression factor] (4) is an interesting outlier which probably stems from not using a more sophisticated structural model for the regression. ⤴

Counterexamples are possible but likely to be statistically insignificant. We haven’t formally checked whether counterexamples can be found in the training set. ⤴

We think it’s clear that the agent cannot be perfectly characterized by any reasonable utility-theoretic description, let alone a time-consistent utility function over state variables like “cheese” and “top-right.” What’s at stake here is the question of the best systematic approximation of the agent’s behavior. ⤴

The question ‘does the agent have the same goal at every timestep in a given maze?’ requires looking at more than one timestep in a given maze. Therefore, statistics on the agent’s behavior on the decision-square alone cannot distinguish between a dynamically inconsistent agent and an equilibrated agent whose utility function has a shard-like explanation.

However, action-probability vector field plots display information about all possible maze locations. These plots are a valuable source of evidence on whether the agent is dynamically consistent. ⤴

We also added one more variable: the norm of the cheese’s coordinates in the network’s receptive field. The norm represents a “minimalist” interpretation of the effect of cheese-closeness to the top-right. (The top-right square of the maze varies level to level and requires sophisticated global computations to identify, whereas coordinate information is static.) ⤴

We don’t mean for our analysis to be predicated on the magnitudes of the regression coefficients. We know these are unreliable and contingent quantities! We mentioned their relative stability more as diagnostic evidence. ⤴

Our manual interventions look directly at the probability of making a first move towards cheese at the decision-square, rather than at the frequency of cheese-getting. This is especially useful when studying the influence of legal-steps distance, since the effect on cheese-getting could be an artifact of the shorter chain of ‘correct’ stochastic outcomes required to take the cheese when the step-distance is short. ⤴

We suspect that we would observe a clearer effect for $d_{Euclidean}(cheese,decisionsquare)$ if we did statistics on action logits around the decision-square instead of on cheese-getting frequencies, but there’s substantial overhead to getting these statistics. ⤴

The main thing Alex would have changed about the original post is to not make the $d_{Euclidean}(cheese,decisionsquare)$ influence a headline result (in the summary). ⤴