The residual stream norm grows exponentially over the forward pass, with a growth rate of about 1.05 per layer. Consider the residual stream at layer 0, with norm (say) 100. Suppose the mlp sublayer at layer 0 has outputs of norm (say) 5. Then after 30 layers, the residual stream norm will be about $100 \cdot 1.05^{30} \approx 432$. The mlp-0 outputs of norm 5 should therefore have a significantly reduced effect on the computations of mlp-30 due to their smaller relative norm.
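To make the decay concrete (treating norms as roughly additive, which is only a heuristic since the contributions are generally not aligned):

$$
\frac{5}{100 + 5} \approx 4.8\% \ \text{(at layer 1)} \qquad \text{vs.} \qquad \frac{5}{100 \cdot 1.05^{30}} \approx 1.2\% \ \text{(at layer 30)},
$$

so the relative weight of mlp-0's contribution falls by roughly a factor of four over those 30 layers.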

On input tokens $x$, let $y_i(x)$ be the original model’s sublayer outputs at layer $i$. I want to think about what happens when the later sublayers can only “see” the last few layers’ worth of outputs.

Definition: Layer-truncated residual stream

A truncated residual stream from layer $n_1$ to layer $n_2$ is formed by summing the original sublayer outputs from those layers: $h_{n_1:n_2}(x) := \sum_{i=n_1}^{n_2} y_i(x)$.

Definition: Effective layer horizon

Let $k > 0$ be an integer. Suppose that for all layers $\ell \geq k$, we patch in $h_{\ell-k:\ell-1}(x)$ for the usual residual stream inputs $h_{0:\ell-1}(x)$.1 Let the effective layer horizon be the smallest $k$ for which the model’s outputs and/or capabilities are “qualitatively unchanged.”
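As a concrete instance of the definition (in the notation above): with $k = 3$, the sublayers at layer 10 would read $h_{7:9}(x) = y_7(x) + y_8(x) + y_9(x)$ instead of the full stream $h_{0:9}(x)$.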

Effective layer horizons (if they exist) would greatly simplify searches for circuits within models. Additionally, they would be evidence against hypotheses like Residual Networks Behave Like Ensembles of Relatively Shallow Networks because serial circuits would need to be deep:

A low effective layer horizon implies that later layers build more on the outputs of intermediate layers. In one extreme, a network with an effective layer horizon of 1 would only consist of circuits that route through every single layer. Likewise, for there to be any extremely shallow circuits that route directly from the inputs to the final layer, the effective layer horizon must be the full depth of the network.

Lastly, faster norm growth probably causes the effective layer horizon to be lower, since earlier outputs shrink relative to the stream more quickly. In that case, simply measuring residual stream norm growth would tell you a lot about the depth of circuits in the model, which could be useful if you want to regularize against circuit depth or otherwise decrease it (e.g. to decrease the amount of effective serial computation).

Question

Do models have an effective layer horizon? If so, what is it, as a function of model depth and other factors—are there scaling laws?

To measure the importance of sublayer contributions originating much earlier in the forward pass, Joseph Miller modified the forward pass so that each sublayer reads a residual stream formed only from the most recent sublayer outputs, rather than all of them. He then measured how loss changes as a function of the enforced layer horizon $k$. A larger loss increase means the discarded information was more important. On the other hand, if you can remove everything but the last three layers’ worth of outputs and suffer only a minimal loss increase, the earlier outputs evidently aren’t very important beyond a few layers.
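As a rough illustration of the setup, here is a minimal sketch of enforcing a layer horizon $k$ with TransformerLens hooks. This is my reconstruction, not Joseph Miller’s actual implementation (his linked source code is the real reference): it treats the embeddings as the stream “before layer 0,” rebuilds each sublayer’s input from the previous $k$ layers’ outputs as computed during the same modified forward pass, and makes one arbitrary choice (flagged in the comments) about what the mlp sublayer sees.

```python
import torch
from transformer_lens import HookedTransformer


def loss_with_layer_horizon(model: HookedTransformer, tokens: torch.Tensor, k: int) -> float:
    """Loss when every sublayer reads only the previous k layers' outputs."""
    saved: dict[str, torch.Tensor] = {}  # activations recorded so far in this pass

    def record(act, hook):
        saved[hook.name] = act  # returning None leaves the activation unchanged

    def stream_from_horizon(layer: int) -> torch.Tensor:
        # Sum of the outputs of layers layer-k .. layer-1, plus the embedding
        # stream if it still falls within the horizon.
        start = layer - k
        total = sum(
            saved[f"blocks.{j}.hook_attn_out"] + saved[f"blocks.{j}.hook_mlp_out"]
            for j in range(max(start, 0), layer)
        )
        if start < 0:
            total = saved["blocks.0.hook_resid_pre"] + total  # token + positional embeddings
        return total

    def truncate_attn_input(resid_pre, hook):
        layer = int(hook.name.split(".")[1])
        return stream_from_horizon(layer)

    def truncate_mlp_input(resid_mid, hook):
        layer = int(hook.name.split(".")[1])
        # Arbitrary choice (cf. footnote 1): the mlp sublayer also sees the
        # current layer's attention output, as it would in the unmodified model.
        return stream_from_horizon(layer) + saved[f"blocks.{layer}.hook_attn_out"]

    fwd_hooks = [
        ("blocks.0.hook_resid_pre", record),  # the embedding stream
        (lambda name: name.endswith(("hook_attn_out", "hook_mlp_out")), record),
        (
            lambda name: name.endswith("hook_resid_pre") and not name.startswith("blocks.0."),
            truncate_attn_input,
        ),
        (lambda name: name.endswith("hook_resid_mid"), truncate_mlp_input),
    ]
    return model.run_with_hooks(tokens, return_type="loss", fwd_hooks=fwd_hooks).item()


model = HookedTransformer.from_pretrained("gpt2")  # or "gpt2-xl", memory permitting
tokens = model.to_tokens("When Mary and John went to the store, John gave a drink to Mary.")
for k in (1, 2, 4, 8, model.cfg.n_layers):
    print(f"layer horizon {k}: loss {loss_with_layer_horizon(model, tokens, k):.3f}")
```

If an effective layer horizon exists, the printed losses should approach the unmodified loss well before $k$ reaches the full depth of the model.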

Joseph Miller reports that gpt-2 small seems too small to exhibit an effective layer horizon, so he then ran the same experiment on gpt-2-xl:

We clearly see the same pattern again. As TurnTrout predicted, there seems to be something like an exponential decay in the importance of previous layers as you go further back. I expect that on large models the effective layer horizon is an important consideration. (Source code)

However, there is a confounder. Suppose the layer horizon is 1, and consider the computation performed by layer 0’s attention and mlp sublayers. Because layer 0 already reads everything available to it (namely the embeddings), its computation is not affected if the layer horizon increases to 2. More generally, all layers up through layer $n$ are not affected by the layer horizon at all, as long as the horizon is strictly greater than $n$. In a model like gpt-2-xl with 48 layers, shifts between large layer horizons are therefore less meaningful, because only the layers past the horizon are constrained at all.
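For example (just instantiating the point above), with a horizon of 40 in gpt-2-xl, layers 0 through 39 read exactly what they would read in the unmodified model; only layers 40 through 47 are constrained, so moving the horizon from 40 to 45 changes the inputs of at most a handful of layers.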

One way of controlling for this effect is to decide, “I’m going to test layer horizons up to 20”, and then only enforce those layer horizons on layers 20 and after. However, that design wouldn’t let us study the layer horizons of the layers before 20.
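In the sketch above, that control would be a small tweak: gate the truncation hooks on a `start_layer` cutoff (a name I’m making up here). The fragment below is meant to replace the `fwd_hooks` construction inside `loss_with_layer_horizon`, reusing its `record`, `truncate_attn_input`, and `truncate_mlp_input` helpers.

```python
# Inside loss_with_layer_horizon: only enforce the horizon on layers >= start_layer,
# so every layer we truncate is genuinely constrained for any k <= start_layer.
start_layer = 20


def layer_of(name: str) -> int:
    return int(name.split(".")[1])


fwd_hooks = [
    ("blocks.0.hook_resid_pre", record),
    (lambda name: name.endswith(("hook_attn_out", "hook_mlp_out")), record),
    (
        lambda name: name.endswith("hook_resid_pre") and layer_of(name) >= max(start_layer, 1),
        truncate_attn_input,
    ),
    (
        lambda name: name.endswith("hook_resid_mid") and layer_of(name) >= start_layer,
        truncate_mlp_input,
    ),
]
```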


  1. For notational ease, I’m glossing over the fact that we’d be patching in different residual streams for each sublayer of layer $\ell$. That is, we wouldn’t patch in the same activations for both the attention and mlp sublayers of layer $\ell$.