https://transformer-circuits.pub/2021/framework/index.html
<aside> 💡 How to understand or interpret transformers?
</aside>
We can conceptually think of a transformer as organized around a linear pathway through the network, called the residual path or residual stream: a high-dimensional vector space whose subspaces carry the information flowing through the whole transformer.
A residual block is a diversion from the residual stream into a layer, such as an attention layer (in an encoder or a decoder), which reads from the stream and adds its output back in.
The residual stream can also just flow past a block unchanged, via the skip connection.
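To make the picture concrete, here is a minimal sketch (NumPy, with made-up shapes and a placeholder block computation, not code from the paper) of a residual stream that each block reads from and additively writes back into, while the stream itself flows past:

```python
import numpy as np

# Minimal sketch (illustrative shapes, placeholder block computation):
# the residual stream is a running vector that each block reads from and
# additively writes back into, while the stream also flows past the block
# via the skip connection.
d_model = 16                              # hypothetical width of the residual stream
rng = np.random.default_rng(0)

def block(x, W_in, W_out):
    """Stand-in for a residual block: read from the stream, transform, write back."""
    hidden = np.tanh(W_in @ x)            # block-internal computation (placeholder)
    return W_out @ hidden                 # the vector the block adds to the stream

stream = rng.normal(size=d_model)         # residual stream after the embedding
for _ in range(3):                        # three residual blocks in sequence
    W_in = 0.1 * rng.normal(size=(8, d_model))    # block's input (read) weights
    W_out = 0.1 * rng.normal(size=(d_model, 8))   # block's output (write) weights
    stream = stream + block(stream, W_in, W_out)  # skip path + block's contribution
```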
An especially useful consequence of the residual stream being linear is that one can think of implicit "virtual weights" directly connecting any pair of layers.
<aside> 💡 the output of ONE layer can directly influence another layer, even if the two are not adjacent.
</aside>
In a deep network with residual connections, information from one layer can influence layers that are not directly adjacent. When a block's result is added back into the residual stream, the output weights it applied determine what it writes there, and that contribution persists and cascades down the stream. A later block's input weights are then applied to this mixed stream, filtering and transforming the accumulated information.
The virtual weights capture this indirect influence between non-adjacent layers: the virtual weights from one layer to a later one are just the product of the later layer's input weights with the earlier layer's output weights. They are not physical weights stored in the network, but a conceptual way of understanding how information flows and interacts through the network's architecture.
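As a sketch of that idea (hypothetical shapes, not the paper's notation), the virtual weights between an earlier block's write and a later block's read are just the matrix product of the later block's input weights with the earlier block's output weights:

```python
import numpy as np

# Minimal sketch (hypothetical shapes): because every block writes into the
# stream additively and the stream is linear, the "virtual weights" from an
# earlier block to a later one are the product of the later block's input
# (read) weights with the earlier block's output (write) weights.
d_model, d_early, d_late = 16, 8, 8
rng = np.random.default_rng(1)

W_out_early = rng.normal(size=(d_model, d_early))  # earlier block writes into the stream
W_in_late = rng.normal(size=(d_late, d_model))     # later block reads from the stream

W_virtual = W_in_late @ W_out_early                # (d_late, d_early): direct early -> late map

h_early = rng.normal(size=d_early)                 # some output of the earlier block
same = np.allclose(W_in_late @ (W_out_early @ h_early), W_virtual @ h_early)
print(same)                                        # True: the two routes agree
```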