https://transformer-circuits.pub/2021/framework/index.html
<aside> 💡 How to understand or interpret transformers?
</aside>
We can conceptually think of a transformer as organized around a linear pathway through the network, called the residual path or residual stream: a high-dimensional vector space whose subspaces carry the information flowing through the whole transformer.
A residual block is a diversion from the residual stream into a layer, such as an attention layer (in an encoder or a decoder), which reads from the stream and adds its output back in.
The residual stream can also just flow past a block unchanged, via the skip connection.
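To make the picture concrete, here is a minimal sketch (NumPy, with made-up shapes and a placeholder block computation, not code from the paper) of a residual stream that each block reads from and additively writes back into, while the stream itself flows past:

```python
import numpy as np

# Minimal sketch (illustrative shapes, placeholder block computation):
# the residual stream is a running vector that each block reads from and
# additively writes back into, while the stream also flows past the block
# via the skip connection.
d_model = 16                              # hypothetical width of the residual stream
rng = np.random.default_rng(0)

def block(x, W_in, W_out):
    """Stand-in for a residual block: read from the stream, transform, write back."""
    hidden = np.tanh(W_in @ x)            # block-internal computation (placeholder)
    return W_out @ hidden                 # the vector the block adds to the stream

stream = rng.normal(size=d_model)         # residual stream after the embedding
for _ in range(3):                        # three residual blocks in sequence
    W_in = 0.1 * rng.normal(size=(8, d_model))    # block's input (read) weights
    W_out = 0.1 * rng.normal(size=(d_model, 8))   # block's output (write) weights
    stream = stream + block(stream, W_in, W_out)  # skip path + block's contribution
```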
An especially useful consequence of the residual stream being linear is that one can think of implicit "virtual weights" directly connecting any pair of layers.
<aside> 💡 the output of ONE layer can directly influence another layer, even if the two are not adjacent.
</aside>
In a deep network with residual connections, information from one layer can influence layers that are not directly adjacent. When a block's result is added back into the residual stream, the output weights it applied determine what it writes there, and that contribution persists and cascades down the stream. A later block's input weights are then applied to this mixed stream, filtering and transforming the accumulated information.
The virtual weights capture this indirect influence between non-adjacent layers: the virtual weights from one layer to a later one are just the product of the later layer's input weights with the earlier layer's output weights. They are not physical weights stored in the network, but a conceptual way of understanding how information flows and interacts through the network's architecture.
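As a sketch of that idea (hypothetical shapes, not the paper's notation), the virtual weights between an earlier block's write and a later block's read are just the matrix product of the later block's input weights with the earlier block's output weights:

```python
import numpy as np

# Minimal sketch (hypothetical shapes): because every block writes into the
# stream additively and the stream is linear, the "virtual weights" from an
# earlier block to a later one are the product of the later block's input
# (read) weights with the earlier block's output (write) weights.
d_model, d_early, d_late = 16, 8, 8
rng = np.random.default_rng(1)

W_out_early = rng.normal(size=(d_model, d_early))  # earlier block writes into the stream
W_in_late = rng.normal(size=(d_late, d_model))     # later block reads from the stream

W_virtual = W_in_late @ W_out_early                # (d_late, d_early): direct early -> late map

h_early = rng.normal(size=d_early)                 # some output of the earlier block
same = np.allclose(W_in_late @ (W_out_early @ h_early), W_virtual @ h_early)
print(same)                                        # True: the two routes agree
```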