https://transformer-circuits.pub/2021/framework/index.html
Privileged basis: when the architecture itself favors certain directions/features (e.g. elementwise ReLU activations make individual neuron dimensions meaningful)
The residual stream does not have a privileged basis: applying any rotation (invertible change of basis) to it, and folding the inverse into the matrices that read from and write to it, won't change the model's behavior
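A minimal numpy sketch of that claim (toy sizes and random weights, my own illustration rather than anything from the paper): rotate the stream by any orthogonal R, absorb R into the adjacent read/write matrices, and the residual update is unchanged up to the change of basis.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 4

x = rng.normal(size=d_model)                  # residual stream vector
W_read = rng.normal(size=(d_head, d_model))   # reads a subspace (like W_Q/W_K/W_V)
W_write = rng.normal(size=(d_model, d_head))  # writes back into the stream (like W_O)

# Any orthogonal change of basis R on the residual stream can be absorbed
# into the adjacent read/write matrices.
R = np.linalg.qr(rng.normal(size=(d_model, d_model)))[0]

out = x + W_write @ (W_read @ x)              # original residual update

x_rot = R @ x                                 # same model expressed in the rotated basis
out_rot = x_rot + (R @ W_write) @ ((W_read @ R.T) @ x_rot)

# Identical behavior up to the change of basis: no direction is privileged.
assert np.allclose(R.T @ out_rot, out)
```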
The residual stream of small models has hundreds of dimensions; in large models it has tens of thousands.
Layers read from and write to the residual stream through linear maps, which is useful because information can be stored in different subspaces of the stream. Later attention heads then receive whatever was written into the subspaces they read from
Information stays in its subspace until something actually deletes it. Kind of like memory
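Toy sketch of that memory picture (hypothetical sizes, orthonormal subspace chosen at random): one component writes a message into a subspace, a later one reads it back, and every other subspace is untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_sub = 16, 4

basis = np.linalg.qr(rng.normal(size=(d_model, d_model)))[0]
mem_dirs = basis[:, :d_sub]    # subspace used as "memory"
other_dirs = basis[:, d_sub:]  # the rest of the stream

x = rng.normal(size=d_model)   # residual stream before the write
message = rng.normal(size=d_sub)

x_after = x + mem_dirs @ message   # an early layer stores the message

# A later layer reading the same subspace recovers the message...
assert np.allclose(mem_dirs.T @ x_after - mem_dirs.T @ x, message)
# ...and every other subspace is exactly what it was before.
assert np.allclose(other_dirs.T @ x_after, other_dirs.T @ x)
```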
The layers have way more computational dimensions (MLP neurons and attention head result dimensions) than the residual stream has, so the stream is a very compact, heavily-used communication channel.
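Rough arithmetic with GPT-2-small-like numbers (d_model=768, 12 layers, 12 heads of 64 dims and 3072 MLP neurons per layer; illustrative, not taken from the paper):

```python
# Illustrative count of computational dimensions vs. residual stream width.
d_model, n_layers = 768, 12
head_dims = 12 * 64          # attention head result dimensions per layer
mlp_neurons = 4 * d_model    # MLP neurons per layer (3072)

compute_dims = n_layers * (head_dims + mlp_neurons)
print(compute_dims, "computational dims flow through a", d_model, "dim residual stream")
# -> 46080 computational dims flow through a 768 dim residual stream
```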
Within a layer, some MLP neurons or attention heads seem to play a memory-management role: they can clear information out of subspaces of the stream, freeing that memory for later layers
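Sketch of that deletion idea, assuming "deleting" just means writing back minus the projection of the stream onto a subspace (a hypothetical component, not one identified in the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_sub = 16, 4

# Hypothetical memory-managing component: its write is minus what it reads,
# clearing one subspace of the stream so it can be reused downstream.
dirs = np.linalg.qr(rng.normal(size=(d_model, d_model)))[0][:, :d_sub]

x = rng.normal(size=d_model)
x_cleared = x - dirs @ (dirs.T @ x)   # subtract the projection onto the subspace

assert np.allclose(dirs.T @ x_cleared, 0)   # that subspace now reads as empty
```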
Attention heads within a layer operate independently and in parallel; concatenating their outputs and applying the output matrix is equivalent to each head adding its own write into the residual stream
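Toy numpy version of that additive-heads view (random weights, single layer, no masking or layernorm, sizes made up for illustration): the layer's contribution is just the sum of independent per-head writes.

```python
import numpy as np

rng = np.random.default_rng(3)
seq, d_model, n_heads, d_head = 5, 16, 4, 4

X = rng.normal(size=(seq, d_model))   # residual stream, one row per token

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

# Toy per-head weights; each head reads and writes the stream independently.
W_Q = rng.normal(size=(n_heads, d_model, d_head))
W_K = rng.normal(size=(n_heads, d_model, d_head))
W_V = rng.normal(size=(n_heads, d_model, d_head))
W_O = rng.normal(size=(n_heads, d_head, d_model))

def head_write(h):
    Q, K, V = X @ W_Q[h], X @ W_K[h], X @ W_V[h]
    A = softmax(Q @ K.T / np.sqrt(d_head))
    return (A @ V) @ W_O[h]           # this head's contribution to the stream

# The layer just adds up the independent per-head writes.
X_next = X + sum(head_write(h) for h in range(n_heads))
```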