https://transformer-circuits.pub/2021/framework/index.html
Privileged basis: when the architecture itself favors certain directions/features (e.g. elementwise ReLU activations make individual neuron dimensions meaningful)
The residual stream does not have a privileged basis: applying any rotation (invertible change of basis) to it, and folding the inverse into the matrices that read from and write to it, won't change the model's behavior
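A minimal numpy sketch of that claim (toy sizes and random weights, my own illustration rather than anything from the paper): rotate the stream by any orthogonal R, absorb R into the adjacent read/write matrices, and the residual update is unchanged up to the change of basis.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 4

x = rng.normal(size=d_model)                  # residual stream vector
W_read = rng.normal(size=(d_head, d_model))   # reads a subspace (like W_Q/W_K/W_V)
W_write = rng.normal(size=(d_model, d_head))  # writes back into the stream (like W_O)

# Any orthogonal change of basis R on the residual stream can be absorbed
# into the adjacent read/write matrices.
R = np.linalg.qr(rng.normal(size=(d_model, d_model)))[0]

out = x + W_write @ (W_read @ x)              # original residual update

x_rot = R @ x                                 # same model expressed in the rotated basis
out_rot = x_rot + (R @ W_write) @ ((W_read @ R.T) @ x_rot)

# Identical behavior up to the change of basis: no direction is privileged.
assert np.allclose(R.T @ out_rot, out)
```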
The residual stream of small models has hundreds of dimensions; in large models it has tens of thousands.
Layers read from and write to the residual stream through linear maps, which is useful because information can be stored in different subspaces of the stream. Later attention heads then receive whatever was written into the subspaces they read from
Information stays in its subspace until something actually deletes it. Kind of like memory
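Toy sketch of that memory picture (hypothetical sizes, orthonormal subspace chosen at random): one component writes a message into a subspace, a later one reads it back, and every other subspace is untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_sub = 16, 4

basis = np.linalg.qr(rng.normal(size=(d_model, d_model)))[0]
mem_dirs = basis[:, :d_sub]    # subspace used as "memory"
other_dirs = basis[:, d_sub:]  # the rest of the stream

x = rng.normal(size=d_model)   # residual stream before the write
message = rng.normal(size=d_sub)

x_after = x + mem_dirs @ message   # an early layer stores the message

# A later layer reading the same subspace recovers the message...
assert np.allclose(mem_dirs.T @ x_after - mem_dirs.T @ x, message)
# ...and every other subspace is exactly what it was before.
assert np.allclose(other_dirs.T @ x_after, other_dirs.T @ x)
```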
The layers have way more computational dimensions (MLP neurons and attention head result dimensions) than the residual stream has, so the stream is a very compact, heavily-used communication channel.
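Rough arithmetic with GPT-2-small-like numbers (d_model=768, 12 layers, 12 heads of 64 dims and 3072 MLP neurons per layer; illustrative, not taken from the paper):

```python
# Illustrative count of computational dimensions vs. residual stream width.
d_model, n_layers = 768, 12
head_dims = 12 * 64          # attention head result dimensions per layer
mlp_neurons = 4 * d_model    # MLP neurons per layer (3072)

compute_dims = n_layers * (head_dims + mlp_neurons)
print(compute_dims, "computational dims flow through a", d_model, "dim residual stream")
# -> 46080 computational dims flow through a 768 dim residual stream
```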
Within a layer, some MLP neurons or attention heads seem to play a memory-management role: they can clear information out of subspaces of the stream, freeing that memory for later layers
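Sketch of that deletion idea, assuming "deleting" just means writing back minus the projection of the stream onto a subspace (a hypothetical component, not one identified in the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_sub = 16, 4

# Hypothetical memory-managing component: its write is minus what it reads,
# clearing one subspace of the stream so it can be reused downstream.
dirs = np.linalg.qr(rng.normal(size=(d_model, d_model)))[0][:, :d_sub]

x = rng.normal(size=d_model)
x_cleared = x - dirs @ (dirs.T @ x)   # subtract the projection onto the subspace

assert np.allclose(dirs.T @ x_cleared, 0)   # that subspace now reads as empty
```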
Attention heads within a layer operate independently and in parallel; concatenating their outputs and applying the output matrix is equivalent to each head adding its own write into the residual stream
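Toy numpy version of that additive-heads view (random weights, single layer, no masking or layernorm, sizes made up for illustration): the layer's contribution is just the sum of independent per-head writes.

```python
import numpy as np

rng = np.random.default_rng(3)
seq, d_model, n_heads, d_head = 5, 16, 4, 4

X = rng.normal(size=(seq, d_model))   # residual stream, one row per token

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

# Toy per-head weights; each head reads and writes the stream independently.
W_Q = rng.normal(size=(n_heads, d_model, d_head))
W_K = rng.normal(size=(n_heads, d_model, d_head))
W_V = rng.normal(size=(n_heads, d_model, d_head))
W_O = rng.normal(size=(n_heads, d_head, d_model))

def head_write(h):
    Q, K, V = X @ W_Q[h], X @ W_K[h], X @ W_V[h]
    A = softmax(Q @ K.T / np.sqrt(d_head))
    return (A @ V) @ W_O[h]           # this head's contribution to the stream

# The layer just adds up the independent per-head writes.
X_next = X + sum(head_write(h) for h in range(n_heads))
```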