Latent Diffusion Models: https://arxiv.org/pdf/2112.10752.pdf
[Stability.ai Stable video diffusion](https://richord.notion.site/Stability-ai-Stable-video-diffusion-9d8d640d710b4febaa612a2f7c598135)
CRATE
Diffusion Transformer: https://arxiv.org/pdf/2212.09748.pdf
3.2. Diffusion Transformer Design Space
Patchify
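Patchify turns the spatial latent into a sequence of tokens by cutting it into non-overlapping p×p patches and flattening each one. A minimal sketch (the latent shape and patch size here are illustrative, not the paper's exact configuration):

```python
import torch

def patchify(x: torch.Tensor, p: int) -> torch.Tensor:
    # x: (B, C, H, W) latent; split into non-overlapping p x p patches,
    # each flattened into a token of dimension p*p*C
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // p, p, W // p, p)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)
    return x

# e.g. a 32x32 latent with 4 channels and patch size 2 -> 256 tokens of dim 16
tokens = patchify(torch.randn(1, 4, 32, 32), p=2)
print(tokens.shape)  # torch.Size([1, 256, 16])
```

In DiT a linear embedding then maps each flattened patch to the model's hidden dimension; smaller p means more tokens and more compute.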
Cross-attention block
- Good ole’ attention: a multi-head cross-attention layer (added after self-attention) in which the patch tokens attend to the concatenation of the timestep embedding t and the class embedding c
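A minimal sketch of that conditioning, assuming hypothetical sizes (hidden dim 64, 4 heads): the patch tokens are the queries, and the two-token sequence [t, c] supplies the keys/values.

```python
import torch
import torch.nn as nn

d = 64  # hypothetical hidden dimension
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

x = torch.randn(1, 256, d)                       # patch tokens (queries)
t_emb = torch.randn(1, 1, d)                     # timestep embedding t
c_emb = torch.randn(1, 1, d)                     # class embedding c
cond = torch.cat([t_emb, c_emb], dim=1)          # length-2 conditioning sequence

out, _ = attn(query=x, key=cond, value=cond)     # tokens attend to [t, c]
print(out.shape)  # torch.Size([1, 256, 64])
```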
AdaLN (adaptive layer norm)
- A normalization layer whose parameters are predicted from the conditioning: a SiLU followed by a linear layer projects onto 6x the hidden dimension, which is then chunked into the per-block modulation vectors
- Regresses the normalization parameters shift and scale from the sum of the timestep embedding t and the class embedding c
- This means that for each input, depending on its associated timestep and class, the normalization parameters change, letting the model adapt its behavior to the specifics of the input.
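The mechanics above can be sketched as follows, with an illustrative hidden size; the 6-way chunk gives shift/scale/gate for both the attention and MLP sub-blocks (the gate is part of the adaLN-Zero variant):

```python
import torch
import torch.nn as nn

hidden = 64  # hypothetical hidden dimension

# SiLU + linear projecting onto 6x the hidden dim, chunked into six vectors
mod = nn.Sequential(nn.SiLU(), nn.Linear(hidden, 6 * hidden))

t_emb = torch.randn(1, hidden)   # timestep embedding
c_emb = torch.randn(1, hidden)   # class embedding
shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = \
    mod(t_emb + c_emb).chunk(6, dim=1)

# apply the regressed shift/scale in place of learned LayerNorm affine params
x = torch.randn(1, 256, hidden)  # patch tokens
ln = nn.LayerNorm(hidden, elementwise_affine=False)
h = ln(x) * (1 + scale_msa.unsqueeze(1)) + shift_msa.unsqueeze(1)
print(h.shape)  # torch.Size([1, 256, 64])
```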
Zero-out (adaLN-Zero):
- Initializes the final linear layer of the adaLN MLP to zero, so all regressed shift/scale/gate parameters start at zero and each DiT block is the identity function at initialization, which the paper finds improves training
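In adaLN-Zero, the final linear layer of the modulation MLP is zero-initialized, so the gates that scale each residual branch start at zero and every block begins as an identity mapping. A minimal sketch (hidden size is illustrative):

```python
import torch
import torch.nn as nn

hidden = 64  # hypothetical hidden dimension

# adaLN modulation MLP: SiLU + linear onto 6x the hidden dim
mod = nn.Sequential(nn.SiLU(), nn.Linear(hidden, 6 * hidden))

# zero-out: zero-init the final linear so all shifts/scales/gates start at 0
nn.init.zeros_(mod[-1].weight)
nn.init.zeros_(mod[-1].bias)

out = mod(torch.randn(2, hidden))
print(out.abs().max())  # 0: with zero gates, each residual branch is skipped
```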