Latent Diffusion Models: https://arxiv.org/pdf/2112.10752.pdf
[Stability.ai Stable video diffusion](https://richord.notion.site/Stability-ai-Stable-video-diffusion-9d8d640d710b4febaa612a2f7c598135)
CRATE
Diffusion Transformer: https://arxiv.org/pdf/2212.09748.pdf
3.2. Diffusion Transformer Design Space
Patchify
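Patchify turns the spatial latent into a sequence of tokens by cutting it into non-overlapping p×p patches and flattening each one. A minimal sketch (the latent shape and patch size here are illustrative, not the paper's exact configuration):

```python
import torch

def patchify(x: torch.Tensor, p: int) -> torch.Tensor:
    # x: (B, C, H, W) latent; split into non-overlapping p x p patches,
    # each flattened into a token of dimension p*p*C
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // p, p, W // p, p)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)
    return x

# e.g. a 32x32 latent with 4 channels and patch size 2 -> 256 tokens of dim 16
tokens = patchify(torch.randn(1, 4, 32, 32), p=2)
print(tokens.shape)  # torch.Size([1, 256, 16])
```

In DiT a linear embedding then maps each flattened patch to the model's hidden dimension; smaller p means more tokens and more compute.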
Cross-attention block
- Good ole’ attention: a multi-head cross-attention layer (added after self-attention) in which the patch tokens attend to the concatenation of the timestep embedding t and the class embedding c
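A minimal sketch of that conditioning, assuming hypothetical sizes (hidden dim 64, 4 heads): the patch tokens are the queries, and the two-token sequence [t, c] supplies the keys/values.

```python
import torch
import torch.nn as nn

d = 64  # hypothetical hidden dimension
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

x = torch.randn(1, 256, d)                       # patch tokens (queries)
t_emb = torch.randn(1, 1, d)                     # timestep embedding t
c_emb = torch.randn(1, 1, d)                     # class embedding c
cond = torch.cat([t_emb, c_emb], dim=1)          # length-2 conditioning sequence

out, _ = attn(query=x, key=cond, value=cond)     # tokens attend to [t, c]
print(out.shape)  # torch.Size([1, 256, 64])
```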
AdaLN (adaptive layer norm)
- A normalization layer whose parameters are predicted from the conditioning: a SiLU followed by a linear layer projects onto 6x the hidden dimension, which is then chunked into the per-block modulation vectors
- Regresses the normalization parameters shift and scale from the sum of the timestep embedding t and the class embedding c
- This means that for each input, depending on its associated timestep and class, the normalization parameters change, letting the model adapt its behavior to the specifics of the input.
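The mechanics above can be sketched as follows, with an illustrative hidden size; the 6-way chunk gives shift/scale/gate for both the attention and MLP sub-blocks (the gate is part of the adaLN-Zero variant):

```python
import torch
import torch.nn as nn

hidden = 64  # hypothetical hidden dimension

# SiLU + linear projecting onto 6x the hidden dim, chunked into six vectors
mod = nn.Sequential(nn.SiLU(), nn.Linear(hidden, 6 * hidden))

t_emb = torch.randn(1, hidden)   # timestep embedding
c_emb = torch.randn(1, hidden)   # class embedding
shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = \
    mod(t_emb + c_emb).chunk(6, dim=1)

# apply the regressed shift/scale in place of learned LayerNorm affine params
x = torch.randn(1, 256, hidden)  # patch tokens
ln = nn.LayerNorm(hidden, elementwise_affine=False)
h = ln(x) * (1 + scale_msa.unsqueeze(1)) + shift_msa.unsqueeze(1)
print(h.shape)  # torch.Size([1, 256, 64])
```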
Zero-out (adaLN-Zero):
- Initializes the final linear layer of the adaLN MLP to zero, so all regressed shift/scale/gate parameters start at zero and each DiT block is the identity function at initialization, which the paper finds improves training
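In adaLN-Zero, the final linear layer of the modulation MLP is zero-initialized, so the gates that scale each residual branch start at zero and every block begins as an identity mapping. A minimal sketch (hidden size is illustrative):

```python
import torch
import torch.nn as nn

hidden = 64  # hypothetical hidden dimension

# adaLN modulation MLP: SiLU + linear onto 6x the hidden dim
mod = nn.Sequential(nn.SiLU(), nn.Linear(hidden, 6 * hidden))

# zero-out: zero-init the final linear so all shifts/scales/gates start at 0
nn.init.zeros_(mod[-1].weight)
nn.init.zeros_(mod[-1].bias)

out = mod(torch.randn(2, hidden))
print(out.abs().max())  # 0: with zero gates, each residual branch is skipped
```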