<aside> 💡

• Semantic segmentation refers to grouping the pixels of an image into meaningful classes. CRATE discovers these segmentation properties that will allow us

</aside>

<aside> 💡 Unrolling refers to the process of expanding a recurrent or iterative computational graph into a feedforward network, where each iteration becomes a layer in the network.

Local Signal Model: The low-dimensional subspace approximations at each layer ℓ, which represent the distribution of the input tokens Z_ℓ.

Linearization: The process of simplifying a complex, possibly nonlinear function into a linear approximation.

Global Scale: Refers to the entire span of the model, encompassing all layers.

</aside>
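To make the unrolling definition concrete, here is a minimal sketch that unrolls a standard iterative algorithm (ISTA, for sparse coding) into a fixed stack of layers. The choice of ISTA, the dictionary `D`, and the names `ista_loop` and `make_unrolled_network` are assumptions made for illustration; they are not taken from the paper's code.

```python
import numpy as np

def soft_threshold(x, thresh):
    """Proximal operator of the L1 norm: shrink every entry toward zero."""
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

def ista_step(z, x, D, step, lam):
    """One ISTA iteration for min_z 0.5 * ||x - D z||^2 + lam * ||z||_1."""
    grad = D.T @ (D @ z - x)
    return soft_threshold(z - step * grad, step * lam)

def ista_loop(x, D, step, lam, num_iters):
    """Iterative form: the same update applied repeatedly in a loop."""
    z = np.zeros(D.shape[1])
    for _ in range(num_iters):
        z = ista_step(z, x, D, step, lam)
    return z

def make_unrolled_network(D, step, lam, num_layers):
    """Unrolled form: each iteration becomes an explicit layer with its own parameters."""
    layers = [{"D": D.copy(), "step": step, "lam": lam} for _ in range(num_layers)]

    def forward(x):
        z = np.zeros(D.shape[1])
        for layer in layers:
            z = ista_step(z, x, layer["D"], layer["step"], layer["lam"])
        return z

    return forward, layers

rng = np.random.default_rng(0)
D = rng.standard_normal((8, 16))          # hypothetical dictionary
x = rng.standard_normal(8)                # hypothetical input signal
step = 1.0 / np.linalg.norm(D, 2) ** 2    # 1 / Lipschitz constant of the gradient
lam = 0.1

z_loop = ista_loop(x, D, step, lam, num_iters=5)
forward, layers = make_unrolled_network(D, step, lam, num_layers=5)
z_net = forward(x)
```

With identical parameters in every layer, the unrolled network reproduces five loop iterations exactly. The point of unrolling is that each layer's parameters can then be trained separately, which a plain loop does not allow.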

Key ideas: segmentation, vision transformers.

The paper investigates whether the segmentation capabilities observed in ViTs trained with self-supervised methods like DINO are due solely to the training method, or whether they can also emerge through proper architectural design in supervised settings.

Segmentation is an emergent property of self-supervised training.

The passage says that vision transformers trained with the DINO self-supervised method seem to learn implicitly how to segment images into semantic categories, without ever being trained explicitly on pixel-level segmentation labels.

One way to promote segmentation properties is to use a white-box transformer architecture designed with the structure of the input data in mind.

White-Box Vision Transformers

<aside> 💡 TL;DR: the model transforms the input data step by step, iterating on it layer after layer until it becomes a linear, compact feature representation.

</aside>

The model learns a function that transforms input data with a multimodal, nonlinear distribution into linearized, compact feature representations in a lower-dimensional space.

<aside> 💡

After the weighted computation, the token distribution in Z is modified to create a new representation that aligns with the multiple subspaces U_1, …, U_K.

This makes the tokens easier to work with or analyze, because they are now better represented by a set of "basis vectors" in multiple subspaces.

</aside>
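The note above does not spell out the weighted computation. As a rough sketch, assume it is a self-attention-style operator in which each subspace basis U_k plays the role of one attention head: tokens are projected onto U_k, similarity weights are computed inside that subspace, and the weighted result is lifted back and added to Z. The function name `subspace_attention_step`, the `scale` parameter, and the random bases are hypothetical, and the exact operator and scaling used in the paper may differ.

```python
import numpy as np

def softmax_columns(a):
    """Numerically stable softmax applied to each column."""
    a = a - a.max(axis=0, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=0, keepdims=True)

def subspace_attention_step(Z, U_list, scale=1.0):
    """Z: (d, N) tokens as columns. U_list: K bases, each (d, p) with orthonormal columns."""
    update = np.zeros_like(Z)
    for U in U_list:
        P = U.T @ Z                     # (p, N) token coordinates inside this subspace
        W = softmax_columns(P.T @ P)    # (N, N) similarity weights between tokens
        update += U @ (P @ W)           # weighted average, lifted back to R^d
    return Z + scale * update

rng = np.random.default_rng(0)
d, p, K, N = 12, 2, 3, 20
# Hypothetical, mutually orthogonal bases built from one QR factorization.
Q, _ = np.linalg.qr(rng.standard_normal((d, K * p)))
U_list = [Q[:, k * p:(k + 1) * p] for k in range(K)]
Z = rng.standard_normal((d, N))
Z_new = subspace_attention_step(Z, U_list, scale=0.5)

def outside_span(M):
    """Component of each token orthogonal to the span of all subspaces."""
    return M - Q @ (Q.T @ M)
```

In this sketch the update only moves tokens within the span of U_1, …, U_K. The component outside that span is left untouched, so `outside_span(Z)` and `outside_span(Z_new)` agree.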

The local signal model asserts that the tokens lie in several low-dimensional subspaces.
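To illustrate what "tokens contained in several low-dimensional subspaces" means, the sketch below builds synthetic tokens from a union of subspaces and recovers each token's subspace from its projection energy. The dimensions, the random bases, and all variable names are assumptions chosen for the example, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, K, n_per = 16, 2, 3, 50   # ambient dim, subspace dim, subspaces, tokens per subspace

# Hypothetical orthonormal bases U_1, ..., U_K, each of shape (d, p).
bases = [np.linalg.qr(rng.standard_normal((d, p)))[0] for _ in range(K)]

# Sample tokens that lie exactly in one of the K subspaces.
tokens = np.concatenate(
    [U @ rng.standard_normal((p, n_per)) for U in bases], axis=1
)                                # (d, K * n_per)
labels = np.repeat(np.arange(K), n_per)

# Energy of each token captured by each subspace: ||U_k^T z||^2.
energy = np.stack([np.sum((U.T @ tokens) ** 2, axis=0) for U in bases])  # (K, N)
assigned = energy.argmax(axis=0)

# Although tokens live in R^d, together they span only K * p directions.
rank = np.linalg.matrix_rank(tokens)
```

Each token keeps all of its energy when projected onto its own subspace and loses some when projected onto any other, so `assigned` matches `labels`. The token matrix also has rank `K * p`, well below the ambient dimension `d`.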

The tokens z are the outputs of the local signal model. Each token represents an image patch.

This optimizes sparse rate reduction: