<aside> 💡

• Semantic segmentation information refers to the ability to group pixels of an image into meaningful classes. CRATE discovers these segmentation properties that will allow us

</aside>

<aside> 💡 Unrolling refers to the process of expanding a recurrent or iterative computational graph into a feedforward network, where each iteration becomes a layer in the network.

Local Signal Model: The low-dimensional subspace approximations at each layer ℓ which represent the distribution of input tokens *Z_*ℓ. Linearization: The process of simplifying a complex, possibly nonlinear function into a linear approximation. Global Scale: Refers to the entire span of the model, encompassing all layers.

</aside>

Ideas of segmentation, Vision transformers,

The paper aims to investigate whether the segmentation capabilities observed in ViTs trained with self-supervised methods like DINO are solely due to the training method or if they can also emerge through proper architectural design in supervised settings.

Segmentation is emergent property from self supervised mechanisms

The passage is saying that vision transformers trained with the DINO self-supervised method seem to implicitly learn how to segment images into semantic categories, without ever being trained explicitly on pixel-level segmentation labeling.

To promote segmentation properties is to use a white box transformer architecture with the input data in mind.

WhiteBox Vision Transformers

<aside> 💡 TLDR; transforms input data in a friendly manner, by contantly iterating on input data to convert to linear compact feature representation

its an empirical experiment, so there is no formal motivation of why this works

</aside>

some function transform input data that is multimodal and nonlinear to linearized and compact feature representations in lower dimensional space.

<aside> 💡

After the weighted computation, the token distribution in Z is modified to create a new representation that effectively aligns with the multiple subspaces U1, …. UK.

This makes the data easier to work with or analyze because they are now better represented by a set of "basis vectors" in multiple subspaces.

</aside>

Local signal model asserts tokens are contained in serveral low dimensional subspaces.

tokens z are outputs of the local signal model. They represent image patches

This optimizes sparse rate reduction:

information that is compressed keeps the most important things while removing other pieces.
Instead of performing calculations in a loop, where the output of one step feeds into the next, each step is laid out in sequence. These sequences become layers in the model, making the process easier to analyze and optimize.