https://arxiv.org/pdf/2312.01597.pdf
GOAL: improve CLIP's performance on semantic segmentation.
It introduces a new module that replaces the self-attention block in the CLIP vision encoder.
- targets CLIP's Spatial-Invariant Features (a root cause of its weak dense prediction)
- Spatial invariance means a model responds to a feature the same way regardless of where it appears in the input: a spatially invariant model recognizes an edge, a pattern, or an object no matter its position in the image.
- Spatially invariant models detect features uniformly across the entire image, abstracting away positional information and focusing only on whether a feature is present. That is useful for classification, but it discards the localization that segmentation needs.
- Uses Correlative Self-Attention (CSA)
- computes attention scores from pairwise correlations among local tokens (e.g., q·qᵀ and k·kᵀ rather than the usual q·kᵀ), so each token attends mainly to itself and to semantically similar positions, keeping features localized
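A minimal NumPy sketch of the correlative-attention idea described above. Toy shapes, random weights, and the function name are illustrative assumptions, not the paper's implementation; the point is only that the score matrix comes from q·qᵀ and k·kᵀ self-correlations instead of q·kᵀ:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def correlative_self_attention(x, W_q, W_k, W_v):
    """Sketch of CSA: scores from pairwise correlations of q with
    itself and k with itself, instead of cross-correlating q with k."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    d = q.shape[-1]
    # Each softmax term correlates every local token with every other
    # token in the SAME projection, so similar tokens attend to each other.
    attn = softmax(q @ q.T / np.sqrt(d)) + softmax(k @ k.T / np.sqrt(d))
    return attn @ v

rng = np.random.default_rng(0)
n, d = 5, 8  # 5 local tokens of dimension 8 (toy sizes)
x = rng.standard_normal((n, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
out = correlative_self_attention(x, W_q, W_k, W_v)
print(out.shape)  # (5, 8): one output vector per local token
```

Because each token's score against itself is its squared projected norm, the attention map tends to stay concentrated near the token's own position, which is the spatially covariant behavior the module is after.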