https://arxiv.org/pdf/2312.01597.pdf

GOAL: improve the segmentation performance of CLIP

The paper introduces a new module that replaces the self-attention block in the CLIP vision encoder.
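
To make the idea of "replacing the self-attention block" concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy `VisionEncoder`, `Block`, and `SelfAttention` classes stand in for the real CLIP code, `ReplacementAttention` is a hypothetical module (it scores queries against queries instead of queries against keys) and not necessarily the one the paper proposes, and swapping only the last block is a choice made for the demo. The point is just the pattern: keep the trained projection weights and change how attention is computed.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(a, b, v):
    """Return softmax(a b^T / sqrt(d)) v for row-major matrices."""
    scale = math.sqrt(len(a[0]))
    scores = [[sum(x * y for x, y in zip(ra, rb)) / scale for rb in b] for ra in a]
    return matmul([softmax(row) for row in scores], v)


class SelfAttention:
    """Toy single-head attention: softmax(Q K^T / sqrt(d)) V."""

    def __init__(self, w_q, w_k, w_v):
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v

    def __call__(self, tokens):
        q, k, v = (matmul(tokens, w) for w in (self.w_q, self.w_k, self.w_v))
        return attend(q, k, v)


class ReplacementAttention(SelfAttention):
    """Hypothetical stand-in module: softmax(Q Q^T / sqrt(d)) V."""

    @classmethod
    def from_pretrained(cls, old):
        # Reuse the existing projection weights; nothing is retrained.
        return cls(old.w_q, old.w_k, old.w_v)

    def __call__(self, tokens):
        q = matmul(tokens, self.w_q)
        v = matmul(tokens, self.w_v)
        return attend(q, q, v)


class Block:
    """Toy encoder block: attention plus a residual connection."""

    def __init__(self, attn):
        self.attn = attn

    def __call__(self, tokens):
        out = self.attn(tokens)
        return [[t + o for t, o in zip(t_row, o_row)] for t_row, o_row in zip(tokens, out)]


class VisionEncoder:
    """Toy stand-in for a ViT-style vision encoder (not the real CLIP class)."""

    def __init__(self, blocks):
        self.blocks = blocks

    def __call__(self, tokens):
        for block in self.blocks:
            tokens = block(tokens)
        return tokens


def replace_attention(encoder, layer_index, make_attn):
    """Swap the attention module of one block, built from the old module."""
    block = encoder.blocks[layer_index]
    block.attn = make_attn(block.attn)


# Demo with made-up weights: 3 tokens of dimension 2, two blocks.
eye = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[0.0, 1.0], [1.0, 0.0]]
encoder = VisionEncoder([Block(SelfAttention(eye, w_k, eye)) for _ in range(2)])
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

before = encoder(tokens)
replace_attention(encoder, -1, ReplacementAttention.from_pretrained)  # last block only (assumption)
after = encoder(tokens)
```

In a real setup the same swap would be done on the loaded CLIP model's attention submodule rather than on these toy classes. Because the replacement reuses the pretrained projections, this kind of change needs no additional training.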