<aside> 💡 The goal is to build a slot-dynamic architecture that can segment a variable number of objects in an image.
Currently: fixed-slot papers successfully segment scenes only when the object count matches the expected, fixed number of slots. Given an image with a different object count, they fail to segment correctly (splitting one object across two slots, or missing an object entirely).
Find which parts of the architecture let us exploit the slot identifiability of objects, while de-emphasizing the reliance on one slot per segmented object.
</aside>
Humans are likely governed by simple learning mechanisms that, on their own, lead us to methodical counting.
Given a symbol like "1", extrapolate the concept of counting from it. Candidate backbones: self-supervised vision models like MAE or DINO, or vision-language models like M3AE.
Set up a framework over image-text pairs and get the representations we want. The API should be simple: here is an image, here is text (see the sketch below).
Find emergent properties that enable methodical counting
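A minimal sketch of what that "simple API" could look like, assuming generic placeholder backbones; the `ImageTextEncoder` class, its dimensions, and the dummy vocabulary size are hypothetical, not any specific model's interface:

```python
import torch
import torch.nn as nn

class ImageTextEncoder(nn.Module):
    """Hypothetical minimal interface: give an image and a text prompt,
    get back a pair of representations we can later probe for counting."""

    def __init__(self, dim=256, vocab_size=30522):
        super().__init__()
        # Placeholder backbones; in practice these would be MAE/DINO for
        # vision and a text transformer (or a joint model like M3AE).
        self.vision = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        self.text = nn.EmbeddingBag(num_embeddings=vocab_size, embedding_dim=dim)

    def forward(self, image, token_ids):
        return self.vision(image), self.text(token_ids)

encoder = ImageTextEncoder()
img = torch.randn(1, 3, 224, 224)          # "here is an image"
tokens = torch.randint(0, 30522, (1, 8))   # "here is text" (dummy token ids)
img_repr, txt_repr = encoder(img, tokens)
```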
Propose using object-centric representation learning via slot attention. We like this because each slot can segment part of the image, and the decoder is pushed to disentangle the representation into objects.
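For reference, a minimal sketch of the Slot Attention update (after Locatello et al., 2020); the hyperparameters are arbitrary and the paper's residual MLP after the GRU is omitted for brevity:

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Slots compete for image features via a softmax over the slot axis,
    so each slot tends to bind to (and segment) one object."""

    def __init__(self, num_slots=7, dim=64, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slot_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slot_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, feats):                       # feats: (B, N, dim)
        B, N, D = feats.shape
        feats = self.norm_in(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        # Sample initial slots from a learned Gaussian.
        slots = self.slot_mu + self.slot_logsigma.exp() * torch.randn(
            B, self.num_slots, D, device=feats.device)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # Softmax over the *slot* axis: slots compete for each feature.
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)  # weighted mean
            updates = attn @ v                                      # (B, S, D)
            slots = self.gru(updates.reshape(-1, D),
                             slots.reshape(-1, D)).view(B, -1, D)
        return slots                                # one slot ≈ one object

slots = SlotAttention()(torch.randn(2, 196, 64))    # e.g. a 14x14 feature map
```

The softmax over the slot axis is the key property here: it is what lets a slot specialize to one object, and also what makes a fixed slot count brittle when the number of objects changes.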
Mapping text to images: how can we integrate CLIP into this?
We want to disentangle them. Start with slots first: maybe map each piece of text to a token (slot) or a factor of variation (see the sketch below).
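One hypothetical way to start: take text embeddings from CLIP's text encoder and softly assign each phrase to the slot it most resembles in a shared projection space. Everything below (projection heads, temperature, dimensions) is an assumption, not an established recipe:

```python
import torch
import torch.nn.functional as F

def bind_text_to_slots(slots, text_emb, proj_slot, proj_text, temperature=0.07):
    """slots: (B, S, D_slot), text_emb: (B, T, D_text) -> (B, T, S) soft assignment."""
    s = F.normalize(proj_slot(slots), dim=-1)      # slots into shared space
    t = F.normalize(proj_text(text_emb), dim=-1)   # text into shared space
    sim = t @ s.transpose(1, 2) / temperature      # text-slot cosine similarity
    return sim.softmax(dim=-1)                     # each phrase picks its slot

B, S, T, D = 2, 7, 3, 64
proj_slot = torch.nn.Linear(64, D)
proj_text = torch.nn.Linear(512, D)                # CLIP ViT-B text dim is 512
assign = bind_text_to_slots(torch.randn(B, S, 64),
                            torch.randn(B, T, 512), proj_slot, proj_text)
```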
Count anything with text guidance
https://arxiv.org/pdf/2305.07304.pdf
Can identify many objects of a given class specified by text (e.g., carrots).
Does not use patch annotations (because it is zero-shot).
Uses patch localization: each patch is projected to the text dimension and scored by similarity with the text (see the sketch after this list).
Fine-tunes by adding learnable weights after the transformer as visual prompts.
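A rough sketch of that patch-text similarity idea, in the spirit of the paper above; the projection layer, dimensions, and function names are placeholders, not the paper's actual modules:

```python
import torch
import torch.nn.functional as F

def patch_text_similarity(patch_tokens, text_emb, proj):
    """patch_tokens: (B, N, D_vis), text_emb: (B, D_txt) -> (B, N) similarity map."""
    p = F.normalize(proj(patch_tokens), dim=-1)         # project to text dimension
    t = F.normalize(text_emb, dim=-1).unsqueeze(-1)     # (B, D_txt, 1)
    return (p @ t).squeeze(-1)                          # per-patch score vs. prompt

B, N = 1, 196                                           # 14x14 patch grid
proj = torch.nn.Linear(768, 512)                        # ViT-B visual -> CLIP text dim
sim_map = patch_text_similarity(torch.randn(B, N, 768),
                                torch.randn(B, 512), proj)  # text: "a photo of carrots"
coarse_map = sim_map.view(B, 14, 14)                    # coarse localization map
```

Per the note above, fine-tuning then only trains the small set of added visual-prompt weights after the transformer, presumably leaving the CLIP backbone frozen.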