Problem
CLIP is a multimodal model that embeds images and text into a shared embedding space. The language component is “bottlenecked” when it is forced to share that space with images, which often limits its expressive capability.
A language model trained to understand nuances in text may not be able to capture those nuances when its outputs have to be directly comparable to image features.
The vision side is also less controllable: the model's behavior is fixed by weights trained jointly on images and text.
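A minimal sketch of the bottleneck, assuming Hugging Face transformers and the openai/clip-vit-base-patch32 checkpoint: both encoders produce sequences of per-token / per-patch features, but contrastive training only ever compares one pooled, projected vector per modality.

```python
# Sketch: CLIP's encoders produce per-token / per-patch features, but the
# contrastive comparison uses only a single pooled + projected vector per side.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder image
inputs = processor(text=["a red cube on a blue sphere"], images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    text_out = model.text_model(input_ids=inputs["input_ids"],
                                attention_mask=inputs["attention_mask"])
    image_out = model.vision_model(pixel_values=inputs["pixel_values"])

print(text_out.last_hidden_state.shape)   # (1, num_text_tokens, 512)  -- per-token features
print(image_out.last_hidden_state.shape)  # (1, num_patches + 1, 768)  -- per-patch features

# Only these pooled, projected vectors are compared during CLIP training:
text_vec = model.text_projection(text_out.pooler_output)      # (1, 512)
image_vec = model.visual_projection(image_out.pooler_output)  # (1, 512)
print(torch.cosine_similarity(text_vec, image_vec))           # one scalar per image-text pair
```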
Definitions
Solution/Approaches
- Extended CLIP: align not just the full image with the full prompt, but individual words with sections (patches/tokens) of the image.
- Build a transformer?
- Experimentation: take the CLIP features (before they are pooled into a single vector), train a model on top of them that we hypothesize does vision/language binding better, and see whether it (hopefully) improves Stable Diffusion; see the sketch after this list.
- Take the existing CLIP and fine-tune its parameters?
- Add an additional layer or model that operates on the features extracted by the original CLIP model; this addition is trained on tasks that improve the model's vision-language binding. (Here “features” means the per-token activations the encoders output before pooling, as in the experimentation point above, not the weights of the output layer.)
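One way to make the “model on top of CLIP features” idea concrete is a small cross-attention binding head trained on the frozen per-token text features and per-patch image features. This is a hypothetical sketch, not a settled design; the class name, dimensions, and pooling choice are all illustrative assumptions.

```python
# Hypothetical "binding head" on top of frozen CLIP features: text tokens attend
# over image patches, giving word <-> image-region alignment instead of a single
# global similarity. All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class BindingHead(nn.Module):
    def __init__(self, text_dim=512, image_dim=768, shared_dim=512, num_heads=8):
        super().__init__()
        # Project both modalities to a shared width so they can attend to each other.
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.image_proj = nn.Linear(image_dim, shared_dim)
        self.cross_attn = nn.MultiheadAttention(shared_dim, num_heads, batch_first=True)
        self.score = nn.Linear(shared_dim, 1)

    def forward(self, text_tokens, image_patches):
        # text_tokens:   (B, T, text_dim)  -- frozen CLIP per-token text features
        # image_patches: (B, P, image_dim) -- frozen CLIP per-patch image features
        q = self.text_proj(text_tokens)
        kv = self.image_proj(image_patches)
        attended, attn_weights = self.cross_attn(q, kv, kv)  # (B, T, shared_dim), (B, T, P)
        # Pool the attended tokens into one matching score per image-text pair.
        match_score = self.score(attended.mean(dim=1)).squeeze(-1)  # (B,)
        return match_score, attn_weights

# Usage with dummy tensors shaped like CLIP ViT-B/32 outputs:
head = BindingHead()
text_feats = torch.randn(2, 7, 512)    # 2 captions, 7 tokens each
image_feats = torch.randn(2, 50, 768)  # 49 patches + CLS token
scores, weights = head(text_feats, image_feats)
print(scores.shape, weights.shape)     # torch.Size([2]) torch.Size([2, 7, 50])
```

A head like this could be trained contrastively on matched vs. mismatched image-caption pairs, and its token-level outputs could then be swapped in for the text conditioning that Stable Diffusion's cross-attention layers consume, to test whether binding-aware features improve generation.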
Misc Details
Meetings
Why Research This
Benchmarks show that CLIP, a dual-stream model, achieves only 30.75% accuracy compared to random chance at 25%.
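The 25% chance baseline implies a four-way multiple-choice setup. A hedged sketch of how such an accuracy number could be computed from CLIP similarity scores (the benchmark name and data format are not given in these notes, so the layout below is an assumption):

```python
# Sketch of a generic 4-way image-to-text matching evaluation (25% random chance).
# Assumed format: each image comes with four candidate captions, one correct.
import torch
import torch.nn.functional as F

def four_way_accuracy(image_embs, caption_embs, correct_idx):
    # image_embs:   (N, D)     one embedding per image
    # caption_embs: (N, 4, D)  four candidate caption embeddings per image
    # correct_idx:  (N,)       index of the correct caption
    image_embs = F.normalize(image_embs, dim=-1)
    caption_embs = F.normalize(caption_embs, dim=-1)
    sims = torch.einsum("nd,nkd->nk", image_embs, caption_embs)  # (N, 4) cosine similarities
    predictions = sims.argmax(dim=-1)
    return (predictions == correct_idx).float().mean().item()

# Sanity check: random embeddings should land near the 0.25 chance baseline.
N, D = 1000, 512
acc = four_way_accuracy(torch.randn(N, D), torch.randn(N, 4, D),
                        torch.randint(0, 4, (N,)))
print(f"accuracy: {acc:.3f}")  # ~0.25 for random embeddings
```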