Problem

CLIP is a multimodal model that embeds images and text into a shared embedding space. The language component is “bottlenecked” when it is forced to share a space with images, which often limits its expressive capability.

A language model trained to understand nuances in text may not be able to capture those nuances when its outputs have to be directly comparable to image features.

The vision side is less controllable: the model's behavior is defined by weights trained jointly on images and text.

Definitions

Solution/Approaches

  1. Extended CLIP: alignment is not just between a full image and a full prompt, but between individual words and regions (or patch tokens) of the image (see the token-to-patch sketch after this list).
    1. Build transformer?
  2. Experimentation: take the CLIP features (before they are pooled into a single vector), train a model on top of them that we hypothesize does vision/language binding better, and see how it (hopefully) improves Stable Diffusion (see the feature-head sketch after this list).
    1. Take the existing CLIP and fine-tune its parameters?
    2. Add an additional layer or model that operates on the features extracted by the original CLIP model; this added layer is trained to improve the model's vision-language binding. By “features,” do we mean the weights at the output layer?
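
A minimal sketch of the token-to-patch alignment idea (approach 1), using Hugging Face's CLIPModel. The checkpoint name, the example prompt, and the reuse of CLIP's projection heads on per-token hidden states are illustrative assumptions, not decisions made in these notes.

```python
# Sketch: compute a word-to-patch similarity map from CLIP's pre-pooling features.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder image; swap in a real one
inputs = processor(text=["a red cube on a blue sphere"], images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    # Per-patch features: (1, 1 + num_patches, vision_hidden)
    patch_feats = model.vision_model(pixel_values=inputs["pixel_values"]).last_hidden_state
    # Per-token features: (1, seq_len, text_hidden)
    token_feats = model.text_model(input_ids=inputs["input_ids"],
                                   attention_mask=inputs["attention_mask"]).last_hidden_state

    # Reuse CLIP's projection heads to map both sides into the shared space
    patch_emb = model.visual_projection(patch_feats[:, 1:, :])  # drop the CLS patch
    token_emb = model.text_projection(token_feats)

    patch_emb = patch_emb / patch_emb.norm(dim=-1, keepdim=True)
    token_emb = token_emb / token_emb.norm(dim=-1, keepdim=True)

    # Word-to-patch similarity map: (1, seq_len, num_patches)
    sim = torch.einsum("btd,bpd->btp", token_emb, patch_emb)

print(sim.shape)  # each text token gets a similarity score per image patch
```

In this sketch the alignment is read out of a frozen CLIP; the "build a transformer?" question in the list would correspond to training a new module on top of these per-token/per-patch features rather than only scoring them.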
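A minimal sketch of the feature-head experiment (approach 2): CLIP stays frozen and a small trainable "binding head" operates on the per-token hidden states, i.e. the features before pooling into a single vector. The head architecture, its name, and the checkpoint are hypothetical; wiring the output into Stable Diffusion's cross-attention is only indicated in a comment.

```python
# Sketch: a trainable head on top of frozen CLIP text features.
import torch
import torch.nn as nn
from transformers import CLIPTextModel, CLIPTokenizer

class BindingHead(nn.Module):
    """Hypothetical trainable module that refines CLIP token features."""
    def __init__(self, dim: int = 768, n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_feats: torch.Tensor) -> torch.Tensor:
        # Same (batch, seq_len, dim) shape in and out, so the output could be
        # passed to a diffusion model's cross-attention as conditioning.
        return token_feats + self.encoder(token_feats)

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
for p in text_encoder.parameters():      # keep the original CLIP weights fixed
    p.requires_grad_(False)

head = BindingHead(dim=text_encoder.config.hidden_size)  # only this is trained

tokens = tokenizer(["a red cube on a blue sphere"], padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
with torch.no_grad():
    feats = text_encoder(**tokens).last_hidden_state     # (1, 77, 768)
refined = head(feats)                                     # candidate SD conditioning
print(refined.shape)
```

Here "features" means the activations output by CLIP (the per-token hidden states), not the weights; the residual connection keeps the head close to the original CLIP conditioning at initialization.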

Misc Details

Meetings

Why Research this

[Screenshot, 2023-08-30: benchmark results]

Benchmarks show that CLIP, a dual-stream model, reaches only 30.75% accuracy, compared to a random-chance baseline of 25%.