<aside>
💡
- Attention Pooling: A method to create a weighted summary of features using attention mechanisms.
- Query Conditioning: Modifying or influencing the query based on additional context (in this case, the global average-pooled representation of the image) to guide the attention process in a specific direction.
- "query is conditioned on the global average-pooled representation of the image," it means that the query vector used in the attention mechanism is modified or generated based on the global average-pooled representation of the image.
- can take into account the overall context of the image when determining which specific features to pay attention to.
- Computer Vision is a field of study that involves enabling machines to interpret and make decisions based on visual data.
- Natural language supervision refers to the process of using textual information as guidance, hints, or labels during the training of machine learning models. Natural language supervision enables more flexible and human-like learning, bridging the gap between vision and language.
- Linear projection: mapping features to another space by multiplying with a weight matrix and optionally adding a bias term (y = Wx + b).
- multi-modal embedding space: refers to a mathematical space where different types of data (modalities), such as text and images, are represented in a shared and compatible way.
- Residual Networks, or ResNets, are a type of neural network architecture that introduced "residual connections" or "skip connections." These connections allow the output of one layer to bypass one or more intermediate layers and be summed with the output of later layers. This helps mitigate the vanishing gradient problem and enables the training of very deep networks. A minimal residual block is sketched after this callout.
- Vision Transformers (ViTs) apply the transformer architecture, originally designed for natural language processing, to computer vision. Unlike convolutional networks such as ResNets, ViTs divide an image into small patches and process them as a sequence, using self-attention to capture relationships between different parts of the image. A patch-embedding sketch also follows this callout.
- Task learning capability refers to the ability of a machine learning model to learn, understand, and perform specific tasks. In the context of zero-shot transfer, task learning capability emphasizes the model's ability to generalize from its training experience to new, unseen tasks without requiring additional training specific to those tasks.
- Natural distribution shifts: naturally occurring differences between the training data and the images a model encounters in real-world scenarios (as opposed to artificially constructed shifts).
</aside>
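To make the query-conditioning idea concrete, here is a minimal PyTorch sketch (illustrative, not the paper's actual code) of attention pooling where the global average-pooled features act as the query; the module name, head count, and shapes are my own assumptions.

```python
# Minimal sketch of attention pooling where the query is conditioned on the
# global average-pooled image representation (illustrative assumptions only).
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_positions, dim) flattened spatial features
        gap = feats.mean(dim=1, keepdim=True)  # global average pool -> (batch, 1, dim)
        # The GAP vector is the query; the spatial features are keys and values,
        # so the output is a weighted summary guided by the image's overall context.
        pooled, _ = self.attn(query=gap, key=feats, value=feats)
        return pooled.squeeze(1)  # (batch, dim)

x = torch.randn(2, 49, 512)          # e.g. a flattened 7x7 feature map
print(AttentionPool(512)(x).shape)   # torch.Size([2, 512])
```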
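And a tiny sketch of a residual (skip) connection, using fully connected layers for brevity instead of the convolutional blocks of a real ResNet.

```python
# A minimal residual ("skip") connection; fully connected layers stand in
# for the convolutional blocks of an actual ResNet.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The input bypasses the two layers and is added to their output,
        # giving gradients a direct path through very deep stacks.
        return self.act(x + self.fc2(self.act(self.fc1(x))))
```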
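Likewise, a small sketch of the ViT "image as a sequence of patches" idea; the patch size and embedding width are assumptions, not values taken from the paper.

```python
# Sketch of turning an image into a sequence of patch embeddings (ViT-style).
import torch
import torch.nn as nn

patch, dim = 16, 768
# A strided convolution splits the image into non-overlapping 16x16 patches
# and linearly projects each patch to the embedding dimension.
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

img = torch.randn(1, 3, 224, 224)
tokens = to_patches(img).flatten(2).transpose(1, 2)  # (1, 196, 768)
# These patch tokens (plus positional embeddings and a class token) are then
# processed by standard transformer self-attention layers.
print(tokens.shape)
```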
Abstract
Computer vision image classification is usually trained on a fixed, predetermined set of object categories.
Learning directly from raw text about images is a promising alternative: it needs less manual supervision and draws on a much broader source of training signal than labeled datasets.
Introduction
Autoregressive and text-to-text language models have become increasingly performant and powerful. On downstream (task-specific) datasets, they can transfer zero-shot via prompting, without specialized output heads: there is no need to change the architecture, you can simply try a zero-shot prompt and see how well the model generalizes.
In computer vision (CV), it is still common to pretrain on ImageNet or another crowd-labeled benchmark dataset.
Could pretraining methods that learn directly from raw text be helpful here as well?
Direct learning in computer vision typically means leveraging unstructured web data such as text, images, and videos to train models without heavy manual annotation or curation.
<aside>
💡 Direct learning from web text in the context of computer vision refers to the idea of utilizing freely available and abundant textual information on the internet to guide the learning of visual concepts.
In other words, we would be learning about images through the text that accompanies them.
</aside>
Learning image representations with natural language assistance or supervision is still rare in CV.
Prior weakly supervised approaches limit their supervision to fixed sets of 1000 and 18,291 classes, respectively; concepts outside those sets are not labeled and must instead be learned from raw, free-form text.
A crucial difference between these weakly supervised models and recent explorations of learning image representations directly from natural language is scale
CLIP, short for Contrastive Language-Image Pre-training, is an efficient method for learning image representations from natural language supervision.
This work uses natural language supervision to train a model at scale, which allows it to learn rich relationships between text and images, though learning from uncurated web data can also make the model pick up social biases.
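As a rough illustration of how these pieces fit together, here is a hedged PyTorch sketch of a CLIP-style contrastive objective, written in the spirit of the paper's pseudocode: both modalities are linearly projected into a shared multi-modal embedding space and matching image-text pairs are pulled together. The dimensions, temperature value, and random "encoder outputs" below are placeholders, not the paper's settings.

```python
# Hedged sketch of a CLIP-style contrastive objective. Encoder outputs,
# dimensions, and the temperature value are placeholder assumptions.
import torch
import torch.nn.functional as F

def clip_loss(img_feats, txt_feats, W_img, W_txt, temperature=0.07):
    # Linear projection of each modality into the shared multi-modal space
    img_emb = F.normalize(img_feats @ W_img, dim=-1)  # (N, d)
    txt_emb = F.normalize(txt_feats @ W_txt, dim=-1)  # (N, d)
    # Pairwise cosine similarities, scaled by a temperature
    logits = img_emb @ txt_emb.t() / temperature      # (N, N)
    # The i-th image matches the i-th text: symmetric cross-entropy loss
    targets = torch.arange(len(logits))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random stand-ins for the image and text encoder outputs
N, d_img, d_txt, d = 8, 1024, 512, 256
loss = clip_loss(torch.randn(N, d_img), torch.randn(N, d_txt),
                 torch.randn(d_img, d), torch.randn(d_txt, d))
print(loss.item())
```

The symmetric loss means each image must identify its own caption within the batch and each caption its own image; the projections `W_img` and `W_txt` are the "linear projection" into the "multi-modal embedding space" mentioned in the glossary above.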