<aside> 💡

Abstract

State-of-the-art computer vision image classification is usually trained on a fixed, predetermined set of object categories.

Learning directly from raw text about images is a promising alternative, since it requires less manual supervision and moves toward training without explicit labels.

Introduction

New autoregressive and text-to-text models are becoming increasingly performant and powerful. On downstream (task-specific) datasets, zero-shot prompting improves performance without the need for custom output heads: the architecture stays unchanged, and generalizability can be tested by simply trying a zero-shot prompt.
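The "no output head" idea carries over to images: instead of training a classifier layer per task, each candidate class name is turned into a natural-language prompt and scored against the image embedding. A minimal sketch, where `encode_text` is a hypothetical stand-in for any text encoder (not an API from the paper):

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, encode_text):
    """Pick the class whose prompt embedding is most similar to the image.

    `encode_text` is assumed to map a prompt string to an embedding
    vector; no task-specific output head is trained.
    """
    # Wrap each class name in a simple natural-language prompt.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_embs = np.stack([encode_text(p) for p in prompts])

    # Cosine similarity between the image and each prompt embedding.
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb)
    scores = text_embs @ image_emb

    # The best-matching prompt determines the predicted class.
    return class_names[int(np.argmax(scores))]
```

Swapping the class-name list swaps the task; nothing about the model changes, which is what makes zero-shot transfer cheap to evaluate.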

In computer vision it is still common to pretrain on ImageNet or another benchmark dataset.

Could pretraining methods that learn directly from raw text be helpful for vision as well?

Direct learning for Computer Vision (CV) typically means leveraging unstructured data found on the web, such as text, images, and videos, to train models without heavy manual annotation or curation.

<aside> 💡 Direct learning from web text in the context of computer vision refers to the idea of utilizing freely available and abundant textual information on the internet to guide the learning of visual concepts.

This means we would be learning visual concepts through text

</aside>

Natural language assisted or supervised learning is rare for CV (image representation learning).

Prior weakly supervised works limit their supervision to fixed sets of 1,000 and 18,291 classes respectively. Concepts outside those sets are not labeled and must instead be discovered from raw free-form text.

A crucial difference between these weakly supervised models and recent explorations of learning image representations directly from natural language is scale.

CLIP (Contrastive Language-Image Pre-training) is an efficient method for learning image representations from natural language supervision.
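The contrastive objective can be sketched with NumPy: within a batch of (image, text) pairs, the matched pairs on the diagonal of the similarity matrix are positives and every other combination is a negative. This is a minimal sketch of the symmetric cross-entropy loss, not the paper's actual training code; the temperature value here is illustrative.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    # L2-normalize so dot products become cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity logits, scaled by temperature; shape (N, N).
    logits = img @ txt.T / temperature

    # Correct pairings sit on the diagonal; take cross-entropy both ways.
    labels = np.arange(logits.shape[0])
    loss_img = -log_softmax(logits, axis=1)[labels, labels].mean()  # image -> text
    loss_txt = -log_softmax(logits, axis=0)[labels, labels].mean()  # text -> image
    return (loss_img + loss_txt) / 2
```

The symmetry (averaging both directions) means the image and text encoders are trained jointly to pull matched pairs together and push the rest of the batch apart.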

This work uses natural language supervision to train a model at scale. This lets it learn new contexts and relationships between text and images, but learning from uncurated web text can also introduce biases.