DINO:
- Task: DINO is focused on self-supervised learning for vision tasks.
- Architecture: Vision-only (a Vision Transformer backbone); there is no text encoder or text representation learning.
- Learning Strategy: Uses a teacher-student self-distillation paradigm with a cross-entropy loss between teacher and student outputs (not a contrastive loss), unlike the masked prediction in M3AE.
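DINO's self-distillation objective can be sketched in a few lines: the teacher's output is centered and sharpened, and the student is trained to match it via cross-entropy. This is a minimal numpy sketch, assuming illustrative temperatures and a batch-mean center in place of DINO's running center and EMA teacher update.

```python
import numpy as np

def softmax(x, temp):
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, center, t_s=0.1, t_t=0.04):
    """Cross-entropy between sharpened, centered teacher targets and the
    student's predictions -- self-distillation, with no negative pairs."""
    teacher_p = softmax(teacher_logits - center, t_t)   # center + sharpen
    student_logp = np.log(softmax(student_logits, t_s))
    return -(teacher_p * student_logp).sum(axis=-1).mean()

# In DINO the teacher is an EMA of the student; here we only show the loss.
rng = np.random.default_rng(0)
s = rng.normal(size=(8, 16))   # student outputs for 8 crops
t = rng.normal(size=(8, 16))   # teacher outputs for matching crops
c = t.mean(axis=0)             # batch-mean stand-in for the running center
loss = dino_loss(s, t, c)
```

Centering and sharpening of the teacher output are what prevent collapse in the absence of negatives.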
I-JEPA:
- Task: I-JEPA (Image-based Joint-Embedding Predictive Architecture) is self-supervised learning from images alone; it involves no text.
- Architecture: Uses a context encoder, a target encoder (an EMA copy of the context encoder), and a predictor, all operating on image patches.
- Learning Strategy: Predicts the latent representations of masked target blocks from a visible context block, rather than reconstructing pixels or performing masked token prediction as in M3AE.
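The key point above is that I-JEPA's loss lives in representation space. A minimal numpy sketch, assuming a toy linear predictor and pre-computed encoder outputs (both are stand-ins for the real modules):

```python
import numpy as np

def ijepa_loss(context_repr, target_repr, predictor_W):
    """Predict target-block embeddings from context embeddings in latent
    space; the loss is a regression error on representations, not pixels."""
    pred = context_repr @ predictor_W
    return np.mean((pred - target_repr) ** 2)

rng = np.random.default_rng(0)
D = 32                                  # embedding dim (assumption)
W = 0.1 * rng.normal(size=(D, D))       # toy linear "predictor"
ctx = rng.normal(size=(4, D))           # context-encoder summaries, 4 targets
tgt = rng.normal(size=(4, D))           # target-encoder outputs (stop-grad)
loss = ijepa_loss(ctx, tgt, W)
```

Because the targets come from an EMA target encoder with stop-gradient, the model learns abstract features without ever decoding back to pixels or tokens.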
Grad-CAM:
- Task: Grad-CAM is for visual explanations and is not directly a representation learning method.
- Architecture: Operates over existing trained networks to provide class-discriminative visualizations.
- Learning Strategy: No training is involved; it is applied post hoc to interpret a trained model's predictions.
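The Grad-CAM computation itself is simple enough to sketch: global-average-pool the gradients of the class score over each feature map to get channel weights, take the weighted sum of the activations, and apply a ReLU. A numpy sketch with fake activations and gradients standing in for a real network's conv layer:

```python
import numpy as np

def grad_cam(activations, gradients):
    """activations: (C, H, W) conv feature maps; gradients: (C, H, W)
    gradients of the class score w.r.t. those maps.
    Returns an (H, W) class-discriminative heatmap in [0, 1]."""
    weights = gradients.mean(axis=(1, 2))             # GAP over spatial dims
    cam = np.tensordot(weights, activations, axes=1)  # weighted channel sum
    cam = np.maximum(cam, 0)                          # ReLU: positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize for display
    return cam

rng = np.random.default_rng(0)
acts = np.abs(rng.normal(size=(8, 7, 7)))   # fake conv activations
grads = rng.normal(size=(8, 7, 7))          # fake class-score gradients
heatmap = grad_cam(acts, grads)
```

Since only activations and gradients are needed, this works on any trained CNN (or ViT, with adaptations) without modifying or retraining it.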
CRATE:
- Task: CRATE is a white-box transformer for representation learning, not a text-to-image synthesis model.
- Architecture: Each layer is derived as an incremental optimization step on a sparse rate reduction objective, making the architecture mathematically interpretable.
- Learning Strategy: Trained with standard supervised or self-supervised objectives; there is no generator-discriminator setup.
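The quantity at the heart of CRATE's derivation is the lossy coding rate of a feature matrix. A simplified numpy sketch of that rate for features Z of shape (d, n), with an assumed distortion parameter eps (the full sparse rate reduction objective combines such terms over subspaces, which this sketch omits):

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Lossy coding rate R(Z) = 1/2 * logdet(I + d/(n * eps^2) * Z Z^T)
    for features Z of shape (d, n) -- the building block of the sparse
    rate reduction objective CRATE's layers are derived to optimize."""
    d, n = Z.shape
    I = np.eye(d)
    _, logdet = np.linalg.slogdet(I + (d / (n * eps ** 2)) * (Z @ Z.T))
    return 0.5 * logdet

rng = np.random.default_rng(0)
Z = rng.normal(size=(16, 64))
rate = coding_rate(Z)           # > 0 for non-degenerate features
```

Each CRATE layer alternates a step that compresses features against such a rate term with a step that sparsifies them, which is why the learned operators are interpretable by construction.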
CLIP:
- Task: CLIP is for zero-shot learning across vision and text.
- Architecture: Like M3AE, CLIP is multimodal, but it employs separate image and text encoders rather than M3AE's single shared encoder.
- Learning Strategy: Trained with a contrastive objective that aligns paired images and captions, rather than with masked token prediction.
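CLIP's contrastive objective is a symmetric cross-entropy over an image-text similarity matrix, where matched pairs sit on the diagonal. A minimal numpy sketch, assuming pre-computed embeddings and an illustrative temperature:

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temp=0.07):
    """Symmetric InfoNCE: each row (image -> texts) and each column
    (text -> images) is a classification problem whose correct class
    is the matched pair on the diagonal."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temp                  # (N, N) cosine sims / temp
    idx = np.arange(len(logits))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()            # diagonal = matched pairs

    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
imgs = rng.normal(size=(8, 32))   # toy image embeddings
txts = rng.normal(size=(8, 32))   # toy text embeddings
loss = clip_loss(imgs, txts)
```

This pairwise alignment is also what enables zero-shot classification: class names are embedded as text and the nearest caption wins.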