DINO:
- Task: DINO is focused on self-supervised learning for vision tasks.
- Architecture: Vision-only (a Vision Transformer backbone); there is no text encoder or text representation learning.
- Learning Strategy: Uses a teacher-student self-distillation paradigm with a cross-entropy loss between teacher and student outputs (not a contrastive loss), unlike the masked prediction in M3AE.
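DINO's self-distillation objective can be sketched in a few lines: the teacher's output is centered and sharpened, and the student is trained to match it via cross-entropy. This is a minimal numpy sketch, assuming illustrative temperatures and a batch-mean center in place of DINO's running center and EMA teacher update.

```python
import numpy as np

def softmax(x, temp):
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, center, t_s=0.1, t_t=0.04):
    """Cross-entropy between sharpened, centered teacher targets and the
    student's predictions -- self-distillation, with no negative pairs."""
    teacher_p = softmax(teacher_logits - center, t_t)   # center + sharpen
    student_logp = np.log(softmax(student_logits, t_s))
    return -(teacher_p * student_logp).sum(axis=-1).mean()

# In DINO the teacher is an EMA of the student; here we only show the loss.
rng = np.random.default_rng(0)
s = rng.normal(size=(8, 16))   # student outputs for 8 crops
t = rng.normal(size=(8, 16))   # teacher outputs for matching crops
c = t.mean(axis=0)             # batch-mean stand-in for the running center
loss = dino_loss(s, t, c)
```

Centering and sharpening of the teacher output are what prevent collapse in the absence of negatives.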
I-JEPA:
- Task: I-JEPA (Image-based Joint-Embedding Predictive Architecture) is self-supervised learning from images alone; it involves no text.
- Architecture: Uses a context encoder, a target encoder (an EMA copy of the context encoder), and a predictor, all operating on image patches.
- Learning Strategy: Predicts the latent representations of masked target blocks from a visible context block, rather than reconstructing pixels or performing masked token prediction as in M3AE.
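The key point above is that I-JEPA's loss lives in representation space. A minimal numpy sketch, assuming a toy linear predictor and pre-computed encoder outputs (both are stand-ins for the real modules):

```python
import numpy as np

def ijepa_loss(context_repr, target_repr, predictor_W):
    """Predict target-block embeddings from context embeddings in latent
    space; the loss is a regression error on representations, not pixels."""
    pred = context_repr @ predictor_W
    return np.mean((pred - target_repr) ** 2)

rng = np.random.default_rng(0)
D = 32                                  # embedding dim (assumption)
W = 0.1 * rng.normal(size=(D, D))       # toy linear "predictor"
ctx = rng.normal(size=(4, D))           # context-encoder summaries, 4 targets
tgt = rng.normal(size=(4, D))           # target-encoder outputs (stop-grad)
loss = ijepa_loss(ctx, tgt, W)
```

Because the targets come from an EMA target encoder with stop-gradient, the model learns abstract features without ever decoding back to pixels or tokens.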
Grad-CAM:
- Task: Grad-CAM is for visual explanations and is not directly a representation learning method.
- Architecture: Operates over existing trained networks to provide class-discriminative visualizations.
- Learning Strategy: No training is involved; it is applied post hoc to interpret a trained model's predictions.
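The Grad-CAM computation itself is simple enough to sketch: global-average-pool the gradients of the class score over each feature map to get channel weights, take the weighted sum of the activations, and apply a ReLU. A numpy sketch with fake activations and gradients standing in for a real network's conv layer:

```python
import numpy as np

def grad_cam(activations, gradients):
    """activations: (C, H, W) conv feature maps; gradients: (C, H, W)
    gradients of the class score w.r.t. those maps.
    Returns an (H, W) class-discriminative heatmap in [0, 1]."""
    weights = gradients.mean(axis=(1, 2))             # GAP over spatial dims
    cam = np.tensordot(weights, activations, axes=1)  # weighted channel sum
    cam = np.maximum(cam, 0)                          # ReLU: positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize for display
    return cam

rng = np.random.default_rng(0)
acts = np.abs(rng.normal(size=(8, 7, 7)))   # fake conv activations
grads = rng.normal(size=(8, 7, 7))          # fake class-score gradients
heatmap = grad_cam(acts, grads)
```

Since only activations and gradients are needed, this works on any trained CNN (or ViT, with adaptations) without modifying or retraining it.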
CRATE:
- Task: CRATE is a white-box transformer for representation learning, not a text-to-image synthesis model.
- Architecture: Each layer is derived as an incremental optimization step on a sparse rate reduction objective, making the architecture mathematically interpretable.
- Learning Strategy: Trained with standard supervised or self-supervised objectives; there is no generator-discriminator setup.
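The quantity at the heart of CRATE's derivation is the lossy coding rate of a feature matrix. A simplified numpy sketch of that rate for features Z of shape (d, n), with an assumed distortion parameter eps (the full sparse rate reduction objective combines such terms over subspaces, which this sketch omits):

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Lossy coding rate R(Z) = 1/2 * logdet(I + d/(n * eps^2) * Z Z^T)
    for features Z of shape (d, n) -- the building block of the sparse
    rate reduction objective CRATE's layers are derived to optimize."""
    d, n = Z.shape
    I = np.eye(d)
    _, logdet = np.linalg.slogdet(I + (d / (n * eps ** 2)) * (Z @ Z.T))
    return 0.5 * logdet

rng = np.random.default_rng(0)
Z = rng.normal(size=(16, 64))
rate = coding_rate(Z)           # > 0 for non-degenerate features
```

Each CRATE layer alternates a step that compresses features against such a rate term with a step that sparsifies them, which is why the learned operators are interpretable by construction.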
CLIP:
- Task: CLIP is for zero-shot learning across vision and text.
- Architecture: Like M3AE, CLIP is multimodal, but it employs separate image and text encoders rather than M3AE's single shared encoder.
- Learning Strategy: Trained with a contrastive objective that aligns paired images and captions, rather than with masked token prediction.
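CLIP's contrastive objective is a symmetric cross-entropy over an image-text similarity matrix, where matched pairs sit on the diagonal. A minimal numpy sketch, assuming pre-computed embeddings and an illustrative temperature:

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temp=0.07):
    """Symmetric InfoNCE: each row (image -> texts) and each column
    (text -> images) is a classification problem whose correct class
    is the matched pair on the diagonal."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temp                  # (N, N) cosine sims / temp
    idx = np.arange(len(logits))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()            # diagonal = matched pairs

    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
imgs = rng.normal(size=(8, 32))   # toy image embeddings
txts = rng.normal(size=(8, 32))   # toy text embeddings
loss = clip_loss(imgs, txts)
```

This pairwise alignment is also what enables zero-shot classification: class names are embedded as text and the nearest caption wins.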