Self-supervised network that learns from unlabeled image data; its self-attention maps can be used for object segmentation.

DINO (self-DIstillation with NO labels) is a self-supervised learning algorithm. Although it uses a teacher-student setup, it differs from traditional supervised learning (and from standard knowledge distillation) because it doesn't require any human-annotated labels.

The teacher model produces representations (embeddings) of the input images. These are not human-annotated labels but targets generated from the teacher model's current knowledge.

DINO does not predict any predefined labels. Instead, it trains the student network to produce embeddings similar to the teacher's when both are fed the same input. Concretely, the objective minimizes the cross-entropy between the teacher's and the student's softmax-normalized outputs for the same visual data.
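A minimal NumPy sketch of that objective: cross-entropy between a sharpened, centered teacher distribution and the student distribution. The temperatures and the centering term mirror the paper's recipe, but the function names and default values here are illustrative, not the reference implementation.

```python
import numpy as np

def softmax(x, temp):
    # Temperature-scaled softmax over the last axis (numerically stable).
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_out, teacher_out, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the (centered, sharply-tempered) teacher
    distribution and the student distribution; averaged over the batch.
    In training, no gradient flows through the teacher branch."""
    t = softmax(teacher_out - center, teacher_temp)  # target distribution
    s = softmax(student_out, student_temp)
    return -(t * np.log(s + 1e-12)).sum(axis=-1).mean()
```

When the student's logits agree with the teacher's, the loss is small; a student whose peak lands on a different dimension pays a large log-penalty.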

Self-distillation / self-supervised learning (SSL)

Momentum teacher: the teacher's weights are an exponential moving average (EMA) of the student's weights; no gradients flow to the teacher.

<aside> 💡 Why does the student see both local and global views while the teacher sees only global views?

Data

Local views: small crops covering <50% of the image.

Global views: large crops covering >50% of the image.

All crops are fed to the student, but only the global views are fed to the teacher. This forces the student to match the teacher's global representation even from a small local crop, encouraging "local-to-global" correspondence rather than overfitting to local detail.

</aside>
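The multi-crop routing above can be sketched as follows; `student_encode`, `teacher_encode`, and `loss_fn` are stand-ins for the real networks and the DINO loss, and skipping the identical-view pair follows the paper's scheme:

```python
import numpy as np

def multicrop_loss(global_views, local_views,
                   student_encode, teacher_encode, loss_fn):
    """Teacher encodes only the global crops; the student encodes all
    crops. The loss is averaged over every (teacher, student) crop pair,
    skipping the pair where both branches see the same global view."""
    teacher_out = [teacher_encode(v) for v in global_views]
    student_out = [student_encode(v) for v in global_views + local_views]
    total, n = 0.0, 0
    for ti, t in enumerate(teacher_out):
        for si, s in enumerate(student_out):
            if si == ti:   # same global view on both branches: skip
                continue
            total += loss_fn(s, t)
            n += 1
    return total / n
```

With identity encoders and an L1 "loss" you can check the pairing logic by hand: two global views and one local view yield four contributing pairs.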