Self-supervised network that learns from unlabeled image data; its self-attention maps can be used for object segmentation.

DINO (self-DIstillation with NO labels) is a self-supervised learning algorithm. Although it uses a teacher-student setup, it differs from traditional supervised learning (and from standard knowledge distillation) because it doesn't require any human-annotated labels.

The teacher model produces representations (embeddings) of the input images. These are not human-annotated labels but targets generated from the teacher model's current knowledge.

DINO does not predict any predefined labels. Instead, it trains the student network to produce embeddings similar to the teacher's when both are fed the same input. Concretely, the objective minimizes the cross-entropy between the teacher's and the student's softmax-normalized outputs for the same visual data.
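A minimal NumPy sketch of that objective: cross-entropy between a sharpened, centered teacher distribution and the student distribution. The temperatures and the centering term mirror the paper's recipe, but the function names and default values here are illustrative, not the reference implementation.

```python
import numpy as np

def softmax(x, temp):
    # Temperature-scaled softmax over the last axis (numerically stable).
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_out, teacher_out, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the (centered, sharply-tempered) teacher
    distribution and the student distribution; averaged over the batch.
    In training, no gradient flows through the teacher branch."""
    t = softmax(teacher_out - center, teacher_temp)  # target distribution
    s = softmax(student_out, student_temp)
    return -(t * np.log(s + 1e-12)).sum(axis=-1).mean()
```

When the student's logits agree with the teacher's, the loss is small; a student whose peak lands on a different dimension pays a large log-penalty.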

Self-distillation / self-supervised learning (SSL)

Momentum teacher: the teacher's weights are an exponential moving average (EMA) of the student's weights; no gradients flow to the teacher.

<aside> 💡 Why does the student see both local and global views while the teacher sees only global views?

Data

Local views: small crops covering <50% of the image.

Global views: large crops covering >50% of the image.

All crops are fed to the student, but only the global views are fed to the teacher. This forces the student to match the teacher's global representation even from a small local crop, encouraging "local-to-global" correspondence rather than overfitting to local detail.

</aside>
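The multi-crop routing above can be sketched as follows; `student_encode`, `teacher_encode`, and `loss_fn` are stand-ins for the real networks and the DINO loss, and skipping the identical-view pair follows the paper's scheme:

```python
import numpy as np

def multicrop_loss(global_views, local_views,
                   student_encode, teacher_encode, loss_fn):
    """Teacher encodes only the global crops; the student encodes all
    crops. The loss is averaged over every (teacher, student) crop pair,
    skipping the pair where both branches see the same global view."""
    teacher_out = [teacher_encode(v) for v in global_views]
    student_out = [student_encode(v) for v in global_views + local_views]
    total, n = 0.0, 0
    for ti, t in enumerate(teacher_out):
        for si, s in enumerate(student_out):
            if si == ti:   # same global view on both branches: skip
                continue
            total += loss_fn(s, t)
            n += 1
    return total / n
```

With identity encoders and an L1 "loss" you can check the pairing logic by hand: two global views and one local view yield four contributing pairs.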