https://arxiv.org/pdf/2301.08243.pdf


The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block.
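The masking strategy can be sketched on a grid of image patches. This is a minimal sketch, not the authors' implementation: the function names, the 14x14 grid, the number of target blocks, and the scale and aspect-ratio ranges are all assumptions chosen for illustration. It also assumes that patches covered by a target block are removed from the context block, so the prediction task is not trivial.

```python
import math
import random


def sample_block(grid, scale_range, aspect_range, rng):
    """Sample a rectangular block of patches; returns a set of (row, col)."""
    scale = rng.uniform(*scale_range)    # fraction of the image covered
    aspect = rng.uniform(*aspect_range)  # height / width ratio
    area = scale * grid * grid
    h = max(1, min(grid, round(math.sqrt(area * aspect))))
    w = max(1, min(grid, round(math.sqrt(area / aspect))))
    top = rng.randint(0, grid - h)
    left = rng.randint(0, grid - w)
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}


def sample_masks(grid=14, num_targets=4, seed=0):
    """Return (context, targets) as sets of patch coordinates.

    All numeric ranges below are illustrative assumptions.
    """
    rng = random.Random(seed)
    # (a) target blocks with a fairly large scale, so they carry semantic content
    targets = [
        sample_block(grid, (0.15, 0.2), (0.75, 1.5), rng)
        for _ in range(num_targets)
    ]
    # (b) one large, spatially spread context block
    context = sample_block(grid, (0.85, 1.0), (1.0, 1.0), rng)
    # Assumption: drop target patches from the context to avoid leaking them
    for target in targets:
        context -= target
    return context, targets


context, targets = sample_masks()
```

The target blocks may overlap each other in this sketch; only the context is kept disjoint from them.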

Introduction

Self-supervised learning from images has two common families of approaches: invariance-based methods and generative methods.

<aside> 💡 Goal: explore how to improve the semantic level of self-supervised representations without using extra prior knowledge encoded through image transformations.

</aside>

Given a single context block, predict the representations of the target blocks around it in abstract representation space. The target representations are themselves produced by a learned encoder network.

The architecture employs an encoder, which transforms the raw input (image patches) into a more abstract representation.
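The prediction-in-representation-space idea can be illustrated with a toy example. This is a rough sketch under stated assumptions, not the paper's model: single linear layers stand in for the encoders and the predictor, the predictor ignores patch positions, and all names (`encode`, `jepa_loss`, `ema_update`) and sizes are hypothetical. The target encoder is assumed to be an exponential moving average (EMA) of the context encoder, with a momentum value picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, patch_dim, embed_dim = 16, 12, 8  # toy sizes (assumptions)

# Toy linear "encoders"; real models would be much larger networks.
W_context = rng.normal(scale=0.1, size=(patch_dim, embed_dim))
W_target = W_context.copy()  # target encoder starts as a copy
W_pred = rng.normal(scale=0.1, size=(embed_dim, embed_dim))


def encode(patches, W):
    """Map raw patches to abstract representations."""
    return np.tanh(patches @ W)


def jepa_loss(patches, context_idx, target_idx):
    """Mean squared L2 distance between predicted and target representations."""
    context_repr = encode(patches[context_idx], W_context)
    # Targets come from the target encoder; no gradient would flow here.
    target_repr = encode(patches, W_target)[target_idx]
    # Toy predictor: pool the context, then emit one prediction per target patch.
    pooled = context_repr.mean(axis=0)
    predicted = np.tile(pooled @ W_pred, (len(target_idx), 1))
    return float(np.mean(np.sum((predicted - target_repr) ** 2, axis=1)))


def ema_update(W_t, W_c, momentum=0.996):
    """Move the target encoder slowly toward the context encoder."""
    return momentum * W_t + (1.0 - momentum) * W_c


patches = rng.normal(size=(num_patches, patch_dim))
loss = jepa_loss(patches, context_idx=[0, 1, 2, 3, 4, 5], target_idx=[10, 11, 14, 15])
W_target = ema_update(W_target, W_context)
```

The key point is that the loss compares representations, not pixels: nothing in `jepa_loss` reconstructs the raw patch values.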