https://arxiv.org/pdf/2301.08243.pdf
The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block
- Target blocks:
- context blocks: A selected region of the image that serves as the basis for making predictions about other parts (target blocks) of the same image.
- We can infer neighobring images from the selected block of image
- highly semantic: High-level features that capture the meaning or content in the image, not just colors or edges.
- sampel target blocks have great semantic value (able to relate with many other parts of the image)
- Spatially distributed: information that is spaced throughout the image, providing robust global representation of an image.
- Masking Strategy: The method used for selectively ignoring certain parts of the data during training, which helps the model focus on relevant portions.
- Masking lets you pick the ones that are important to look at.
Introduction
self supervised has two methods
- invariance encoding
- Take two views of same image and compare them with high semantic value using encoder
- Produces iimages of high semantic value, exceeds objective detail of an image.
- generative
- remove or corrupt learn to predict corrupted content
- uses masking to patch the holes in an image, uses this task to train.
- This allows the model to generalize and remove any spurrious detail inside a mask
<aside>
💡 explore how to improve the semantic level of self-supervised representations without using extra prior knowledge encoded through image transformations.
</aside>
Given single context block in abstract representation space, predict the target blocks around it. where target representations are learned in encoder network.
The neural network architecture employs an encoder, which transforms these raw data points into a more abstract form.
- This makes the features more comparable and easier to manipulate.