Dictionaries (img=224, dim=768)
D^T * D * x: sends x through the dictionary and back (reconstruct with D, then map back into x's dimension with D^T).
https://github.com/Ma-Lab-Berkeley/CRATE/blob/787bcb0c339c49928ecc18b39473ad02b27f90f1/model/crate.py#L31
D^T * x: Converts x into a (rough) sparse representation by projecting it onto the dictionary atoms.
https://github.com/Ma-Lab-Berkeley/CRATE/blob/787bcb0c339c49928ecc18b39473ad02b27f90f1/model/crate.py#L34
Take the difference: D^T x − D^T D x = D^T (x − D x).

It's essentially measuring the difference between the original data x and its reconstruction D x, mapped back into the code space by D^T.

So it's basically a reconstruction exercise: the residual checks that the dictionary can convert x back into (something close to) its original form with minimal information loss. That residual is the update signal: over many batches of training data it nudges D until we end up with a robust dictionary, one that converts x into a sparse representation while minimizing information loss (see the sketch below).
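Here's a minimal sketch of that update step, paraphrasing the linked forward pass (the class name `ISTABlock` and the `step_size`/`lambd` defaults are my assumptions; see the repo for the exact code):

```python
import torch
import torch.nn.functional as F
from torch import nn

class ISTABlock(nn.Module):
    # One ISTA-style step over the token dimension (dim=768 above); the
    # class name and the step_size/lambd defaults here are assumptions.
    def __init__(self, dim=768, step_size=0.1, lambd=0.1):
        super().__init__()
        # D: a learned square dictionary acting on each token
        self.weight = nn.Parameter(torch.empty(dim, dim))
        nn.init.kaiming_uniform_(self.weight)
        self.step_size = step_size
        self.lambd = lambd

    def forward(self, x):
        # D^T * D * x: through the dictionary and back (cf. crate.py L31)
        grad_1 = F.linear(F.linear(x, self.weight), self.weight.t())
        # D^T * x: project x onto the dictionary atoms (cf. crate.py L34)
        grad_2 = F.linear(x, self.weight.t())
        # step_size * D^T (x - D x) is the negative LASSO gradient; subtracting
        # step_size * lambd and applying ReLU acts as a one-sided soft-threshold
        update = self.step_size * (grad_2 - grad_1) - self.step_size * self.lambd
        return F.relu(x + update)

# usage: a batch of 196 patch tokens (14 x 14 grid) with dim=768
out = ISTABlock()(torch.randn(2, 196, 768))
```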
In the context of ISTA and LASSO, the dictionary D is a matrix that helps transform the original data x into a sparse representation. The goal is to find a representation of x that is as sparse as possible while still being accurate. This is particularly useful in applications like compressed sensing, image reconstruction, and feature selection, where you want to represent data using fewer dimensions or features.
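To make the ISTA/LASSO picture concrete, here's a toy, hypothetical example (not from the CRATE repo) that fits a sparse code z for a signal x = D z by alternating a gradient step with soft-thresholding:

```python
import torch

def soft_threshold(v: torch.Tensor, t) -> torch.Tensor:
    # prox operator of t * ||v||_1: shrink each coordinate toward zero by t
    return torch.sign(v) * torch.clamp(v.abs() - t, min=0.0)

def ista(x: torch.Tensor, D: torch.Tensor, lambd=0.1, n_iter=200) -> torch.Tensor:
    # minimize 1/2 * ||x - D z||^2 + lambd * ||z||_1 over the code z
    step = 1.0 / torch.linalg.matrix_norm(D, ord=2) ** 2  # 1/L, L = Lipschitz const.
    z = torch.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.t() @ (D @ z - x)                 # gradient of the smooth term
        z = soft_threshold(z - step * grad, step * lambd)
    return z

# toy usage: x built from a 3-sparse code; ista() should recover something close
torch.manual_seed(0)
D = torch.randn(64, 256) / 8.0
z_true = torch.zeros(256)
z_true[[3, 42, 100]] = torch.tensor([1.0, -2.0, 0.5])
x = D @ z_true
z_hat = ista(x, D, lambd=0.01)
print((z_hat.abs() > 1e-3).sum().item(), "nonzero coefficients")
```

The ReLU in the CRATE-style block above plays the role of this soft-threshold, just one-sided.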
Attention maps
<aside> 💡 Our results suggest that the CRATE model encodes a clear semantic segmentation of each image […]
</aside>
The self-attention map essentially tells us how each "token" (or patch, in the case of images) in the input sequence is attending to every other token. It's a measure of the relationships between different parts of the image. For example, if a particular patch is strongly attending to another patch, it means that these two patches are in some way "related" according to the model. This relationship could be semantic, like two patches both being part of the same object, or it could be based on some other learned criterion.
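As a sketch of how one might read such a map off a trained model, assuming a ViT/CRATE-style layout (a [CLS] token at index 0, 16×16 patches of a 224×224 image; all of these layout details are assumptions):

```python
import torch

def cls_attention_heatmaps(attn: torch.Tensor, img_size=224, patch_size=16):
    # attn: (batch, heads, 1 + n_patches, 1 + n_patches), already softmaxed,
    # with the [CLS] token at index 0 (an assumed ViT/CRATE-style layout)
    grid = img_size // patch_size                # 14 patches per side for 224/16
    cls_to_patches = attn[:, :, 0, 1:]           # how [CLS] attends to each patch
    return cls_to_patches.reshape(attn.shape[0], attn.shape[1], grid, grid)

# toy usage with random "attention" of the right shape (batch=1, heads=12)
n_tokens = 1 + (224 // 16) ** 2
attn = torch.softmax(torch.randn(1, 12, n_tokens, n_tokens), dim=-1)
print(cls_attention_heatmaps(attn).shape)        # torch.Size([1, 12, 14, 14])
```

Each 14×14 heatmap can then be upsampled and overlaid on the image to see which regions a head treats as related.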