https://www.youtube.com/watch?v=AKMuA_TVz3A&list=PLgKuh-lKre12qVTl88k2n2N37tT-BpmHT&index=4
Likelihood is the probability the model assigns to the observed data, viewed as a function of the weights/parameters.
Higher likelihood means a shorter code length for the data, so it is the signal that compression is working.
The goal is to find the parameters that maximize the likelihood of the observed data, which is equivalent to finding the most efficient compression of it.
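A minimal numpy sketch of that equivalence (the Bernoulli toy model and toy data here are assumptions for illustration): under arithmetic coding, data assigned probability p costs about -log2 p bits, so the maximum-likelihood parameter is exactly the one that compresses the observed data best.

```python
import numpy as np

data = np.array([0, 0, 1, 0, 1, 1, 0, 0])  # toy binary "dataset"

def bits_to_encode(theta: float, x: np.ndarray) -> float:
    """Total code length in bits for x under a Bernoulli(theta) model."""
    log_lik = np.sum(x * np.log2(theta) + (1 - x) * np.log2(1 - theta))
    return -log_lik  # -log2(likelihood) = code length in bits

for theta in [0.1, 0.375, 0.9]:  # 0.375 = empirical mean = MLE
    print(f"theta={theta:.3f}: {bits_to_encode(theta, data):.2f} bits")
# The maximum-likelihood estimate gives the shortest code, i.e. the
# best compression of the observed data.
```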
GPT's behavior is often explained by appealing to the distribution of text and its repeated patterns; the compression theory explains GPT models without alluding to that distribution.
Vision is a good domain to test this: unsupervised learning works directly on raw pixels, with no text-specific story needed.
iGPT does next-pixel prediction: a transformer trained autoregressively on flattened pixel sequences, the same recipe as next-token prediction in GPT.
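A minimal sketch of that recipe, assuming PyTorch; the tiny model, 8x8 images, and 16-level pixel quantization are illustrative stand-ins, not the paper's settings. The point is only that the loss is ordinary next-token cross-entropy applied to a pixel sequence.

```python
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, D = 16, 64, 64  # quantized pixel values, 8x8 image, model width

class TinyPixelGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.pos = nn.Parameter(torch.zeros(SEQ_LEN, D))
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, seq):  # seq: (batch, length) of int pixel values
        # Causal mask: each position may only attend to earlier pixels.
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        h = self.blocks(self.embed(seq) + self.pos[: seq.size(1)], mask=mask)
        return self.head(h)  # logits over the next pixel value

model = TinyPixelGPT()
pixels = torch.randint(0, VOCAB, (8, SEQ_LEN))  # stand-in for quantized images
logits = model(pixels[:, :-1])                  # predict pixel t from pixels < t
loss = nn.functional.cross_entropy(             # next-pixel log-likelihood
    logits.reshape(-1, VOCAB), pixels[:, 1:].reshape(-1))
loss.backward()
```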
Compression theory does not explain why the learned representations are clean and linearly separable.
Autoregressive next-step-prediction models yield better (linear-probe) representations than BERT.
(Slide plot: linear-probe comparison, BERT vs autoregressive; blue = BERT.)
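That comparison comes from linear-probe evaluation. A sketch assuming scikit-learn, where `extract_features` is a hypothetical stand-in for running the frozen pretrained encoder (random features here): only the linear classifier is trained, so probe accuracy directly measures how linearly separable the representations are.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def extract_features(images):
    # Hypothetical: would run the frozen iGPT/BERT encoder; random here.
    return rng.normal(size=(len(images), 128))

train_imgs, train_y = list(range(500)), rng.integers(0, 10, 500)
test_imgs, test_y = list(range(100)), rng.integers(0, 10, 100)

probe = LogisticRegression(max_iter=1000)  # the only trained component
probe.fit(extract_features(train_imgs), train_y)
acc = probe.score(extract_features(test_imgs), test_y)
print(f"linear probe accuracy: {acc:.3f}")  # ~chance here, since features are random
```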