https://www.youtube.com/watch?v=AKMuA_TVz3A&list=PLgKuh-lKre12qVTl88k2n2N37tT-BpmHT&index=4

Likelihood is the probability of an observed data point given the model's weights and parameters.

Likelihood is a function of the parameters; higher likelihood indicates that the model compresses the data better.

The goal is to find the parameters that maximize the likelihood of the observed data, effectively finding the most efficient compression.
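Making the compression link concrete (standard maximum-likelihood identities, not from the talk's slides):

```latex
L(\theta) = \prod_i p(x_i \mid \theta),
\qquad
-\log_2 L(\theta) = \sum_i -\log_2 p(x_i \mid \theta)
```

The right-hand side is the number of bits needed to encode the data under the model (achievable up to rounding with arithmetic coding), so maximizing likelihood is the same as minimizing total code length.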

GPT

A theory of GPT's behavior can be given in terms of the distribution of text and its repeated patterns; the goal is to explain these models without alluding to that distribution.

Vision is a good domain for this: unsupervised learning works directly on pixels.

iGPT does next-pixel prediction, applying the same transformer/autoregressive recipe as GPT to image pixels.
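A minimal sketch of that objective, assuming a placeholder model that returns per-position logits and images pre-quantized so each pixel is a token in a small palette (this is not the actual iGPT code):

```python
import torch.nn.functional as F

def next_pixel_loss(model, images):
    # images: (batch, seq_len) integer pixel tokens in raster-scan order
    inputs, targets = images[:, :-1], images[:, 1:]
    logits = model(inputs)                    # (batch, seq_len-1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and positions
        targets.reshape(-1),
    )
```

Minimizing this cross-entropy is exactly minimizing the code length of the image, pixel by pixel.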

Linear Representations

The compression theory does not explain why the learned representations are clean, separable, and linear.
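"Clean, separable, linear" is usually operationalized with a linear probe: freeze the pretrained model, extract features, and fit only a linear classifier on top. A minimal sketch with scikit-learn and hypothetical feature arrays:

```python
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(feats_train, y_train, feats_test, y_test):
    # feats_*: (n_samples, feature_dim) activations from the frozen model
    clf = LogisticRegression(max_iter=1000)   # linear classifier only
    clf.fit(feats_train, y_train)
    return clf.score(feats_test, y_test)      # high score => linearly separable
```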

Next-token autoregressive models yield better linear representations than BERT-style masked models.

(Figure: BERT vs. autoregressive comparison; blue curve = BERT.)
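For contrast with the next-pixel loss above, a minimal sketch of a BERT-style masked objective (placeholder model and mask_id; simplified, omitting BERT's 80/10/10 corruption scheme):

```python
import torch
import torch.nn.functional as F

def bert_masked_loss(model, tokens, mask_id, mask_prob=0.15):
    # Replace a random subset of tokens with a mask token and predict
    # only those positions, using bidirectional context.
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_prob
    corrupted = tokens.masked_fill(mask, mask_id)
    logits = model(corrupted)                 # (batch, seq_len, vocab)
    return F.cross_entropy(logits[mask], tokens[mask])
```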