Yang Song's blog: https://yang-song.net/blog/2021/score/

Suppose we have a dataset $\{x_1, x_2, \dots, x_N\}$ in which each point is drawn independently from an underlying data distribution $p(x)$, where $x$ is a datapoint and $p(x)$ gives its probability density. The goal is to fit a model to this data distribution.

First, we need a way to represent a probability distribution.

  1. Model the pdf directly. Let $f_\theta(x) \in \mathbb{R}$ be a real-valued function parameterized by learnable parameters $\theta$. We can define a pdf

<aside> 💡 $p_\theta(x) = e^{-f_\theta(x)} / Z_\theta$

</aside>

Where $Z_\theta > 0$ is a normalizing constant (depending on $\theta$) that makes the density integrate to 1. $f_\theta(x)$ is called an unnormalized probabilistic model, or energy-based model: datapoints with lower energy $f_\theta(x)$ are assigned higher probability. The function $f_\theta$ assigns a real scalar value to every $x$ in the input space.
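A minimal sketch of this in Python, assuming a hand-picked 1-D quadratic energy in place of a learned network (the energy function, grid bounds, and resolution are all illustrative choices, not anything from the blog):

```python
import numpy as np

# Hypothetical 1-D energy function f_theta(x): a simple quadratic whose
# learnable parameters are theta = (mu, log_scale). Any scalar-valued
# network would play the same role.
def f_theta(x, theta):
    mu, log_scale = theta
    return 0.5 * ((x - mu) / np.exp(log_scale)) ** 2   # lower energy near mu

def unnormalized_density(x, theta):
    return np.exp(-f_theta(x, theta))                   # e^{-f_theta(x)}

def normalizing_constant(theta, lo=-10.0, hi=10.0, n=10_001):
    # Z_theta = integral of e^{-f_theta(x)} dx over the input space.
    # A Riemann sum works in 1-D; this integral is exactly what becomes
    # intractable in high dimensions.
    grid = np.linspace(lo, hi, n)
    dx = grid[1] - grid[0]
    return np.sum(unnormalized_density(grid, theta)) * dx

theta = (0.0, 0.0)                              # mu = 0, scale = 1
Z = normalizing_constant(theta)
p_half = unnormalized_density(0.5, theta) / Z   # p_theta(0.5) = e^{-f_theta(0.5)} / Z_theta
print(Z, p_half)                                # Z should be close to sqrt(2*pi) ≈ 2.507
```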

Usually we don't know this target distribution ahead of time, so we approximate!

The exponential form is mathematically convenient, particularly when dealing with log-likelihoods: $\log p_\theta(x) = -f_\theta(x) - \log Z_\theta$, and the derivative of $e^x$ is $e^x$, which simplifies many calculations.

We can train $p_\theta(x)$ by maximizing the log-likelihood of the SEEN data.

$$ \max_\theta\sum_{i=1}^N \log p_\theta(x_i) $$

We are finding the parameters that maximize the sum of the log probability densities of all datapoints: by adjusting $\theta$, we make the observed data as likely as possible under the model.
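As a concrete sketch of this objective, here is a fit of a model whose $Z_\theta$ is known in closed form (a Gaussian energy), so the log-likelihood can actually be evaluated and maximized by gradient ascent; the toy data, learning rate, and step count are arbitrary assumptions:

```python
import numpy as np

# Toy "observed" data; the true distribution N(2, 0.5^2) is only for illustration.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)

# Model: f_theta(x) = (x - mu)^2 / (2 sigma^2) with Z_theta = sigma * sqrt(2*pi),
# so log p_theta(x) = -f_theta(x) - log Z_theta is tractable.
def log_likelihood(x, mu, sigma):
    f = (x - mu) ** 2 / (2 * sigma ** 2)
    log_Z = np.log(sigma * np.sqrt(2 * np.pi))
    return np.sum(-f - log_Z)

# Maximize sum_i log p_theta(x_i) by gradient ascent on (mu, log_sigma).
mu, log_sigma = 0.0, 0.0
lr = 1e-2
for _ in range(5000):
    sigma = np.exp(log_sigma)
    grad_mu = np.mean((data - mu) / sigma ** 2)                  # d/d mu of mean log p
    grad_log_sigma = np.mean((data - mu) ** 2 / sigma ** 2 - 1)  # d/d log(sigma)
    mu += lr * grad_mu
    log_sigma += lr * grad_log_sigma

print(mu, np.exp(log_sigma), log_likelihood(data, mu, np.exp(log_sigma)))
# mu and sigma should approach the sample mean (~2.0) and std (~0.5)
```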

The parameterized distribution is meant to explain the observed (training) data. Compare this with the Bayesian approach:

$$ P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{\int_\theta P(D \mid \theta)\, P(\theta)\, d\theta} $$

The denominator averages over EVERY parameter configuration, weighting each by its likelihood and prior.
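A small numerical sketch of that denominator, assuming (purely for illustration) a 1-D parameter $\theta$, a standard normal prior, and a Gaussian likelihood with known scale, evaluated on a grid:

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(loc=1.5, scale=1.0, size=20)      # observed data (illustrative)

theta_grid = np.linspace(-5.0, 5.0, 1001)        # every parameter configuration we consider
d_theta = theta_grid[1] - theta_grid[0]

# log P(D | theta): each datapoint modeled as N(theta, 1)
log_lik = np.sum(
    -0.5 * (D[:, None] - theta_grid) ** 2 - 0.5 * np.log(2 * np.pi), axis=0
)
# log P(theta): standard normal prior over the parameter
log_prior = -0.5 * theta_grid ** 2 - 0.5 * np.log(2 * np.pi)

joint = np.exp(log_lik + log_prior - (log_lik + log_prior).max())  # scaled for stability

# Denominator: the likelihood-and-prior-weighted average over EVERY theta,
# approximated as a sum over the grid. The stability constant above cancels
# in the division, so the posterior is still correctly normalized.
evidence = np.sum(joint) * d_theta
posterior = joint / evidence                     # P(theta | D) on the grid

print(theta_grid[np.argmax(posterior)])          # posterior mode, near the data mean
```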

However, maximizing the likelihood requires $p_\theta(x)$ to be a normalized probability density, which means $Z_\theta$ must be evaluated (or at least differentiated through). Maximum likelihood models must therefore restrict their architectures so that $Z_\theta$ stays tractable, or approximate it.
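A back-of-the-envelope illustration of why $Z_\theta$ is the bottleneck: estimating it on a grid (100 points per axis is an arbitrary assumption) needs a number of energy evaluations that grows exponentially with the data dimension.

```python
import math

points_per_axis = 100
for dim in (1, 2, 3, 10, 784):                   # 784 = a flattened 28x28 image
    exponent = dim * int(math.log10(points_per_axis))
    print(f"dim={dim}: ~10^{exponent} evaluations of f_theta")
```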