https://www.youtube.com/watch?v=MUvFuZpxLU8&list=PLgKuh-lKre12qVTl88k2n2N37tT-BpmHT&index=7
Do sampling and the model determine the functional forms? For example, are the exponents for data scaling and model scaling ever the same?
- The functional form is the shape of the equation relating the model to the data; it governs how well the model can approximate the sampled data.
- Different equations give different functional forms, e.g. linear or quadratic.
- We select a functional form based on our assumptions about the data.
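As a rough illustration of the linear and quadratic forms mentioned above, the sketch below fits both to the same data and compares the residuals. The data is synthetic and invented for this example (generated from a quadratic), and `fit_residual` is a hypothetical helper name, not something from the lecture.

```python
import numpy as np

# Hypothetical, noiseless data generated from a quadratic (illustration only).
x = np.linspace(-2.0, 2.0, 21)
y = 1.0 + 0.5 * x + 2.0 * x**2


def fit_residual(x, y, degree):
    """Least-squares polynomial fit of the given degree.

    Returns the sum of squared residuals between the fit and the data.
    """
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    return float(np.sum((y - pred) ** 2))


linear_sse = fit_residual(x, y, 1)      # linear functional form
quadratic_sse = fit_residual(x, y, 2)   # quadratic functional form

print(f"linear SSE:    {linear_sse:.4f}")
print(f"quadratic SSE: {quadratic_sse:.2e}")
```

Because the data really is quadratic, the quadratic form fits it almost exactly while the linear form leaves a large residual; with real data, the choice rests on the assumptions made about it.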
Are the scaling regimes the same?
- Is there a taxonomy of the different kinds of scaling (compute, parameters, etc.), in which different mechanistic origins fall into different classes?
Is there universal, generalizable behavior, or is the whole problem too dependent on microscopic details?
Approach: build and test a simple theory
We want to learn a model with parameters θ from data drawn from the distribution p(x, y)
- classic supervised setting
The loss is a function of the amount of data and the number of model parameters
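A minimal sketch of how the exponent question above could be checked numerically. It assumes a pure power-law form, loss = coeff * n^(-exponent), which these notes do not state; the coefficients, exponents, and sizes are invented for illustration, and `power_law_loss` and `estimate_exponent` are hypothetical helper names.

```python
import numpy as np


def power_law_loss(n, coeff, exponent):
    """Assumed illustrative form: loss = coeff * n ** (-exponent)."""
    return coeff * n ** (-exponent)


def estimate_exponent(n, loss):
    """Estimate the exponent as the negated slope of log(loss) vs. log(n)."""
    slope, _intercept = np.polyfit(np.log(n), np.log(loss), 1)
    return -slope


# Hypothetical sizes, reused for both the data sweep and the model sweep.
sizes = np.array([1e3, 1e4, 1e5, 1e6])

# Made-up exponents: 0.3 for data scaling, 0.5 for model scaling.
data_loss = power_law_loss(sizes, coeff=5.0, exponent=0.3)
model_loss = power_law_loss(sizes, coeff=2.0, exponent=0.5)

data_exp = estimate_exponent(sizes, data_loss)
model_exp = estimate_exponent(sizes, model_loss)

print(f"data exponent:  {data_exp:.3f}")
print(f"model exponent: {model_exp:.3f}")
```

On a log-log plot a power law is a straight line, so a linear fit recovers the exponent; comparing the two estimates is one way to ask whether data and model scaling share an exponent.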
Architectures
- depth (number of layers): L
- width (parameters): N