Do Machine Learning Models Memorize or Generalize?

https://pair.withgoogle.com/explorables/grokking/

Frequency measure

[Screenshot: neuron activation value plotted against input value]

y axis: the activation value. The activation value of a neuron is the result of applying an activation function (e.g., ReLU, sigmoid) to the weighted sum of its inputs plus a bias term. It's the "output" of the neuron that gets passed to the next layer.

x axis: the input value (taken modulo 67, so it ranges from 0 to 66)
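To make the y-axis definition concrete, here is a minimal sketch of how one neuron's activation value is computed. The function names and the numbers are made up for illustration and are not taken from the explorable.

```python
import math

def relu(z):
    # ReLU passes positive values through and clips negatives to zero.
    return max(0.0, z)

def sigmoid(z):
    # Sigmoid squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def neuron_activation(inputs, weights, bias, activation=relu):
    """Apply `activation` to the weighted sum of the inputs plus a bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative numbers only (assumed, not from the article).
print(neuron_activation([1.0, 2.0], [0.5, -1.0], bias=0.25))  # weighted sum is -1.25, ReLU gives 0.0
print(neuron_activation([1.0, 2.0], [0.5, 0.5], bias=0.25))   # weighted sum is 1.75, ReLU gives 1.75
```

Plotting this value for every input from 0 to 66 gives one curve like those in the screenshot.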

We are measuring the activation pattern.

Activation Pattern: This refers to the behavior of these activation values over time, across different inputs, or during the training process. For example, a neuron might consistently activate strongly for certain types of inputs and not at all for others, or it might oscillate between high and low activation values in a cyclical manner.

Each graph is at a certain frequency

If the activation pattern is cyclical and repeats "n" times over the input range, this frequency shows up as "n" peaks in the graph.
EX: Frequency 4 for a neuron means there are 4 peaks
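The link between frequency and number of peaks can be checked with a small sketch. It assumes a hypothetical neuron whose activation is a pure cosine over the inputs 0 to 66. Real neurons in the explorable are noisier, so treat this as an idealized case.

```python
import math

P = 67  # modulus of the input, so inputs run from 0 to 66

def synthetic_activation(freq, p=P):
    """Hypothetical neuron: a cosine that completes `freq` cycles over inputs 0..p-1."""
    return [math.cos(2 * math.pi * freq * x / p) for x in range(p)]

def count_peaks(values):
    """Count local maxima, treating the input axis as circular (66 wraps around to 0)."""
    n = len(values)
    return sum(
        1
        for i in range(n)
        if values[i] > values[i - 1] and values[i] >= values[(i + 1) % n]
    )

def dominant_frequency(values):
    """Return the frequency k in 1..n//2 with the largest discrete Fourier magnitude."""
    n = len(values)

    def magnitude(k):
        re = sum(v * math.cos(2 * math.pi * k * x / n) for x, v in enumerate(values))
        im = sum(v * math.sin(2 * math.pi * k * x / n) for x, v in enumerate(values))
        return math.hypot(re, im)

    return max(range(1, n // 2 + 1), key=magnitude)

acts = synthetic_activation(freq=4)
print(count_peaks(acts), dominant_frequency(acts))  # prints: 4 4
```

Counting peaks works for clean curves. The Fourier-based estimate is likely the more robust choice once the activations are noisy.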

TAKEAWAY:

The more disassociated the neuron frequencies are, the more effort the model is putting into memorizing every detail of the training data
As training steps go on, the model creates a mathematical structure that generalizes the

Generalizing 0s and 1s

Start with knowing the generalized solution, and try to understand why the model eventually learns it.

Start with a problem: given a binary sequence of 30 digits, return 1 if the first three digits contain an odd number of 1s, and 0 otherwise.
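The task can be written down directly. This is a minimal sketch of the target function and a random example generator. The names `label` and `make_example` are chosen here for illustration, and the seed is arbitrary.

```python
import random

SEQ_LEN = 30   # total digits in each sequence
RELEVANT = 3   # only the first three digits determine the label

def label(bits):
    """Return 1 if the first three digits contain an odd number of 1s, else 0."""
    return sum(bits[:RELEVANT]) % 2

def make_example(rng):
    """Draw a random 30-digit binary sequence together with its label."""
    bits = [rng.randint(0, 1) for _ in range(SEQ_LEN)]
    return bits, label(bits)

rng = random.Random(0)  # arbitrary seed, for repeatability only
for _ in range(3):
    bits, y = make_example(rng)
    print("".join(map(str, bits)), "->", y)
```

The remaining 27 digits never affect the label. A generalizing model has to learn to ignore them, while a memorizing model can latch onto them.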

Reaching the generalizing solution comes down to two factors: