https://jmlr.org/papers/volume23/21-0631/21-0631.pdf

TLDR: there is no explicit, direct connection between today's deep-learning algorithms and the objective they are supposed to achieve. Design has mostly been empirical trial and error (how deep should the network be?); principles are lacking. The authors' approach is to view the goal of a deep network as learning a linear discriminative representation (LDR): mapping high-dimensional input data into a lower-dimensional space in such a way that different classes of inputs are easily distinguishable. They argue that this perspective can provide insight into how deep networks function.

how to develop a principled mathematical framework for better understanding and design of deep networks?

A new theoretical framework based on data compression and representation

Go back to investigate the data and how it should influence how neural networks work

a principled objective for a deep network is to learn a low-dimensional linear discriminative representation of the data

The objective function can be evaluated in terms of data compression, i.e., rate reduction.
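A concrete sketch (my own PyTorch, not the authors' code) of the rate-reduction quantity described in the paper: `Z` is a d x n matrix of features, `labels` holds the class assignments, and `eps` is the allowed coding distortion.

```python
import torch

def coding_rate(Z, eps=0.5):
    """R(Z): number of bits needed to encode the features Z (d x n)
    up to distortion eps -- a measure of the volume the features span."""
    d, n = Z.shape
    I = torch.eye(d, dtype=Z.dtype)
    return 0.5 * torch.logdet(I + (d / (n * eps**2)) * Z @ Z.T)

def rate_reduction(Z, labels, eps=0.5):
    """Delta R = R(Z) - sum_j (n_j / n) * R(Z_j): expand the volume of the
    whole feature set while compressing each class -- the quantity a deep
    network should maximize under the LDR view."""
    d, n = Z.shape
    whole = coding_rate(Z, eps)
    per_class = 0.0
    for c in labels.unique():
        Zc = Z[:, labels == c]
        per_class = per_class + (Zc.shape[1] / n) * coding_rate(Zc, eps)
    return whole - per_class
```

Maximizing this difference spreads the features of the whole dataset apart while packing the features of each class together, which is exactly the "easily distinguishable classes" goal stated above.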

Traditional deep learning: learning by minimizing the cross-entropy loss

2 serious limitations

  1. It only aims to predict the labels y, even if they are mislabeled
  2. It gives no direct way to interpret the intermediate features learned by the network. The intrinsic structure of the data is not preserved; features exhibit a neural collapse phenomenon in which within-class structural information is discarded, losing much of the representation and often obscuring interpretability

One way to interpret the network is to view its output as a latent, low-dimensional feature that is highly discriminative (sufficient to decide which class or category the input belongs to). This feature is then usually sent through a normalization (softmax) layer to determine the label y via a probability distribution. The information bottleneck view says the network should learn latent features that retain only a minimal summary of the data, just enough to predict y.
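A minimal sketch of that last step in PyTorch (hypothetical shapes, standard library calls): the latent feature z is mapped to logits by a final linear head, softmax normalizes the logits into a distribution over labels, and cross-entropy compares it against y.

```python
import torch
import torch.nn.functional as F

z = torch.randn(4, 16)                 # batch of 4 latent features from the network
head = torch.nn.Linear(16, 3)          # final linear layer over 3 classes
logits = head(z)
probs = F.softmax(logits, dim=1)       # normalized probability distribution over labels
y = torch.tensor([0, 2, 1, 0])         # (possibly noisy) ground-truth labels
loss = F.cross_entropy(logits, y)      # traditional cross-entropy objective
```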

These networks can decay due to [[Interpretability and Alignment Library#^482740|neural collapse]]:

- Reduced transferability: because the learned features are so specific to the original task, they might not be useful for different tasks. If you wanted to use the same network to identify different types of apples, it might perform poorly because it hasn't learned to attend to the relevant features.
- Sacrificed robustness: if the labels (e.g., "apple" or "orange") used during training are corrupted or noisy, the network still latches onto the same, now incorrect, features. It is not robust to errors or changes in the labeling process.

Reconciling contractive and contrastive learning.

Auto-encoding is a good way to learn latent representations: z is the latent representation obtained by encoding the raw input x, and the learning process is guided by heuristics that encourage certain properties in z.

The contractive auto-encoder (Rifai et al., 2011) penalizes the norm of the encoder's Jacobian, contracting the representation locally so it does not expand along input directions that are irrelevant to the data.
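A minimal sketch of that penalty, assuming a single sigmoid encoder layer as in Rifai et al. (2011); the class and function names here are mine, not from the paper.

```python
import torch
import torch.nn as nn

class ContractiveAE(nn.Module):
    """One-hidden-layer auto-encoder with a sigmoid encoder."""
    def __init__(self, d_in, d_latent):
        super().__init__()
        self.enc = nn.Linear(d_in, d_latent)
        self.dec = nn.Linear(d_latent, d_in)

    def forward(self, x):
        h = torch.sigmoid(self.enc(x))        # latent representation z
        return self.dec(h), h

def contractive_loss(model, x, lam=1e-3):
    x_hat, h = model(x)
    recon = ((x_hat - x) ** 2).sum(dim=1).mean()
    # For a sigmoid encoder the Jacobian is diag(h * (1 - h)) @ W, so its
    # squared Frobenius norm factorizes into the product below.
    W = model.enc.weight                          # (d_latent, d_in)
    jac_frob2 = ((h * (1 - h)) ** 2 @ (W ** 2).sum(dim=1)).mean()
    return recon + lam * jac_frob2                # penalize local expansion
```

Penalizing the Jacobian makes z insensitive to small perturbations of x, so only directions the data actually varies along survive in the latent code.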

Real data are often complicated by multi-modal low-dimensional structure: the features are low-dimensional, but the samples fall into different groups/modes that each need to be meaningfully represented. Naive heuristics can fail here (e.g., too few latent dimensions to remember some other mode), and simpler methods won't capture the finer-grained detail that determines which category a sample belongs to.
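A toy illustration of that situation (assumed setup, not from the paper): samples drawn from two different one-dimensional subspaces embedded in a ten-dimensional ambient space.

```python
import torch

d, n_per_mode = 10, 100
u1 = torch.randn(d); u1 = u1 / u1.norm()    # direction spanning mode 1
u2 = torch.randn(d); u2 = u2 / u2.norm()    # direction spanning mode 2
X1 = torch.randn(n_per_mode, 1) * u1        # samples lying along mode 1
X2 = torch.randn(n_per_mode, 1) * u2        # samples lying along mode 2
X = torch.cat([X1, X2])                     # ambient dim 10, intrinsic dim 1 per mode
# A single 1-D latent code cannot represent both modes faithfully at once;
# this is where naive auto-encoding heuristics start to break down.
```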

Mixed-modal structures may emerge, where a single representation ends up mixing different groups; the latent dimension may need to be increased to separate them.

Mode collapse: a common problem with GANs where the generator produces only a limited variety of samples, or even the same sample, regardless of the input noise vector. Essentially, different 'modes' (peaks in the probability distribution of the data) collapse into a single mode. For example, if a GAN trained to generate images of digits only ever generates 8s, ignoring the other digits (0-7, 9), that is mode collapse.