Differences
- Input Representation:
- MLP: Typically, text would be represented using bag-of-words (BoW) or TF-IDF. Each input would be a fixed-size vector representing word occurrences or word importance. Contextual relationships between words were largely ignored.
- Transformers: Use embeddings (like word embeddings or subword embeddings) and also consider the order of the words using positional embeddings. This allows the model to understand both the meaning of individual words and their order (see the sketch after this list).
- Architecture:
- MLP: Made up of fully connected layers where each neuron in a layer connects to every neuron in the next layer. There's no inherent sense of order or sequence in MLPs.
- Transformers: Utilize self-attention mechanisms to weigh the importance of different words in a sequence relative to a particular word. This allows them to handle variable-length sequences and capture long-range dependencies in the data.
- Context Understanding:
- MLP: Due to their architecture and the common input representations (like BoW), MLPs lack the ability to understand context across sequences of words.
- Transformers: Designed specifically to handle sequences and context. The self-attention mechanism allows them to focus on different parts of the input text to understand context, making them especially suited for tasks like translation where the relationship between words can be crucial.
- Training and Complexity:
- MLP: Training is straightforward using backpropagation. However, as the network depth increases, they are prone to issues like vanishing or exploding gradients.
- Transformers: Training is more complex due to the attention mechanisms, leading to increased computational demands. Techniques like layer normalization are integral to stabilize training.
- Flexibility and Adaptability:
- MLP: Primarily suited for fixed-size input and output. For sequential data, Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks were used.
- Transformers: Highly flexible and have been adapted for a variety of tasks in NLP, from translation to summarization to classification. Their architecture inherently handles sequences, making them a go-to for many NLP challenges.
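As a rough illustration of the input-representation difference above, here is a minimal PyTorch sketch; the vocabulary size, context length, and embedding width are made-up values, not taken from any particular model:

```python
import torch

vocab_size, block_size, emb_dim = 27, 8, 16   # hypothetical sizes

# One example sequence of token ids.
tokens = torch.tensor([3, 7, 7, 1])

# MLP-era style: a fixed-size bag-of-words count vector; word order is lost.
bow = torch.bincount(tokens, minlength=vocab_size).float()   # shape: (27,)

# Transformer style: one embedding per token plus a positional embedding,
# so both word identity and word order are represented.
tok_emb = torch.nn.Embedding(vocab_size, emb_dim)
pos_emb = torch.nn.Embedding(block_size, emb_dim)
x = tok_emb(tokens) + pos_emb(torch.arange(len(tokens)))     # shape: (4, 16)
```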
Code
The crucial difference shows up in the MLP code:
- The line `embcat = emb.view(emb.shape[0], -1)` reshapes the embedded tokens, effectively concatenating them. This is how your model takes in a sequence of embeddings (see the sketch after these bullets).
- By doing this, the model treats each position in the sequence as a separate feature. This means that while the embeddings capture the individual meanings of the tokens, the model does not inherently understand the sequential nature or relationships between them.
- In comparison to architectures like RNNs or Transformers, this method lacks the ability to capture longer-range dependencies or the true sequential context, as the order of tokens is flattened into a single vector.
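To make the flattening concrete, here is a minimal sketch; the batch size, context length, and embedding width are assumed values:

```python
import torch

batch, block_size, emb_dim = 32, 3, 10            # hypothetical sizes
emb = torch.randn(batch, block_size, emb_dim)     # one embedding vector per position

# Concatenate the per-position embeddings into one long feature vector per example:
# (32, 3, 10) -> (32, 30). Token order survives only as "which slice of the vector".
embcat = emb.view(emb.shape[0], -1)
print(embcat.shape)   # torch.Size([32, 30])
```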
- Hidden Layers and Activation:
- The model has a single hidden layer with a tanh activation. This is a typical MLP construction. The weights `W1` and biases help the model learn patterns and relationships from the input features (see the sketch after these bullets).
- The inputs are fixed in size, and the model learns only from the relationships present in that fixed set of features.
- Transformers do contain feed-forward sublayers that are trained much like an MLP, with weights and full connections between layers. But they add extra machinery (Query, Key, and Value projections, plus positional embeddings) that captures the context of the surrounding sequence of tokens, so the model is not tied to any fixed arrangement of words.
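For reference, a minimal sketch of that single-hidden-layer forward pass, continuing from the flattened `embcat` above; the hidden size and the output parameters `W2`, `b2` are assumptions about the rest of the model rather than the exact code being discussed:

```python
import torch

batch, n_features, n_hidden, vocab_size = 32, 30, 200, 27   # hypothetical sizes
embcat = torch.randn(batch, n_features)                     # flattened embeddings

W1 = torch.randn(n_features, n_hidden) * 0.1   # hidden-layer weights
b1 = torch.zeros(n_hidden)                     # hidden-layer biases
W2 = torch.randn(n_hidden, vocab_size) * 0.1   # output weights (assumed)
b2 = torch.zeros(vocab_size)                   # output biases (assumed)

h = torch.tanh(embcat @ W1 + b1)   # single hidden layer with tanh activation
logits = h @ W2 + b2               # scores over the vocabulary
```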
- Traditional MLP & Context:
- In traditional MLP models like the one you shared, we treat sequences as a bag of tokens. The order can matter, but only in the sense that each position in the sequence becomes a separate feature. There's no inherent mechanism to understand the relationships or dependencies between tokens. They're fed to the model as a long vector of concatenated embeddings, so any pattern the MLP picks up regarding order or relationships between tokens has to be learned from the ground up during training.
- Transformers & Context:
- Transformers approach sequences in a fundamentally different way. Every token is given a chance to 'interact' with every other token, allowing the model to understand relationships and dependencies across the entire sequence.
- The self-attention mechanism is key to this. By calculating attention scores for each token with respect to every other token, the model can weigh the importance of different parts of the sequence when considering a particular token.
- This is complemented by the position embeddings, which give the model information about the order of tokens.
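A minimal sketch of that scaled dot-product self-attention step, assuming a single head with no masking and made-up dimensions, purely to show every token attending to every other token:

```python
import torch
import torch.nn.functional as F

batch, seq_len, d_model = 2, 5, 32                 # hypothetical sizes
x = torch.randn(batch, seq_len, d_model)           # token embeddings + positional embeddings

# Learned projections to Queries, Keys, and Values.
Wq = torch.nn.Linear(d_model, d_model, bias=False)
Wk = torch.nn.Linear(d_model, d_model, bias=False)
Wv = torch.nn.Linear(d_model, d_model, bias=False)
q, k, v = Wq(x), Wk(x), Wv(x)

# Attention scores: every token against every other token.
scores = q @ k.transpose(-2, -1) / d_model ** 0.5  # (batch, seq_len, seq_len)
weights = F.softmax(scores, dim=-1)
out = weights @ v                                  # context-mixed token representations
```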
- Granular vs. Holistic Context:
- When we talk about context in the realm of NLP, it can be at different levels:
a. Token-Level Context: This refers to the relationships and dependencies between individual tokens in a sequence. Transformers excel here, as they can capture intricate patterns and relationships across different parts of a sequence.
b. Block/Sequence-Level Context: This refers to understanding entire sequences or blocks of text as units. RNNs, for example, process sequences token by token, accumulating context as they go (see the sketch at the end of this section).
- In Essence:
- Transformers allow for a more granular, detailed understanding of context at the token level, whereas traditional MLPs or even some sequence models might deal with context at a more holistic block level.
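And for item (b) above, a minimal sketch of an RNN accumulating block/sequence-level context token by token; the sizes and the plain `nn.RNN` choice are illustrative assumptions:

```python
import torch

batch, seq_len, emb_dim, n_hidden = 2, 5, 16, 32   # hypothetical sizes
x = torch.randn(batch, seq_len, emb_dim)           # a batch of embedded sequences

rnn = torch.nn.RNN(emb_dim, n_hidden, batch_first=True)

# The hidden state is carried forward step by step, accumulating context as it goes;
# h_n summarizes the whole sequence as a single block-level vector per example.
outputs, h_n = rnn(x)
print(outputs.shape, h_n.shape)   # torch.Size([2, 5, 32]) torch.Size([1, 2, 32])
```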