
Technical Paper: Attention Mechanism (Bahdanau et al., 2015)

The Attention Mechanism introduced by Bahdanau et al. in 2015 was a groundbreaking improvement to the original Seq2Seq model.


📄 Paper Title:

Neural Machine Translation by Jointly Learning to Align and Translate


🚀 Why was Attention introduced?

Problem in basic Seq2Seq models (Sutskever et al., 2014):

  • The entire input sequence is compressed into one fixed-size context vector (last hidden state of the encoder RNN).
  • Information bottleneck problem:
    • Especially for long sentences, the model struggles to retain all necessary information.
    • Decoder depends heavily on one vector, regardless of input length.

💡 What is Attention Mechanism?

The core idea:

Instead of relying on a single context vector, allow the decoder to dynamically "attend" to different parts of the input sequence at each output step.


🔍 How does it work?

Architecture Overview:

  1. Encoder:

    • Processes input sequence → outputs hidden states for each input token: $h_1, h_2, h_3, \dots, h_T$
  2. Decoder:

    • Instead of taking just the final encoder state, the decoder computes, at every decoding step, a weighted sum of all encoder hidden states → this is called the context vector.

Key Formula:

  1. Score Function (Alignment Model):
    Measures how well the decoder's previous state matches each encoder hidden state:

    $e_{ij} = \text{score}(s_{i-1}, h_j)$

    Where:

    • $s_{i-1}$ = decoder's previous hidden state
    • $h_j$ = encoder's hidden state at position $j$
  2. Softmax over scores (Attention Weights):

    $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$

    These weights tell how much attention to pay to each input token.

  3. Context Vector:

    $c_i = \sum_{j=1}^{T} \alpha_{ij} h_j$

    This is fed into the decoder at each step, along with the previous outputs (a small NumPy sketch of these three steps follows this list).
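A minimal NumPy sketch of these three steps for a single decoding step, assuming toy dimensions and randomly initialized parameters; `W_a`, `U_a`, and `v_a` follow the paper's additive score $e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$, but the numbers here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T, enc_dim, dec_dim, attn_dim = 5, 8, 8, 10      # toy sizes

h = rng.normal(size=(T, enc_dim))                # encoder hidden states h_1 .. h_T
s_prev = rng.normal(size=(dec_dim,))             # decoder's previous hidden state s_{i-1}

# 1. Additive (Bahdanau) alignment scores: e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)
W_a = rng.normal(size=(attn_dim, dec_dim))
U_a = rng.normal(size=(attn_dim, enc_dim))
v_a = rng.normal(size=(attn_dim,))
e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in h])   # shape (T,)

# 2. Softmax over scores -> attention weights alpha_ij (they sum to 1)
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# 3. Context vector c_i = sum_j alpha_ij * h_j, fed to the decoder at this step
c = alpha @ h                                    # shape (enc_dim,)

print("attention weights:", alpha.round(3))
print("context vector shape:", c.shape)
```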


Intuition:

  • For each output word, the model learns where to focus in the input sentence.
  • Mimics how humans read and translate—looking back and forth between input and output.

🌟 Impact:

  1. Improved Translation Quality:
    Handles long sentences and complex dependencies better.

  2. Visual Interpretability:
    You can visualize the attention weights → see which input words the model is focusing on while generating output (a plotting sketch follows this list).

  3. Foundation for Transformers:

    • The Bahdanau Attention Mechanism inspired Self-Attention, which is the core of Transformers (Vaswani et al., 2017).
    • Transformers remove RNNs but retain the attention concept.
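As a rough illustration of the interpretability point above, the attention weights collected over a whole translation form a matrix (one row per output token, one column per input token) that can be plotted directly. A minimal matplotlib sketch with made-up weights and tokens:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical attention weights: rows = output tokens, columns = input tokens
weights = np.array([[0.80, 0.15, 0.05],
                    [0.10, 0.75, 0.15],
                    [0.05, 0.20, 0.75]])
src_tokens = ["je", "suis", "étudiant"]   # example source sentence
tgt_tokens = ["I", "am", "a student"]     # example generated tokens

plt.imshow(weights, cmap="viridis")
plt.xticks(range(len(src_tokens)), src_tokens)
plt.yticks(range(len(tgt_tokens)), tgt_tokens)
plt.colorbar(label="attention weight")
plt.title("Where the decoder looks at each output step")
plt.show()
```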

🔗 Quick Summary:

| Without Attention | With Attention (Bahdanau) |
| --- | --- |
| One fixed context vector | Dynamic, weighted context at each step |
| Encoder hidden states compressed to one | Decoder attends to all encoder states |
| Bottleneck for long sentences | Handles long sequences better |
| Less interpretability | Can visualize attention weights |

PDF Link:

📄 Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al., 2015)

The Attention Mechanism by Bahdanau et al. (2015), also called Additive Attention or Bahdanau Attention, was a game-changing improvement to the original Sequence-to-Sequence (Seq2Seq) architecture. Here is a more detailed breakdown:


1. What Problem Did Bahdanau Attention Solve?

Issue with original Seq2Seq (Sutskever et al., 2014):

  • The entire input sequence is compressed into a single fixed-size context vector (the encoder's final hidden state).
  • This vector becomes a bottleneck, especially for long or complex sequences.
  • The decoder has limited access to the details of the input sequence.

Goal:

Allow the decoder to dynamically focus on different parts of the input sequence while generating each output token.


2. How Does Bahdanau Attention Work?

Step-by-Step:

  1. Encoder outputs:

    • Instead of keeping just the last hidden state, the encoder outputs a hidden state for each input token:
      h1, h2, h3, ..., hn
      
  2. At each decoder step:

    • The decoder looks at all encoder hidden states and computes a score (alignment) for each one based on how relevant it is to the current decoding step.
  3. Alignment Scores:

    • These scores are computed using a small neural network that takes:
      • The decoder’s previous hidden state.
      • Each encoder hidden state.
    • Produces a score → How important is this encoder state?
  4. Softmax:

    • Convert scores into probabilities (attention weights) using softmax.
  5. Context Vector:

    • Multiply each encoder hidden state by its attention weight and sum them up.
    • This weighted sum becomes the context vector passed to the decoder.
  6. Decoder uses:

    • Context vector + previous outputs → generates next token.

Visual:

Input:  [x1, x2, x3, ..., xn]
Encoder: h1, h2, h3, ..., hn (keep all hidden states)

For each output token (yt):
    Compute alignment scores between decoder state and all h1...hn
    Apply softmax → get attention weights
    Weighted sum → Context vector
    Use context vector + decoder hidden state → predict yt
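The same loop as a runnable NumPy sketch; to keep it self-contained, the alignment score is simplified to a dot product (the paper uses the additive score shown earlier) and the decoder's state update is faked with a tanh mix rather than a real RNN cell:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, vocab = 6, 8, 20                       # input length, hidden size, vocab size (toy)

H = rng.normal(size=(T, d))                  # all encoder hidden states h_1 .. h_n
W_out = rng.normal(size=(vocab, 2 * d))      # output projection over [decoder state; context]

s = np.zeros(d)                              # decoder hidden state
for t in range(3):                           # generate a few output tokens
    scores = H @ s                           # simplified dot-product alignment (stand-in for the additive score)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax -> attention weights
    context = weights @ H                    # weighted sum of encoder states
    logits = W_out @ np.concatenate([s, context])
    y_t = int(np.argmax(logits))             # predicted token id
    # A real decoder would update s with an RNN cell on (y_t, context);
    # here we just perturb it so each step attends differently.
    s = np.tanh(s + context + rng.normal(scale=0.1, size=d))
    print(f"step {t}: token {y_t}, top attention at position {int(weights.argmax())}")
```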

3. How Did Bahdanau Attention Help?

  • No more bottleneck: the decoder doesn't rely on just one vector; it has dynamic access to all input positions.
  • Better handling of long sentences.
  • Improved performance in machine translation and other Seq2Seq tasks.
  • The model learns to align input and output tokens.


4. How Did This Influence Transformers?

Bahdanau Attention directly inspired the Self-Attention mechanism in Transformers:

| Bahdanau Attention (2015) | Transformer Attention (2017) |
| --- | --- |
| Computes attention between decoder state & encoder states | Computes attention between all tokens, including self |
| Uses encoder-decoder alignment | Uses multi-head self-attention (more parallelizable) |
| Sequential, relies on RNNs | Fully parallel, no recurrence |
| Single attention mechanism | Multiple attention heads + layers → richer representations |

Key Insight Carried Forward:

“Don’t rely on a single context vector; dynamically learn which parts of the sequence to focus on.”

Transformers generalized and scaled up attention (a minimal self-attention sketch follows this list):

  • Applied Self-Attention to both encoder and decoder.
  • Removed RNNs → making computation parallel.
  • Introduced Multi-Head Attention → attending to information in multiple subspaces.
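For contrast, here is a single-head sketch of the scaled dot-product self-attention that Transformers build on; the projections are randomly initialized, and multi-head splitting, masking, and the output projection are omitted, so treat it as an illustration rather than a full Transformer layer:

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, d_model, d_k = 4, 16, 16           # toy sizes

X = rng.normal(size=(n_tokens, d_model))     # one token embedding per row

# Learned projections (random here) map the same tokens to queries, keys, and values.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scaled dot-product attention: every token attends to every token, itself included.
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ V                         # shape (n_tokens, d_k)

print(weights.round(2))                      # each row sums to 1: one token's attention over all tokens
```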

5. Simplified Difference:

| | Bahdanau Attention | Transformer Attention |
| --- | --- | --- |
| Depends on | Decoder + encoder hidden states | All tokens attend to each other |
| Core idea | Learn alignment between input & output positions | Learn relationships between all tokens |
| Architecture | RNN-based (sequential) | No recurrence, pure attention (parallelizable) |
| Scaling | Harder for long sequences | Scales efficiently with large data |

Summary:

Bahdanau's Attention introduced the concept of dynamically focusing on parts of input sequences, solving a major limitation of early Seq2Seq models. This concept evolved into the full self-attention mechanism in Transformers, making them far more powerful, scalable, and parallelizable.


Link: https://arxiv.org/pdf/1409.0473
