
Technical Paper: Attention Mechanism (Bahdanau et al., 2015)

The Attention Mechanism introduced by Bahdanau et al. in 2015 was a groundbreaking improvement to the original Seq2Seq model.


📄 Paper Title:

Neural Machine Translation by Jointly Learning to Align and Translate


🚀 Why was Attention introduced?

Problem in basic Seq2Seq models (Sutskever et al., 2014):

  • The entire input sequence is compressed into one fixed-size context vector (last hidden state of the encoder RNN).
  • Information bottleneck problem:
    • Especially for long sentences, the model struggles to retain all necessary information.
    • Decoder depends heavily on one vector, regardless of input length.

💡 What is Attention Mechanism?

The core idea:

Instead of relying on a single context vector, allow the decoder to dynamically "attend" to different parts of the input sequence at each output step.


🔍 How does it work?

Architecture Overview:

  1. Encoder:

    • Processes input sequence → outputs hidden states for each input token: $h_1, h_2, h_3, \dots, h_T$
  2. Decoder:

    • Instead of taking just the final encoder state, the decoder computes, at every decoding step, a weighted sum of all encoder hidden states → this is called the context vector.

Key Formula:

  1. Score Function (Alignment Model):
    Measures how well the decoder's previous state matches each encoder hidden state:

    $e_{ij} = \text{score}(s_{i-1}, h_j)$

    Where:

    • $s_{i-1}$ = decoder's previous hidden state
    • $h_j$ = encoder's hidden state at position $j$
  2. Softmax over scores (Attention Weights):

    $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$

    These weights tell how much attention to pay to each input token.

  3. Context Vector:

    $c_i = \sum_{j=1}^{T} \alpha_{ij} h_j$

    This is fed into the decoder at each step, along with the previous outputs (a small NumPy sketch of these three steps follows this list).
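A minimal NumPy sketch of these three steps for a single decoding step, assuming toy dimensions and randomly initialized parameters; `W_a`, `U_a`, and `v_a` follow the paper's additive score $e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$, but the numbers here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T, enc_dim, dec_dim, attn_dim = 5, 8, 8, 10      # toy sizes

h = rng.normal(size=(T, enc_dim))                # encoder hidden states h_1 .. h_T
s_prev = rng.normal(size=(dec_dim,))             # decoder's previous hidden state s_{i-1}

# 1. Additive (Bahdanau) alignment scores: e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)
W_a = rng.normal(size=(attn_dim, dec_dim))
U_a = rng.normal(size=(attn_dim, enc_dim))
v_a = rng.normal(size=(attn_dim,))
e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in h])   # shape (T,)

# 2. Softmax over scores -> attention weights alpha_ij (they sum to 1)
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# 3. Context vector c_i = sum_j alpha_ij * h_j, fed to the decoder at this step
c = alpha @ h                                    # shape (enc_dim,)

print("attention weights:", alpha.round(3))
print("context vector shape:", c.shape)
```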


Intuition:

  • For each output word, the model learns where to focus in the input sentence.
  • Mimics how humans read and translate—looking back and forth between input and output.

🌟 Impact:

  1. Improved Translation Quality:
    Handles long sentences and complex dependencies better.

  2. Visual Interpretability:
    You can visualize the attention weights → see which input words the model is focusing on while generating output (a plotting sketch follows this list).

  3. Foundation for Transformers:

    • The Bahdanau Attention Mechanism inspired Self-Attention, which is the core of Transformers (Vaswani et al., 2017).
    • Transformers remove RNNs but retain the attention concept.
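As a rough illustration of the interpretability point above, the attention weights collected over a whole translation form a matrix (one row per output token, one column per input token) that can be plotted directly. A minimal matplotlib sketch with made-up weights and tokens:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical attention weights: rows = output tokens, columns = input tokens
weights = np.array([[0.80, 0.15, 0.05],
                    [0.10, 0.75, 0.15],
                    [0.05, 0.20, 0.75]])
src_tokens = ["je", "suis", "étudiant"]   # example source sentence
tgt_tokens = ["I", "am", "a student"]     # example generated tokens

plt.imshow(weights, cmap="viridis")
plt.xticks(range(len(src_tokens)), src_tokens)
plt.yticks(range(len(tgt_tokens)), tgt_tokens)
plt.colorbar(label="attention weight")
plt.title("Where the decoder looks at each output step")
plt.show()
```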

🔗 Quick Summary:

| Without Attention | With Attention (Bahdanau) |
| --- | --- |
| One fixed context vector | Dynamic, weighted context at each step |
| Encoder hidden states compressed to one | Decoder attends to all encoder states |
| Bottleneck for long sentences | Handles long sequences better |
| Less interpretability | Can visualize attention weights |

PDF Link:

📄 Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al., 2015)

The Attention Mechanism by Bahdanau et al. (2015), also called Additive Attention or Bahdanau Attention, was a game-changing improvement to the original Sequence-to-Sequence (Seq2Seq) architecture. Here is a more detailed breakdown:


1. What Problem Did Bahdanau Attention Solve?

Issue with original Seq2Seq (Sutskever et al., 2014):

  • The entire input sequence is compressed into a single fixed-size context vector (the encoder's final hidden state).
  • This vector becomes a bottleneck, especially for long or complex sequences.
  • The decoder has limited access to the details of the input sequence.

Goal:

Allow the decoder to dynamically focus on different parts of the input sequence while generating each output token.


2. How Does Bahdanau Attention Work?

Step-by-Step:

  1. Encoder outputs:

    • Instead of keeping just the last hidden state, the encoder outputs a hidden state for each input token:
      h1, h2, h3, ..., hn
      
  2. At each decoder step:

    • The decoder looks at all encoder hidden states and computes a score (alignment) for each one based on how relevant it is to the current decoding step.
  3. Alignment Scores:

    • These scores are computed using a small neural network that takes:
      • The decoder’s previous hidden state.
      • Each encoder hidden state.
    • Produces a score → How important is this encoder state?
  4. Softmax:

    • Convert scores into probabilities (attention weights) using softmax.
  5. Context Vector:

    • Multiply each encoder hidden state by its attention weight and sum them up.
    • This weighted sum becomes the context vector passed to the decoder.
  6. Decoder uses:

    • Context vector + previous outputs → generates next token.

Visual:

Input:  [x1, x2, x3, ..., xn]
Encoder: h1, h2, h3, ..., hn (keep all hidden states)

For each output token (yt):
    Compute alignment scores between decoder state and all h1...hn
    Apply softmax → get attention weights
    Weighted sum → Context vector
    Use context vector + decoder hidden state → predict yt
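The same loop as a runnable NumPy sketch; to keep it self-contained, the alignment score is simplified to a dot product (the paper uses the additive score shown earlier) and the decoder's state update is faked with a tanh mix rather than a real RNN cell:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, vocab = 6, 8, 20                       # input length, hidden size, vocab size (toy)

H = rng.normal(size=(T, d))                  # all encoder hidden states h_1 .. h_n
W_out = rng.normal(size=(vocab, 2 * d))      # output projection over [decoder state; context]

s = np.zeros(d)                              # decoder hidden state
for t in range(3):                           # generate a few output tokens
    scores = H @ s                           # simplified dot-product alignment (stand-in for the additive score)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax -> attention weights
    context = weights @ H                    # weighted sum of encoder states
    logits = W_out @ np.concatenate([s, context])
    y_t = int(np.argmax(logits))             # predicted token id
    # A real decoder would update s with an RNN cell on (y_t, context);
    # here we just perturb it so each step attends differently.
    s = np.tanh(s + context + rng.normal(scale=0.1, size=d))
    print(f"step {t}: token {y_t}, top attention at position {int(weights.argmax())}")
```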

3. How Did Bahdanau Attention Help?

  • No more bottleneck: the decoder doesn't rely on just one vector; it has dynamic access to all input positions.
  • Better handling of long sentences.
  • Improved performance in machine translation and other Seq2Seq tasks.
  • The model learns to align input and output tokens.


4. How Did This Influence Transformers?

Bahdanau Attention directly inspired the Self-Attention mechanism in Transformers:

| Bahdanau Attention (2015) | Transformer Attention (2017) |
| --- | --- |
| Computes attention between decoder state & encoder states | Computes attention between all tokens, including self |
| Uses encoder-decoder alignment | Uses multi-head self-attention (more parallelizable) |
| Sequential, relies on RNNs | Fully parallel, no recurrence |
| Single attention mechanism | Multiple attention heads + layers → richer representations |

Key Insight Carried Forward:

“Don’t rely on a single context vector; dynamically learn which parts of the sequence to focus on.”

Transformers generalized and scaled up attention (a minimal self-attention sketch follows this list):

  • Applied Self-Attention to both encoder and decoder.
  • Removed RNNs → making computation parallel.
  • Introduced Multi-Head Attention → attending to information in multiple subspaces.
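For contrast, here is a single-head sketch of the scaled dot-product self-attention that Transformers build on; the projections are randomly initialized, and multi-head splitting, masking, and the output projection are omitted, so treat it as an illustration rather than a full Transformer layer:

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, d_model, d_k = 4, 16, 16           # toy sizes

X = rng.normal(size=(n_tokens, d_model))     # one token embedding per row

# Learned projections (random here) map the same tokens to queries, keys, and values.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scaled dot-product attention: every token attends to every token, itself included.
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ V                         # shape (n_tokens, d_k)

print(weights.round(2))                      # each row sums to 1: one token's attention over all tokens
```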

5. Simplified Difference:

| | Bahdanau Attention | Transformer Attention |
| --- | --- | --- |
| Depends on | Decoder + encoder hidden states | All tokens attend to each other |
| Core idea | Learn alignment between input & output positions | Learn relationships between all tokens |
| Architecture | RNN-based (sequential) | No recurrence, pure attention (parallelizable) |
| Scaling | Harder for long sequences | Scales efficiently with large data |

Summary:

Bahdanau's Attention introduced the concept of dynamically focusing on parts of input sequences, solving a major limitation of early Seq2Seq models. This concept evolved into the full self-attention mechanism in Transformers, making them far more powerful, scalable, and parallelizable.


Link: https://arxiv.org/pdf/1409.0473
