The paper "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima" by Keskar et al. (2017) investigates the relationship between batch size, optimization landscapes, and generalization in deep learning. A key contribution is the flat minima vs. sharp minima hypothesis, which explains why smaller batches often generalize better than larger ones.
Key Contributions of the Paper
Generalization Gap Observation:
Large-batch training (e.g., thousands of examples per batch) often leads to worse test accuracy than small-batch training (e.g., 32-256 examples per batch), even when both reach a similar training loss.
This is called the "generalization gap."
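As a concrete (hypothetical) illustration of how such a comparison is typically set up, the sketch below trains the same architecture twice, changing only the batch size; `make_model`, `evaluate`, `train_set`, and `test_set` are assumed helpers and datasets, not code from the paper.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def train(model, train_set, batch_size, epochs=30, lr=0.1):
    """Plain SGD training loop; only batch_size differs between the two runs."""
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
    return model

# small_model = train(make_model(), train_set, batch_size=32)
# large_model = train(make_model(), train_set, batch_size=8192)
# evaluate(small_model, test_set) is typically higher than evaluate(large_model, test_set),
# even when both runs reach a similar training loss.
```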
Flat vs. Sharp Minima Hypothesis:
Small-batch methods tend to converge to flat minima (wide, low-curvature regions in the loss landscape).
Large-batch methods tend to find sharp minima (narrow, high-curvature regions).
Flat minima tend to generalize better because the loss stays low under small perturbations of the parameters, making the solution less sensitive to the mismatch between training and test data.
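A toy one-dimensional picture (purely illustrative, not from the paper) makes the intuition concrete: two minima with the same loss value but different curvature react very differently to the same small parameter perturbation.

```python
# Two quadratic "minima" with identical value at w = 0 but different curvature.
flat  = lambda w: 0.5 * 0.1 * w**2    # low curvature:  a "flat" minimum
sharp = lambda w: 0.5 * 10.0 * w**2   # high curvature: a "sharp" minimum

delta = 0.3                            # the same small parameter perturbation for both
print(flat(delta), sharp(delta))       # 0.0045 vs. 0.45: the sharp minimum's loss blows up
```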
Empirical Evidence:
Quantifies the "sharpness" of a minimum by how much the loss can increase within a small neighborhood of the solution, a computationally cheaper proxy for large Hessian eigenvalues (larger eigenvalues → sharper minima; see the power-iteration sketch after this list).
Shows that large-batch SGD converges to sharper minima, while small-batch SGD finds flatter ones.
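The paper's own sharpness score is a constrained-maximization proxy rather than an exact eigenvalue computation, but the curvature intuition can be probed directly. The sketch below, assuming a standard PyTorch classifier and a loss computed with the autograd graph still attached, estimates the largest Hessian eigenvalue by power iteration on Hessian-vector products; it is an illustration, not the paper's procedure.

```python
import torch

def top_hessian_eigenvalue(model, loss, iters=20):
    """Estimate the largest eigenvalue of the loss Hessian via power iteration on
    Hessian-vector products (autograd computes Hv without forming the Hessian)."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Start from a random unit vector in parameter space.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((u * u).sum() for u in v))
    v = [u / norm for u in v]

    eigenvalue = 0.0
    for _ in range(iters):
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        eigenvalue = sum((h * u).sum() for h, u in zip(hv, v)).item()  # Rayleigh quotient
        norm = torch.sqrt(sum((h * h).sum() for h in hv)) + 1e-12
        v = [h / norm for h in hv]
    return eigenvalue

# loss = F.cross_entropy(model(x), y)          # loss at the converged solution
# print(top_hessian_eigenvalue(model, loss))   # larger => sharper minimum
```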
Why Do Flat Minima Generalize Better?
Robustness to Perturbations:
At a flat minimum, the loss remains low even if the parameters are slightly perturbed, which bodes well for unseen data.
At a sharp minimum, the loss can spike under small parameter changes, a signature of fitting idiosyncrasies of the training data.
Implicit Regularization:
Small-batch SGD has noisier gradient estimates, and this noise helps it escape sharp minima.
Large-batch SGD is closer to deterministic gradient descent and tends to settle into the nearest (often sharp) basin; a rough way to see the difference in gradient noise is sketched below.
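To make the noise argument tangible, the hypothetical sketch below measures how far minibatch gradients scatter from the full-batch gradient for a given batch size; `model`, `loss_fn`, and `train_set` are assumed inputs, and this is an illustration rather than anything from the paper.

```python
import torch
from torch.utils.data import DataLoader

def grad_vector(model, loss):
    """Flatten the gradient of `loss` w.r.t. all trainable parameters into one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    return torch.cat([g.flatten() for g in torch.autograd.grad(loss, params)])

def gradient_noise(model, loss_fn, train_set, batch_size, n_batches=50):
    """Average distance between minibatch gradients and the full-batch gradient."""
    x, y = next(iter(DataLoader(train_set, batch_size=len(train_set))))
    full_grad = grad_vector(model, loss_fn(model(x), y))
    deviations = []
    for i, (xb, yb) in enumerate(DataLoader(train_set, batch_size=batch_size, shuffle=True)):
        if i >= n_batches:
            break
        deviations.append((grad_vector(model, loss_fn(model(xb), yb)) - full_grad).norm())
    return torch.stack(deviations).mean()

# gradient_noise(model, F.cross_entropy, train_set, batch_size=32) is typically much
# larger than with batch_size=8192; that extra noise acts as an implicit regularizer.
```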
Connection to Bayesian Inference:
Flat minima can be seen as high-probability regions in the posterior distribution of parameters.
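This connection can be made slightly more precise with a standard Laplace-approximation (Occam factor) argument, which is textbook material rather than part of the paper:

$$
p(\mathcal{D}) = \int p(\mathcal{D}\mid\theta)\,p(\theta)\,d\theta \;\approx\; p(\mathcal{D}\mid\theta^{*})\,p(\theta^{*})\,(2\pi)^{d/2}\,\lvert H \rvert^{-1/2},
$$

where θ* is the minimum, d the number of parameters, and H the Hessian of the negative log-posterior at θ*. A sharper minimum has a larger determinant |H|, hence a smaller Occam factor and less posterior mass; flat minima correspond to broad, high-volume regions of the posterior.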
Experimental Results
Tested on fully connected and convolutional networks across MNIST, TIMIT, CIFAR-10, and CIFAR-100.
Findings:
Large batches (sharp minima):
Train fast but generalize poorly.
Higher Hessian eigenvalues (more curvature).
Small batches (flat minima):
Train slower but generalize better.
Lower Hessian eigenvalues (smoother loss landscape).
Proposed Solutions for Large-Batch Training
Since large batches are desirable for parallel training, the paper suggests:
Warmup + Gradual Batch Increase: start with small batches (or a small-batch warm-start phase) before scaling up, to avoid committing to a sharp minimum early (see the schedule sketch after this list).
Conservative Learning Rates: prevent aggressive convergence into sharp minima.
Sharpness-Aware Minimization (SAM): not from this paper, but later work (Foret et al., 2021) that builds on its findings by explicitly seeking flat minima.
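A hedged sketch of what such a schedule might look like in practice (illustrative numbers; the paper experiments with warm-starting but does not prescribe this exact recipe):

```python
def make_schedule(base_lr=0.1, base_batch=64, warmup_epochs=5, total_epochs=30):
    """Per-epoch (learning_rate, batch_size): linear LR warmup, then batch-size doubling."""
    schedule = []
    for epoch in range(total_epochs):
        if epoch < warmup_epochs:
            lr, batch = base_lr * (epoch + 1) / warmup_epochs, base_batch
        else:
            doublings = min((epoch - warmup_epochs) // 5, 4)   # cap batch growth at 16x
            lr, batch = base_lr, base_batch * 2 ** doublings
        schedule.append((lr, batch))
    return schedule

# for epoch, (lr, batch) in enumerate(make_schedule()):
#     loader = DataLoader(train_set, batch_size=batch, shuffle=True)
#     for group in optimizer.param_groups:
#         group["lr"] = lr
#     ...train one epoch over `loader`...
```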
Criticisms & Later Work
Sharpness Definitions Matter: Dinh et al. (2017) show that common flatness measures are not invariant to re-parameterization (e.g., layer-wise re-scaling), so a "sharp" minimum can be made to look flat without changing the function it computes (a quick numerical check follows this list).
Not All Flat Minima Generalize: flatness by itself is neither necessary nor sufficient for good generalization.
Alternative Views: Some papers suggest "volume of minima" (not just flatness) matters (Baldassi et al., 2020).
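The re-scaling point is easy to verify numerically. The snippet below (a self-contained check, not code from Dinh et al.) scales the first layer of a two-layer ReLU network up and the second layer down by the same factor: the computed function is unchanged, while the parameter-space geometry, and hence any raw curvature-based sharpness measure, changes.

```python
import torch

torch.manual_seed(0)
W1, W2 = torch.randn(16, 8), torch.randn(4, 16)
x = torch.randn(8)
alpha = 100.0

f = lambda a, b, v: b @ torch.relu(a @ v)        # two-layer ReLU network
y_original = f(W1, W2, x)
y_rescaled = f(alpha * W1, W2 / alpha, x)        # ReLU is positively homogeneous

print(torch.allclose(y_original, y_rescaled, atol=1e-4))  # True: same function
# Yet the loss surface around (alpha*W1, W2/alpha) is curved very differently from the
# surface around (W1, W2), so curvature-based "sharpness" is not re-parameterization invariant.
```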
Code Example: Measuring Sharpness
```python
import torch
import torch.nn.functional as F

def compute_loss(model, dataset):
    """Average cross-entropy over `dataset`, assumed to yield (inputs, labels) batches."""
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for x, y in dataset:
            total += F.cross_entropy(model(x), y, reduction="sum").item()
            count += y.size(0)
    return total / count

def compute_sharpness(model, dataset, epsilon=1e-3):
    """Rough sharpness estimate via random parameter perturbations (a simple proxy,
    not the constrained-maximization metric used in the paper)."""
    original_loss = compute_loss(model, dataset)
    max_sharpness = 0.0
    with torch.no_grad():
        for param in model.parameters():
            perturbation = torch.randn_like(param) * epsilon
            param.add_(perturbation)                 # perturb this parameter in place
            perturbed_loss = compute_loss(model, dataset)
            sharpness = abs(perturbed_loss - original_loss) / epsilon
            max_sharpness = max(max_sharpness, sharpness)
            param.sub_(perturbation)                 # restore the original weights
    return max_sharpness
```
Key Takeaways
Small batches → Flat minima → Better generalization.
Large batches → Sharp minima → Faster training but worse generalization.
Solutions: Warmup, learning-rate tuning, and sharpness-aware optimizers such as SAM (a minimal sketch of a SAM step follows).
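For completeness, here is a minimal sketch of a single SAM-style update, assuming a standard PyTorch model, loss function, and base optimizer; it follows the ascend-then-descend idea from Foret et al. (2021) but is not their reference implementation.

```python
import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    """One sharpness-aware update: ascend to the worst-case weights inside an L2 ball
    of radius rho, take the gradient there, then descend from the original weights."""
    params = [p for p in model.parameters() if p.requires_grad]

    # 1) Gradient at the current weights.
    loss_fn(model(x), y).backward()
    grad_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params)) + 1e-12

    # 2) Perturb the weights in the direction of steepest loss increase.
    with torch.no_grad():
        eps = [rho * p.grad / grad_norm for p in params]
        for p, e in zip(params, eps):
            p.add_(e)
    model.zero_grad()

    # 3) Gradient at the perturbed weights, then undo the perturbation and update.
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    base_optimizer.step()
    model.zero_grad()
```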
This paper laid the foundation for understanding how optimization dynamics affect generalization in deep learning and inspired later work on sharpness-aware training.
Link: https://arxiv.org/pdf/1609.04836