
Technical Paper: A Closer Look at Memorization in Deep Networks

Paper: "A Closer Look at Memorization in Deep Networks"

Authors: Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, Simon Lacoste-Julien
Year: 2017 (ICML 2017)


1. Motivation: Why Study Memorization?

Deep neural networks are known to memorize their training data:

  • They can perfectly fit random labels.
  • Yet they generalize well on real data.

Key Question:
Does memorization interfere with generalization? How do deep networks balance both?

This paper investigates:

  • How neural networks memorize.
  • Whether memorization is necessary for generalization.
  • How memorization relates to network depth, architecture, and optimization.

2. Key Contributions

The authors make several key empirical and theoretical observations:

1. Memorization happens only when needed.

  • Networks prefer to learn patterns and structure first.
  • Memorization kicks in only when no other option exists (e.g., when labels are random).

2. Progressive Learning Behavior

  • On structured data:
    • "Easy" patterns are learned first.
    • Hard-to-learn or noisy labels are memorized later during training.
  • Training dynamics matter: early stopping can prevent overfitting noise.

3. Generalization doesn't rely on memorization.

  • Memorization capability is present but not used unless forced.
  • Over-parameterization allows both memorization and generalization.

4. Depth improves memorization speed.

  • Deeper networks memorize random labels faster.
  • Suggests depth contributes to capacity.

5. Different datasets exhibit different memorization behaviors.

  • CIFAR-10 vs. random labels: different rates and stages of memorization.

3. Methodology

The authors conduct careful experiments on:

  • Real datasets: MNIST, CIFAR-10.
  • Corrupted datasets: randomly reassigned labels and inputs replaced with noise.

What they measure:

  • Per-example loss trajectories: track how each individual example's loss evolves over training (a minimal tracking sketch follows the lists below).
  • Compare:
    • Clean labels vs. Noisy/random labels.
    • Different network architectures: shallow vs. deep.

Observation tools:

  • Sorting examples by ease of learning.
  • Measuring memorization speed.
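
The sketch below shows one way such per-example loss tracking can be implemented in PyTorch. It is not the authors' code: the helper name `track_per_example_loss`, the SGD optimizer, and the hyperparameter defaults are illustrative assumptions, and the dataset is assumed to yield (tensor, integer label) pairs.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def track_per_example_loss(model, dataset, epochs=20, lr=0.01, batch_size=128):
    """Record every example's loss after each training epoch (illustrative helper)."""
    train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    eval_loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    history = []  # history[e][i] = loss of example i after epoch e

    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()

        # After each epoch, evaluate every example individually (no parameter updates).
        model.eval()
        losses = []
        with torch.no_grad():
            for x, y in eval_loader:
                losses.append(F.cross_entropy(model(x), y, reduction="none"))
        history.append(torch.cat(losses))

    return torch.stack(history)  # shape: (epochs, num_examples)
```

The resulting (epochs × examples) tensor is what the per-example analyses in the next section operate on.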

4. Key Experimental Findings (Detailed)

A. Training on Real vs. Random Labels

  • On real datasets:
    • Loss decreases rapidly.
    • Generalization improves early.
  • On random labels:
    • Loss decreases slowly.
    • Fitting requires many more epochs → clear memorization behavior (a sketch of the random-label setup follows this list).
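
For concreteness, here is a minimal sketch of how a random-label copy of a dataset can be built. The helper name `with_random_labels` and the assumption that the dataset yields (tensor image, integer label) pairs are mine, not the paper's.

```python
import torch
from torch.utils.data import TensorDataset

def with_random_labels(dataset, noise_fraction=1.0, num_classes=10, seed=0):
    """Copy a (tensor, int label) dataset, replacing a fraction of the labels
    with uniformly random classes -- the 'random label' setting."""
    g = torch.Generator().manual_seed(seed)
    xs = torch.stack([x for x, _ in dataset])
    ys = torch.tensor([int(y) for _, y in dataset])
    corrupt = torch.randperm(len(ys), generator=g)[: int(noise_fraction * len(ys))]
    ys[corrupt] = torch.randint(0, num_classes, (len(corrupt),), generator=g)
    return TensorDataset(xs, ys)
```

Setting noise_fraction below 1.0 gives the partial-noise regime, where structured and memorized examples coexist in the same training set.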

B. Per-example Loss Analysis

  • Easy-to-learn examples:
    • Loss decreases very quickly.
    • Typically represent structured patterns.
  • Hard/noisy examples:
    • Loss decreases much later.
    • This reflects memorization of noise (a sketch for ranking examples by when they are first fit follows this list).
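
One simple way to rank examples by "ease of learning" from a per-example loss history (such as the one produced by the tracking sketch in the Methodology section) is to record the first epoch at which each example's loss falls below a threshold. The helper and threshold below are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def first_fit_epoch(loss_history, threshold=0.1):
    """loss_history: tensor of shape (epochs, num_examples).
    Returns the first epoch at which each example's loss drops below
    `threshold` (the number of epochs if it never does)."""
    num_epochs = loss_history.shape[0]
    fitted = loss_history < threshold        # bool tensor, (epochs, num_examples)
    first = fitted.int().argmax(dim=0)       # index of the first True along the epoch axis
    first[~fitted.any(dim=0)] = num_epochs   # sentinel for examples never fitted
    return first

# Rank examples from easiest (fit earliest) to hardest:
# order = first_fit_epoch(history).argsort()
```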

C. Depth Analysis

  • Deeper networks memorize random labels faster (a rough measurement sketch follows below).
  • Depth increases capacity, yet generalization does not degrade.
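
A rough way to compare memorization speed across depths is to count the epochs needed to fit a random-label training set. The sketch below is illustrative rather than the paper's protocol: the MLP architecture, width, learning rate, and accuracy target are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader

def make_mlp(depth, in_dim=3 * 32 * 32, width=256, num_classes=10):
    """Plain fully connected network whose depth we vary (illustrative)."""
    layers, d = [nn.Flatten()], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, num_classes))
    return nn.Sequential(*layers)

def epochs_to_memorize(model, random_label_data, target_acc=0.99,
                       max_epochs=200, lr=0.01, batch_size=128):
    """Train with SGD and count epochs until training accuracy on the
    randomly labeled data reaches `target_acc`."""
    loader = DataLoader(random_label_data, batch_size=batch_size, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for epoch in range(1, max_epochs + 1):
        correct, total = 0, 0
        for x, y in loader:
            opt.zero_grad()
            logits = model(x)
            F.cross_entropy(logits, y).backward()
            opt.step()
            correct += (logits.argmax(dim=1) == y).sum().item()
            total += y.numel()
        if correct / total >= target_acc:
            return epoch
    return max_epochs

# e.g. compare epochs_to_memorize(make_mlp(depth=2), noisy_data)
# against epochs_to_memorize(make_mlp(depth=6), noisy_data).
```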

5. Theoretical Insight

While much of the paper is empirical, it connects to theoretical discussions on capacity:

  • Over-parameterized networks can represent any dataset, including random labels.
  • But gradient-based optimization biases learning toward structured patterns first (related to implicit bias in optimization).

6. Implications

Observation → Implication

  • Networks memorize noise only if forced to → networks inherently prefer structure, tying to the inductive biases of neural nets.
  • Memorization occurs late in training → early stopping can prevent overfitting to noise.
  • Depth accelerates memorization → depth increases capacity, but generalization is preserved.
  • Different datasets behave differently → dataset structure and label noise levels impact the memorization-generalization trade-off.

7. Relation to Other Work

This paper builds on prior findings:

  • Zhang et al. (2017): "Understanding Deep Learning Requires Rethinking Generalization" → showed deep nets fit random labels.

But Arpit et al. go further:

  • Analyze when and how memorization happens.
  • Disentangle memorization from generalization.

8. Practical Takeaways

  • Early stopping and regularization prevent unwanted memorization (a minimal early-stopping loop is sketched below).
  • Memorization is not inherently bad but doesn’t help generalization.
  • Depth should not be feared; it helps capacity but does not harm generalization if used properly.
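
As a concrete illustration of the early-stopping takeaway, here is a minimal, framework-agnostic PyTorch-style loop. The callables `run_train_epoch` and `validation_loss`, along with the patience value, are hypothetical placeholders, not an API from the paper.

```python
import copy

def train_with_early_stopping(model, run_train_epoch, validation_loss,
                              patience=5, max_epochs=100):
    """Generic early-stopping loop (illustrative callables): stop once the
    validation loss has not improved for `patience` consecutive epochs,
    then restore the best weights seen so far."""
    best_loss, best_state, stale = float("inf"), None, 0
    for _ in range(max_epochs):
        run_train_epoch(model)            # one pass over the training set
        loss = validation_loss(model)     # scalar loss on held-out data
        if loss < best_loss:
            best_loss = loss
            best_state = copy.deepcopy(model.state_dict())
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```

Because noisy labels tend to be memorized late in training, stopping at the validation-loss minimum discards mostly memorization rather than useful structure.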

9. Limitations & Future Directions

  • Mostly empirical.
  • Doesn't fully explain why deep networks prioritize structured data first (later research investigates the implicit bias of gradient descent).

Summary Table

Aspect → Finding

  • Training behavior → easy patterns are learned first; noisy labels are memorized last.
  • Random labels → networks can memorize them, but do so slowly.
  • Depth → speeds up memorization, while generalization holds.
  • Optimization dynamics → key to understanding memorization; early stopping helps generalization.
  • Dataset structure → affects the ease and order of memorization.

Technical Paper Link: https://arxiv.org/pdf/1706.05394
