Neural Tangent Kernel (NTK): Key Ideas
Paper: "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (Jacot et al., 2018)
Core Contribution: A theoretical framework to analyze the training dynamics of infinitely wide neural networks using kernel methods.
1. Intuition Behind NTK
At infinite width, a neural network behaves like a linear model in parameter space around its initialization, i.e. a first-order Taylor expansion in the parameters (see the sketch at the end of this section).
The NTK is a kernel function that describes how small changes in parameters affect the network’s output during gradient descent.
Key Insight: Training dynamics of wide networks simplify to kernel regression with the NTK.
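To make the linearization concrete, here is a minimal sketch in JAX of the first-order Taylor expansion of a network around its initial parameters; the toy model f, its parameter names, and linearize_at_init are illustrative, not from the paper.

import jax
import jax.numpy as jnp

# Toy scalar-output model; any JAX-differentiable network could be used instead.
def f(params, x):
    return jnp.tanh(x @ params["W"]) @ params["v"]

def linearize_at_init(f, params0, x):
    # f_lin(theta) = f(theta_0) + <grad_theta f(theta_0), theta - theta_0>
    def f_lin(params):
        delta = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
        f0, jvp_out = jax.jvp(lambda p: f(p, x), (params0,), (delta,))
        return f0 + jvp_out
    return f_lin

# Example usage with random parameters (shapes are illustrative)
key = jax.random.PRNGKey(0)
params0 = {"W": jax.random.normal(key, (3, 64)), "v": jnp.ones(64) / 64}
x = jnp.array([0.5, -1.0, 2.0])
f_lin = linearize_at_init(f, params0, x)
print(f_lin(params0))  # equals f(params0, x) at initialization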
2. Mathematical Definition
For a neural network $f(x; \theta)$ with parameters $\theta$:
The NTK is defined as
$$\Theta(x, x') = \nabla_\theta f(x; \theta)^\top \, \nabla_\theta f(x'; \theta),$$
where $\nabla_\theta f$ is the gradient (Jacobian) of the network's output w.r.t. the parameters; a sketch of this computation for a finite network follows below.
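For a finite network, this inner product of gradients can be computed directly (the "empirical NTK"). A minimal sketch in JAX with a toy two-layer model; the architecture, widths, and helper names are illustrative.

import jax
import jax.numpy as jnp

def init_params(key, d_in=2, width=256):
    k1, k2 = jax.random.split(key)
    # NTK-style scaling of the weights
    return {"W1": jax.random.normal(k1, (d_in, width)) / jnp.sqrt(d_in),
            "W2": jax.random.normal(k2, (width, 1)) / jnp.sqrt(width)}

def f(params, x):
    # Scalar output for a single input x
    return (jax.nn.relu(x @ params["W1"]) @ params["W2"])[0]

def empirical_ntk(params, x1, x2):
    # Theta(x1, x2) = <grad_theta f(x1; theta), grad_theta f(x2; theta)>
    g1 = jax.grad(f)(params, x1)
    g2 = jax.grad(f)(params, x2)
    return sum(jnp.vdot(a, b)
               for a, b in zip(jax.tree_util.tree_leaves(g1),
                               jax.tree_util.tree_leaves(g2)))

params = init_params(jax.random.PRNGKey(0))
print(empirical_ntk(params, jnp.array([1.0, -0.5]), jnp.array([0.3, 0.8])))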
At Infinite Width:
The NTK becomes deterministic and stays constant during training (lazy training regime).
Training reduces to solving the resulting linear dynamics; under gradient flow on the squared loss,
$$f_t(x) = f_0(x) + \Theta(x, X)\,\Theta(X, X)^{-1}\bigl(I - e^{-\eta\,\Theta(X, X)\,t}\bigr)\bigl(Y - f_0(X)\bigr),$$
where $f_0$ is the network at initialization and $(X, Y)$ are the training inputs and targets.
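As $t \to \infty$ the exponential term vanishes and the solution reduces to kernel ("NTK") regression. A minimal sketch, assuming the kernel matrices and initial outputs are already available; all variable names here are illustrative.

import jax.numpy as jnp

def ntk_regression(ntk_test_train, ntk_train_train, Y, f0_train, f0_test):
    # f_inf(x) = f_0(x) + Theta(x, X) Theta(X, X)^{-1} (Y - f_0(X))
    correction = jnp.linalg.solve(ntk_train_train, Y - f0_train)
    return f0_test + ntk_test_train @ correction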
3. Key Results
Convergence Guarantees:
Infinitely wide networks trained with gradient descent converge to global minima if the NTK is positive definite.
Generalization:
NTK theory explains why wide networks generalize despite overparameterization.
Linearized Training Dynamics:
The network’s evolution can be approximated by the linearized (kernel) gradient-descent update
$$f_{t+1}(X) \approx f_t(X) - \eta\,\Theta(X, X)\,\nabla_{f_t(X)} \mathcal{L}\bigl(f_t(X), Y\bigr),$$
where $\eta$ is the learning rate and $\mathcal{L}$ is the training loss; a sketch for the squared loss follows below.
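A minimal sketch of these dynamics for the squared loss, whose gradient with respect to the outputs is $f_t(X) - Y$; here ntk is a precomputed training-set kernel matrix and f0 the outputs at initialization (illustrative names).

import jax.numpy as jnp

def linearized_gd(ntk, f0, Y, lr=1e-3, steps=1000):
    # f_{t+1}(X) = f_t(X) - lr * Theta(X, X) @ (f_t(X) - Y)
    f = f0
    for _ in range(steps):
        f = f - lr * ntk @ (f - Y)
    return f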
4. Practical Implications
NTK Regime:
Networks behave like linear models when width ≫ depth (e.g., wide ResNets, MLPs).
Explains success of random feature models and shallow networks.
Beyond NTK:
Finite-width networks deviate from NTK predictions (feature learning becomes important).
Modern architectures (Transformers, GNNs) can also be analyzed with NTK extensions.
5. Limitations
Finite-Width Networks: NTK assumptions break down when the width is not extremely large.
Feature Learning: NTK ignores non-linear feature adaptation (critical in deep narrow networks).
Kernel Computability: Exact NTK is expensive to compute for large architectures.
6. Code Example (NTK Approximation)
import jax.numpy as jnp
from neural_tangents import stax

# Define an infinite-width network (its NTK is computed in closed form)
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512),
    stax.Relu(),
    stax.Dense(1)
)

# X_train: training inputs of shape (n_samples, n_features); placeholder data for illustration
X_train = jnp.ones((10, 5))

# Compute the NTK matrix on the training set
ntk = kernel_fn(X_train, X_train, 'ntk')
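A short usage example building on the snippet above: with hypothetical training targets and test inputs, NTK regression predictions follow from the same kernel_fn (this simplified form drops the $f_0$ terms from Section 2, i.e. it assumes the initial outputs are negligible).

# Hypothetical targets and test inputs
Y_train = jnp.ones((10, 1))
X_test = jnp.ones((3, 5))

ntk_test_train = kernel_fn(X_test, X_train, 'ntk')
preds = ntk_test_train @ jnp.linalg.solve(ntk, Y_train)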
7. Follow-Up Work
NTK for Deep Networks: Arora et al., 2019
NTK and Generalization: Cao & Gu, 2019
Beyond NTK (Feature Learning): Yang & Hu, 2021
Why NTK Matters
Connects neural networks to classical kernel methods.
Provides theoretical guarantees for wide networks.
Inspires new optimization techniques (e.g., NTK-aware initialization).
For deeper analysis, see the original paper or the Neural Tangents library.
Link to paper: https://arxiv.org/pdf/1806.07572