
From Book: VC dimension (Vapnik, 1998)

VC dimension (Vapnik-Chervonenkis dimension) is based on the foundational work by Vladimir Vapnik and Alexey Chervonenkis (1971, with key refinements in Vapnik's 1998 book "Statistical Learning Theory"):


1. What is VC Dimension?

The VC dimension is a measure of the capacity (complexity) of a hypothesis class (e.g., neural networks, decision trees). It is the size of the largest set of points that the hypothesis class can shatter, i.e., classify correctly under every possible labeling of those points.

  • Key Idea: A model with high VC dimension can fit more complex patterns but risks overfitting.

  • Formal Definition:
    A hypothesis class H has VC dimension d if there exists a set of d points that can be shattered by H, but no set of d+1 points can be shattered.


2. Intuitive Example

Example: Linear Classifiers in 2D

  • VC dimension = 3: Can shatter any 3 non-collinear points (all 2³ = 8 labelings).

  • But fails for 4 points: No line can realize all 2⁴ = 16 labelings of 4 points (e.g., the XOR labeling); a brute-force check of both claims is sketched below.


Linear classifiers shatter 3 points but not 4 (VC dim = 3).
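
To make the shattering argument concrete, here is a minimal sketch (assuming NumPy and SciPy are available) that enumerates every ±1 labeling of a small point set and checks linear separability with a feasibility LP. The point sets and helper names are illustrative, not from the book.

```python
# Minimal sketch: brute-force test of whether linear classifiers shatter a point set.
# Assumes NumPy and SciPy; the helper names are illustrative.
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """True if some hyperplane w.x + b = 0 strictly separates the +1 and -1 points."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    n, d = X.shape
    # Feasibility LP: find (w, b) with y_i * (w.x_i + b) >= 1 for all i.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.status == 0  # feasible <=> separable

def shattered(points):
    """True if every possible +/-1 labeling of the points is linearly separable."""
    return all(linearly_separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))          # 3 non-collinear points -> True
print(shattered([(0, 0), (1, 1), (0, 1), (1, 0)]))  # 4 points (XOR corners) -> False
```

For three non-collinear points all 8 labelings come back separable; for the four corners of a square the XOR labeling fails, which is exactly the "VC dimension = 3" statement above.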


3. Key Theorems

(1) Generalization Bound

Vapnik’s theorem links VC dimension to generalization error. With probability at least 1 − δ:

\text{Test Error} \;\le\; \text{Train Error} + O\!\left(\sqrt{\frac{d + \log(1/\delta)}{n}}\right)

where:

  • d = VC dimension,

  • n = sample size,

  • δ = failure probability (the bound holds with confidence 1 − δ).

Interpretation:

  • Higher d → Larger gap between train/test error (risk of overfitting).

  • To generalize well, ensure n ≫ d.

(2) Sample Complexity

The number of samples needed for good generalization scales with VC dimension:

n = \Omega\!\left(\frac{d + \log(1/\delta)}{\epsilon^{2}}\right)

where ϵ is the desired error tolerance.
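
As a rough illustration, both bounds can be turned into a back-of-the-envelope calculator. The hidden constants in O(·) and Ω(·) are taken to be 1 here, which is an assumption made purely for illustration.

```python
# Plug-in versions of the simplified bounds above (hidden constants assumed to be 1).
import math

def generalization_gap(d, n, delta=0.05):
    """Rough upper estimate of (test error - train error)."""
    return math.sqrt((d + math.log(1.0 / delta)) / n)

def samples_needed(d, epsilon, delta=0.05):
    """Rough sample size for a generalization gap of about epsilon."""
    return math.ceil((d + math.log(1.0 / delta)) / epsilon ** 2)

# Linear classifier in 2D (d = 3): the estimated gap shrinks like 1/sqrt(n).
for n in (10, 100, 1000, 10000):
    print(n, round(generalization_gap(3, n), 3))

print(samples_needed(100, 0.1))  # roughly 10,300 samples for d = 100, epsilon = 0.1
```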


4. VC Dimension of Common Models

Model → VC dimension:

  • Linear classifiers in ℝᵈ: d + 1

  • Neural networks: grows with the number of weights (parameters)

  • Decision trees: grows with tree depth

  • SVM (RBF kernel): infinite (but still generalizes!)

5. Why VC Dimension Matters

  • Model Selection: Prefer models with lower VC dimension for small datasets.

  • Explains Overfitting: High VC dimension → Need more data to generalize.

  • Theoretical Foundation: Basis for PAC learning theory.


6. Limitations

  • Loose Bounds: VC bounds are often too pessimistic (e.g., SVMs generalize despite infinite VC dim).

  • Modern Extensions:

    • Rademacher Complexity: Tighter bounds for today’s models.

    • Implicit Bias: Gradient descent favors "simple" solutions even in high-capacity models.


7. Example: Calculating VC Dimension

Problem: Prove that linear classifiers in ℝ² have VC dimension 3.
Proof:

  1. Shattering 3 points: For any 3 non-collinear points, all 2³ = 8 labelings can be achieved by a line.

  2. Cannot shatter 4 points: No line can classify the XOR labeling of 4 points, as the short argument below shows.
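
For step 2, the standard algebraic argument is short enough to spell out. Suppose some line sign(w₁x₁ + w₂x₂ + b) realized the XOR labeling, assigning − to (0,0) and (1,1) and + to (1,0) and (0,1). Then

b < 0, \qquad w_1 + w_2 + b < 0, \qquad w_1 + b > 0, \qquad w_2 + b > 0.

Adding the two "+" inequalities gives w₁ + w₂ + 2b > 0, while adding the two "−" inequalities gives w₁ + w₂ + 2b < 0, a contradiction. Hence no line realizes the XOR labeling, and 4 points cannot be shattered.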


8. Original Paper & Resources

  • Foundational Paper:
    Vapnik & Chervonenkis (1971), "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities".

  • Modern Treatment:
    Vapnik (1998), "Statistical Learning Theory" (Chapter 4).

  • Interactive Tutorial: VC Dimension Explained Visually.


Key Takeaway

VC dimension formalizes the tradeoff between model complexity and generalization, laying the groundwork for modern ML theory. While newer tools (e.g., Rademacher complexity) refine these ideas, VC theory remains fundamental.


Related: 

VC Bounds: A Formal Explanation of Generalization

VC (Vapnik-Chervonenkis) bounds are theoretical guarantees that predict how well a machine learning model will generalize to unseen data, based on its VC dimension (a measure of model complexity). Introduced by Vapnik and Chervonenkis in the 1970s, these bounds connect model capacity, training data size, and generalization error.


1. Core Idea of VC Bounds

VC bounds quantify the worst-case gap between training error and test error for a hypothesis class H. They show that:

  • Models with high VC dimension (complexity) need more data to generalize well.

  • Models with low VC dimension (simplicity) generalize well even with less data.


2. Key VC Inequality

The most celebrated VC bound states that with probability at least 1 − δ:

\text{Test Error} \;\le\; \text{Training Error} + O\!\left(\sqrt{\frac{d + \log(1/\delta)}{n}}\right)

where:

  • d = VC dimension of H,

  • n = number of training samples,

  • δ = failure probability (e.g., δ = 0.05 for 95% confidence).

Interpretation:

  • The term √(d/n) governs the generalization gap.

  • If d is large, you need n ≫ d to keep the gap small.


3. Intuition Behind the Bound

  • Shattering: If a model can fit all possible labelings of some set of d points, its VC dimension is at least d.

  • Overfitting Risk: High VC dimension means the model can memorize noise, leading to poor generalization.

  • Sample Complexity: To ensure generalization, the number of samples n must grow with d.

Example:

  • A linear classifier in ℝ² has VC dimension d = 3.

  • If n=1000, the VC bound ensures the test error is close to training error.

  • If n=10, the bound becomes loose (high risk of overfitting).


4. VC Bound for Classification (Formal)

For a binary classifier h ∈ H, where H has VC dimension d, the following holds with probability at least 1 − δ:

\text{Generalization Error} \;\le\; \text{Empirical Error} + \sqrt{\frac{d\left(\log(2n/d) + 1\right) + \log(4/\delta)}{n}}

This shows that:

  • The gap shrinks as n → ∞.

  • The gap grows with d.
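
To see how this bound behaves numerically, here is a quick sketch that evaluates the square-root term at face value (natural logarithms assumed), using the d = 3 linear classifier from the example in Section 3.

```python
# Sketch: evaluate the formal VC bound's square-root term at face value (natural logs).
import math

def vc_gap(d, n, delta=0.05):
    """Worst-case gap between generalization error and empirical error."""
    return math.sqrt((d * (math.log(2 * n / d) + 1) + math.log(4 / delta)) / n)

print(round(vc_gap(3, 1000), 2))  # ~0.16 -> test error stays close to training error
print(round(vc_gap(3, 10), 2))    # ~1.14 -> vacuous (errors are at most 1 anyway)
```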


5. Practical Implications

(1) Model Selection

  • Prefer simpler models (low d) when data is scarce.

  • Deep learning paradox: Modern neural networks have huge d but still generalize (due to implicit regularization).

(2) Sample Size Estimation

To achieve generalization error ϵ, you need roughly:

n \gtrsim \frac{d}{\epsilon^{2}}

Example: If d = 100 and ε = 0.1, then n ≈ 100 / 0.1² = 10,000.
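
The linear-in-d, quadratic-in-1/ε scaling is easy to tabulate (constants ignored, as above):

```python
# Sketch: how the rough requirement n ~ d / eps^2 scales (constants ignored).
for d in (10, 100, 1000):
    for eps in (0.1, 0.05, 0.01):
        print(f"d = {d:4d}   eps = {eps:<5}   n ~ {round(d / eps ** 2):>10,}")
```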

(3) Structural Risk Minimization (SRM)

Vapnik’s framework for model selection:

  1. Choose a nested hierarchy of models H₁ ⊂ H₂ ⊂ ⋯ with increasing d.

  2. Pick the model that minimizes:

    Training Error + VC Penalty Term
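
As a minimal sketch of the SRM rule, suppose we already know each candidate class's training error and VC dimension; the candidate numbers below are invented for illustration, and the penalty is the square-root term from the bound in Section 4.

```python
# Sketch of Structural Risk Minimization: pick the class minimizing
# training error + VC penalty. Candidate numbers are invented for illustration.
import math

def vc_penalty(d, n, delta=0.05):
    """Square-root confidence term from the VC bound in Section 4."""
    return math.sqrt((d * (math.log(2 * n / d) + 1) + math.log(4 / delta)) / n)

n = 2000
candidates = [          # nested hierarchy H1 ⊂ H2 ⊂ H3 with growing VC dimension
    ("H1: linear",       0.18, 3),
    ("H2: quadratic",    0.11, 6),
    ("H3: deep network", 0.01, 500),
]

for name, err, d in candidates:
    print(f"{name:17s} train = {err:.2f}   penalty = {vc_penalty(d, n):.2f}   "
          f"total = {err + vc_penalty(d, n):.2f}")

best = min(candidates, key=lambda m: m[1] + vc_penalty(m[2], n))
print("SRM picks:", best[0])   # H2: best trade-off between fit and capacity
```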

6. Limitations of VC Bounds

  • Overly Pessimistic: Actual generalization is often better than VC bounds predict (e.g., deep nets).

  • Doesn’t Explain SGD: Modern training relies on optimization dynamics, not just hypothesis class.

  • Alternative Theories:

    • Rademacher Complexity: Tighter bounds for specific data distributions.

    • PAC-Bayes: Incorporates prior knowledge.


7. Example: VC Bound Calculation

Problem: A model has VC dimension d = 50. How many samples n are needed to ensure the generalization gap is at most 0.1 with 95% confidence?

Solution:
Using the simplified bound ε ≈ √(d/n):

0.1 \approx \sqrt{\frac{50}{n}} \;\Longrightarrow\; n \approx \frac{50}{0.1^{2}} = 5{,}000


8. Key Papers & Resources

  • Original VC Paper:
    Vapnik & Chervonenkis (1971), "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities".

  • Modern Treatment:
    Vapnik (1998), "Statistical Learning Theory" (Chapter 4).

  • Visual Guide: VC Dimension Explained.


Summary

VC bounds provide worst-case guarantees on generalization, showing that:
✅ Simpler models generalize better with limited data.
✅ Complex models need more data, roughly in proportion to their VC dimension.
✅ Inspired modern theory (e.g., Rademacher complexity).


