
From Book: VC dimension (Vapnik, 1998)

VC dimension (Vapnik-Chervonenkis dimension) is based on the foundational work by Vladimir Vapnik and Alexey Chervonenkis (1971, with key refinements in Vapnik's 1998 book "Statistical Learning Theory"):


1. What is VC Dimension?

The VC dimension is a measure of the capacity (complexity) of a hypothesis class (e.g., neural networks, decision trees). It is the size of the largest set of points that the hypothesis class can shatter, i.e., classify correctly under every possible labeling of those points.

  • Key Idea: A model with high VC dimension can fit more complex patterns but risks overfitting.

  • Formal Definition:
    A hypothesis class H has VC dimension d if there exists a set of d points that can be shattered by H, but no set of d+1 points can be shattered.


2. Intuitive Example

Example: Linear Classifiers in 2D

  • VC dimension = 3: Can shatter any 3 non-collinear points (all 2³ = 8 labelings).

  • But fails for 4 points: No line can realize all 2⁴ = 16 labelings of 4 points (e.g., the XOR labeling); a brute-force check of both claims is sketched below.


Linear classifiers shatter 3 points but not 4 (VC dim = 3).
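
To make the shattering argument concrete, here is a minimal sketch (assuming NumPy and SciPy are available) that enumerates every ±1 labeling of a small point set and checks linear separability with a feasibility LP. The point sets and helper names are illustrative, not from the book.

```python
# Minimal sketch: brute-force test of whether linear classifiers shatter a point set.
# Assumes NumPy and SciPy; the helper names are illustrative.
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """True if some hyperplane w.x + b = 0 strictly separates the +1 and -1 points."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    n, d = X.shape
    # Feasibility LP: find (w, b) with y_i * (w.x_i + b) >= 1 for all i.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.status == 0  # feasible <=> separable

def shattered(points):
    """True if every possible +/-1 labeling of the points is linearly separable."""
    return all(linearly_separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))          # 3 non-collinear points -> True
print(shattered([(0, 0), (1, 1), (0, 1), (1, 0)]))  # 4 points (XOR corners) -> False
```

For three non-collinear points all 8 labelings come back separable; for the four corners of a square the XOR labeling fails, which is exactly the "VC dimension = 3" statement above.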


3. Key Theorems

(1) Generalization Bound

Vapnik’s theorem links VC dimension to generalization error. With probability at least 1 − δ:

\text{Test Error} \;\le\; \text{Train Error} + O\!\left(\sqrt{\frac{d + \log(1/\delta)}{n}}\right)

where:

  • d = VC dimension,

  • n = sample size,

  • δ = failure probability (the bound holds with confidence 1 − δ).

Interpretation:

  • Higher d → Larger gap between train/test error (risk of overfitting).

  • To generalize well, ensure n ≫ d.

(2) Sample Complexity

The number of samples needed for good generalization scales with VC dimension:

n = \Omega\!\left(\frac{d + \log(1/\delta)}{\epsilon^{2}}\right)

where ϵ is the desired error tolerance.
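
As a rough illustration, both bounds can be turned into a back-of-the-envelope calculator. The hidden constants in O(·) and Ω(·) are taken to be 1 here, which is an assumption made purely for illustration.

```python
# Plug-in versions of the simplified bounds above (hidden constants assumed to be 1).
import math

def generalization_gap(d, n, delta=0.05):
    """Rough upper estimate of (test error - train error)."""
    return math.sqrt((d + math.log(1.0 / delta)) / n)

def samples_needed(d, epsilon, delta=0.05):
    """Rough sample size for a generalization gap of about epsilon."""
    return math.ceil((d + math.log(1.0 / delta)) / epsilon ** 2)

# Linear classifier in 2D (d = 3): the estimated gap shrinks like 1/sqrt(n).
for n in (10, 100, 1000, 10000):
    print(n, round(generalization_gap(3, n), 3))

print(samples_needed(100, 0.1))  # roughly 10,300 samples for d = 100, epsilon = 0.1
```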


4. VC Dimension of Common Models

Model → VC dimension:

  • Linear classifiers in ℝᵈ: d + 1

  • Neural networks: grows with the number of weights (parameters)

  • Decision trees: grows with tree depth

  • SVM (RBF kernel): infinite (but still generalizes!)

5. Why VC Dimension Matters

  • Model Selection: Prefer models with lower VC dimension for small datasets.

  • Explains Overfitting: High VC dimension → Need more data to generalize.

  • Theoretical Foundation: Basis for PAC learning theory.


6. Limitations

  • Loose Bounds: VC bounds are often too pessimistic (e.g., SVMs generalize despite infinite VC dim).

  • Modern Extensions:

    • Rademacher Complexity: Tighter bounds for today’s models.

    • Implicit Bias: Gradient descent favors "simple" solutions even in high-capacity models.


7. Example: Calculating VC Dimension

Problem: Prove that linear classifiers in ℝ² have VC dimension 3.
Proof:

  1. Shattering 3 points: For any 3 non-collinear points, all 2³ = 8 labelings can be achieved by a line.

  2. Cannot shatter 4 points: No line can classify the XOR labeling of 4 points, as the short argument below shows.
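
For step 2, the standard algebraic argument is short enough to spell out. Suppose some line sign(w₁x₁ + w₂x₂ + b) realized the XOR labeling, assigning − to (0,0) and (1,1) and + to (1,0) and (0,1). Then

b < 0, \qquad w_1 + w_2 + b < 0, \qquad w_1 + b > 0, \qquad w_2 + b > 0.

Adding the two "+" inequalities gives w₁ + w₂ + 2b > 0, while adding the two "−" inequalities gives w₁ + w₂ + 2b < 0, a contradiction. Hence no line realizes the XOR labeling, and 4 points cannot be shattered.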


8. Original Paper & Resources

  • Foundational Paper:
    Vapnik & Chervonenkis (1971), "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities".

  • Modern Treatment:
    Vapnik (1998), "Statistical Learning Theory" (Chapter 4).

  • Interactive Tutorial: VC Dimension Explained Visually.


Key Takeaway

VC dimension formalizes the tradeoff between model complexity and generalization, laying the groundwork for modern ML theory. While newer tools (e.g., Rademacher complexity) refine these ideas, VC theory remains fundamental.


Related: 

VC Bounds: A Formal Explanation of Generalization

VC (Vapnik-Chervonenkis) bounds are theoretical guarantees that predict how well a machine learning model will generalize to unseen data, based on its VC dimension (a measure of model complexity). Introduced by Vapnik and Chervonenkis in the 1970s, these bounds connect model capacity, training data size, and generalization error.


1. Core Idea of VC Bounds

VC bounds quantify the worst-case gap between training error and test error for a hypothesis class H. They show that:

  • Models with high VC dimension (complexity) need more data to generalize well.

  • Models with low VC dimension (simplicity) generalize well even with less data.


2. Key VC Inequality

The most celebrated VC bound states that with probability at least 1 − δ:

\text{Test Error} \;\le\; \text{Training Error} + O\!\left(\sqrt{\frac{d + \log(1/\delta)}{n}}\right)

where:

  • d = VC dimension of H,

  • n = number of training samples,

  • δ = failure probability (e.g., δ = 0.05 for 95% confidence).

Interpretation:

  • The term √(d/n) governs the generalization gap.

  • If d is large, you need n ≫ d to keep the gap small.


3. Intuition Behind the Bound

  • Shattering: If a model can fit all possible labelings of some set of d points, its VC dimension is at least d.

  • Overfitting Risk: High VC dimension means the model can memorize noise, leading to poor generalization.

  • Sample Complexity: To ensure generalization, the number of samples n must grow with d.

Example:

  • A linear classifier in ℝ² has VC dimension d = 3.

  • If n=1000, the VC bound ensures the test error is close to training error.

  • If n=10, the bound becomes loose (high risk of overfitting).


4. VC Bound for Classification (Formal)

For a binary classifier h ∈ H, where H has VC dimension d, the following holds with probability at least 1 − δ:

\text{Generalization Error} \;\le\; \text{Empirical Error} + \sqrt{\frac{d\left(\log(2n/d) + 1\right) + \log(4/\delta)}{n}}

This shows that:

  • The gap shrinks as n → ∞.

  • The gap grows with d.
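
To see how this bound behaves numerically, here is a quick sketch that evaluates the square-root term at face value (natural logarithms assumed), using the d = 3 linear classifier from the example in Section 3.

```python
# Sketch: evaluate the formal VC bound's square-root term at face value (natural logs).
import math

def vc_gap(d, n, delta=0.05):
    """Worst-case gap between generalization error and empirical error."""
    return math.sqrt((d * (math.log(2 * n / d) + 1) + math.log(4 / delta)) / n)

print(round(vc_gap(3, 1000), 2))  # ~0.16 -> test error stays close to training error
print(round(vc_gap(3, 10), 2))    # ~1.14 -> vacuous (errors are at most 1 anyway)
```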


5. Practical Implications

(1) Model Selection

  • Prefer simpler models (low d) when data is scarce.

  • Deep learning paradox: Modern neural networks have huge d but still generalize (due to implicit regularization).

(2) Sample Size Estimation

To achieve generalization error ϵ, you need roughly:

n \gtrsim \frac{d}{\epsilon^{2}}

Example: If d = 100 and ε = 0.1, then n ≈ 100 / 0.1² = 10,000.
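
The linear-in-d, quadratic-in-1/ε scaling is easy to tabulate (constants ignored, as above):

```python
# Sketch: how the rough requirement n ~ d / eps^2 scales (constants ignored).
for d in (10, 100, 1000):
    for eps in (0.1, 0.05, 0.01):
        print(f"d = {d:4d}   eps = {eps:<5}   n ~ {round(d / eps ** 2):>10,}")
```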

(3) Structural Risk Minimization (SRM)

Vapnik’s framework for model selection:

  1. Choose a nested hierarchy of models H₁ ⊂ H₂ ⊂ ⋯ with increasing d.

  2. Pick the model that minimizes:

    Training Error + VC Penalty Term
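
As a minimal sketch of the SRM rule, suppose we already know each candidate class's training error and VC dimension; the candidate numbers below are invented for illustration, and the penalty is the square-root term from the bound in Section 4.

```python
# Sketch of Structural Risk Minimization: pick the class minimizing
# training error + VC penalty. Candidate numbers are invented for illustration.
import math

def vc_penalty(d, n, delta=0.05):
    """Square-root confidence term from the VC bound in Section 4."""
    return math.sqrt((d * (math.log(2 * n / d) + 1) + math.log(4 / delta)) / n)

n = 2000
candidates = [          # nested hierarchy H1 ⊂ H2 ⊂ H3 with growing VC dimension
    ("H1: linear",       0.18, 3),
    ("H2: quadratic",    0.11, 6),
    ("H3: deep network", 0.01, 500),
]

for name, err, d in candidates:
    print(f"{name:17s} train = {err:.2f}   penalty = {vc_penalty(d, n):.2f}   "
          f"total = {err + vc_penalty(d, n):.2f}")

best = min(candidates, key=lambda m: m[1] + vc_penalty(m[2], n))
print("SRM picks:", best[0])   # H2: best trade-off between fit and capacity
```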

6. Limitations of VC Bounds

  • Overly Pessimistic: Actual generalization is often better than VC bounds predict (e.g., deep nets).

  • Doesn’t Explain SGD: Modern training relies on optimization dynamics, not just hypothesis class.

  • Alternative Theories:

    • Rademacher Complexity: Tighter bounds for specific data distributions.

    • PAC-Bayes: Incorporates prior knowledge.


7. Example: VC Bound Calculation

Problem: A model has VC dimension d = 50. How many samples n are needed to ensure the generalization gap is at most 0.1 with 95% confidence?

Solution:
Using the simplified bound ε ≈ √(d/n):

0.1 \approx \sqrt{\frac{50}{n}} \;\Longrightarrow\; n \approx \frac{50}{0.1^{2}} = 5{,}000


8. Key Papers & Resources

  • Original VC Paper:
    Vapnik & Chervonenkis (1971), "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities".

  • Modern Treatment:
    Vapnik (1998), "Statistical Learning Theory" (Chapter 4).

  • Visual Guide: VC Dimension Explained.


Summary

VC bounds provide worst-case guarantees on generalization, showing that:
✅ Simpler models generalize better with limited data.
✅ Complex models need more data, roughly in proportion to their VC dimension.
✅ Inspired modern theory (e.g., Rademacher complexity).


