
Regular R² vs Adjusted R²

Understanding R-squared (The Coefficient of Determination) 


What Does R² Measure?

R² tells you the proportion of the variance in the target variable that your model can explain; it is a relative measure, unlike error metrics, which report the raw difference between predicted and actual values. It provides a score between 0 and 1, though it can be negative for very poor models.

  • R² = 1: A perfect model. It explains 100% of the variability in the data.

  • R² = 0: A useless model. It performs no better than a baseline model that simply predicts the average of the target variable.

  • R² < 0: A very poor model. It performs worse than just predicting the average. This can happen when evaluating the model on new, unseen data.
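
As a quick illustration, here is a minimal sketch using scikit-learn's r2_score (assuming NumPy and scikit-learn are installed); the arrays are made-up numbers chosen only to reproduce the three cases above.

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])

# Near-perfect predictions: R² close to 1
print(r2_score(y_true, [3.1, 4.9, 7.0, 9.1]))   # ~0.998

# Always predicting the mean of y_true (6.0): R² = 0
print(r2_score(y_true, [6.0, 6.0, 6.0, 6.0]))   # 0.0

# Predictions worse than the mean baseline: R² < 0
print(r2_score(y_true, [9.0, 3.0, 10.0, 2.0]))  # negative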

R-squared (R²) and the coefficient of determination are two names for the exact same statistical measure. It's one of the most common metrics used to evaluate how well a regression model fits the data.

Why Two Names?

  • "Coefficient of Determination" is the formal statistical term. It accurately describes what the metric does: it determines the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

  • "R-squared" or "" is the common name and mathematical notation. The "R" comes from its relationship with Pearson's correlation coefficient (r). In a simple linear regression with one variable, R2 is literally the square of Pearson's r ().

In practice, the terms are used interchangeably. "R-squared" is common among practitioners for its brevity, while "coefficient of determination" is often used in formal academic papers.


The Math Behind R-squared 

The formula for R² compares how much variance the model leaves unexplained with the total variance in the data.

The Formula

R² = 1 - (Sum of Squared Residuals / Total Sum of Squares from the Mean)

  • SSres (Sum of Squared Residuals; a residual, or error, is the actual value minus the predicted value at each point): This is the error of your model. It's the sum of the squared differences between the actual values (yi) and your model's predicted values (ŷi).

  • SStot (Total Sum of Squares from the Mean): This represents the total variance in the data. It's the sum of the squared differences between the actual values (yi) and the mean of all actual values (ȳ).


ȳ (y with a bar on top) is read as "y-bar" and represents the MEAN, or average, of the observed values
ŷ (y with a hat/caret on top) is read as "y-hat" and represents the predicted/fitted values

So in conversation, you'd say:

"y-bar" for the mean
"y-hat" for the predictions

Why "Sum of Squares"?

The term "sum of squares" is literal. To measure variation, we can't just sum the differences from the mean (e.g., ȳ), because positive and negative differences would cancel each other out.

Solution: We square each difference to make it positive. Squaring also has the benefit of heavily penalizing larger errors. The "Total Sum of Squares" is the sum of the areas of these squares.


Regular R² vs. Adjusted R² 

While standard R² is useful, it has a critical flaw: it always increases as you add more variables to the model, even if those new variables are completely useless. This can be misleading and encourage overfitting.

Adjusted R² solves this problem by adding a penalty for each new variable included in the model.

The Problem Illustrated

Imagine predicting a house price with progressively more variables:

Model | Variables Added | Regular R² | Adjusted R²
Model 1 | Square footage | 0.70 | 0.69
Model 2 | + Number of bedrooms | 0.75 | 0.73
Model 3 | + Zip code | 0.80 | 0.77
Model 4 | + Owner's birthday | 0.81 | 0.76
Model 5 | + Favorite color | 0.82 | 0.74

Notice how Regular R² keeps rising, while Adjusted R² starts to drop when we add irrelevant "nonsense" variables, correctly signaling that the model is becoming unnecessarily complex.
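
The same pattern can be reproduced in code. The sketch below is a rough illustration (synthetic data, not the house-price example above): it fits a linear model, then keeps appending pure-noise columns; regular R² keeps creeping up on the training data while adjusted R² stalls or falls. It assumes NumPy and scikit-learn are available.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=(n, 1))                        # one genuinely useful feature
y = 3.0 * x[:, 0] + rng.normal(scale=2.0, size=n)  # target driven only by x

X = x
for _ in range(5):
    p = X.shape[1]
    r2 = LinearRegression().fit(X, y).score(X, y)  # regular R² on the training data
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)     # adjusted R²
    print(f"p={p}: R²={r2:.3f}  adjusted R²={adj:.3f}")
    X = np.column_stack([X, rng.normal(size=n)])   # append a useless noise column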

The Formulas Compared

Component | Regular R² | Adjusted R²
Formula | R² = 1 - (SS_res / SS_tot) | R²adj = 1 - [(1 - R²) × (n - 1)/(n - p - 1)]
Meaning | Measures raw explanatory power. | Balances explanatory power against model complexity.

Here, n is the number of data points (samples), and p is the number of predictors (features) in the model. The term (n-1)/(n-p-1) acts as the penalty factor.
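
As a small sketch, the penalty can be wrapped in a helper function (the name adjusted_r2 is just for illustration); it reproduces the worked example given later in this post (R² = 0.80, n = 100, p = 5 → about 0.789).

def adjusted_r2(r2, n, p):
    """Adjusted R²: 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.80, 100, 5))   # ≈ 0.789, matching the worked example below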

When to Use Each

  • Use Regular R² for simple linear regression (one variable) or when comparing models with the same number of variables.

  • Use Adjusted R² for multiple regression or whenever you're comparing models with a different number of variables. It's essential for model selection and protecting against overfitting.

Quick Decision Rule: Think of Regular R² as the raw test score and Adjusted R² as the score after a "curve" that accounts for the difficulty (complexity). For choosing the best model, you almost always want to use Adjusted R².


Appendix: Pearson's Correlation Coefficient (r)

Pearson's r is a measure that quantifies the strength and direction of a linear relationship between two continuous variables. Its value is always between -1 and +1.

  • r = +1: Perfect positive linear relationship.

  • r = 0: No linear relationship.

  • r = -1: Perfect negative linear relationship.

In simple linear regression, the connection is direct: R² = r². For example, if the correlation (r) between study hours and exam scores is +0.8, the R-squared (R²) would be 0.64, meaning that study hours explain 64% of the variance in exam scores.
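
A quick way to see this relationship is to fit a one-variable regression in NumPy and compare r² with R² computed from the sums of squares (the study-hours numbers below are made up for illustration):

import numpy as np

hours  = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])        # made-up study hours
scores = np.array([52.0, 60.0, 57.0, 71.0, 78.0, 80.0])  # made-up exam scores

r = np.corrcoef(hours, scores)[0, 1]             # Pearson's r

slope, intercept = np.polyfit(hours, scores, 1)  # simple linear regression
pred   = slope * hours + intercept
ss_res = np.sum((scores - pred) ** 2)
ss_tot = np.sum((scores - scores.mean()) ** 2)
r2     = 1 - ss_res / ss_tot

print(r ** 2, r2)   # the two values agree (up to floating-point rounding)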

More information (same concept, different words)

R² (R-squared) and the coefficient of determination are exactly the same thing. They are just two different names for the identical statistical measure.

Why Two Names?

"Coefficient of determination" is the formal statistical term that describes what the measure actually does - it determines how much of the variance is explained

"R-squared" or "R²" is the mathematical notation, where the "R" comes from the correlation coefficient (Pearson's r), and squaring it gives us this measure

The Connection

The relationship becomes clearer when you consider:

In simple linear regression, R² literally equals the square of the Pearson correlation coefficient (r) between predicted and actual values

Hence: R² = r²

Common Usage

In practice, you'll see both terms used interchangeably:

Academic papers might use "coefficient of determination" for formal precision

Data scientists and practitioners often just say "R-squared" for brevity

Documentation might write "R² (coefficient of determination)" to be clear

So when you see either term in the context of neural networks or any regression analysis, they're referring to the same metric that measures the proportion of variance explained by the model.

R² (R-squared) in the context of neural networks is a statistical measure that indicates how well the model's predictions match the actual data. It's borrowed from traditional statistics and represents the coefficient of determination.

What R² Measures

R² tells you the proportion of variance in the target variable that your neural network can explain. It ranges from 0 to 1 (though it can be negative for very poor models):

R² = 1: Perfect prediction - the model explains all variability in the target

R² = 0: The model performs no better than simply predicting the mean

R² < 0: The model performs worse than predicting the mean (possible with test data)

Mathematical Definition

R² is calculated as:

R² = 1 - (SS_res / SS_tot)

Where:

SS_res (residual sum of squares) = Σ(y_actual - y_predicted)²

SS_tot (total sum of squares) = Σ(y_actual - y_mean)²

Use in Neural Networks

In neural networks, R² is primarily used for:

Regression tasks - It's most appropriate when your network outputs continuous values

Model evaluation - Comparing how well different architectures or hyperparameters perform

Interpretability - Providing a more intuitive metric than raw loss values (MSE, MAE)


Important Considerations

While R² is useful, it has limitations in neural network contexts:

It's not suitable for classification tasks (use accuracy, F1-score, etc. instead)

High R² doesn't necessarily mean your model generalizes well (overfitting can inflate R²)

For complex, non-linear relationships that neural networks often model, R² might not capture all aspects of model performance

Unlike simpler models, adding parameters to neural networks doesn't automatically increase R²


In practice, R² serves as one metric among several for evaluating regression neural networks, particularly useful when you need to communicate model performance to stakeholders familiar with traditional statistical measures.

The regular R² is the simpler, original formula:

R² = 1 - (SS_res / SS_tot)

Or equivalently:

R² = (SS_explained / SS_tot)

Where:

  • SS_res = Σ(y - ŷ)² = Sum of squared residuals (errors)
  • SS_tot = Σ(y - ȳ)² = Total sum of squares from mean
  • SS_explained = SS_tot - SS_res = Variance explained by model

What We're Actually Squaring

We're squaring the distance of each data point from the mean:

Total Sum of Squares = Σ(y - ȳ)²
                          ↑
                    This gets squared!

Step-by-Step Example

Let's say you have test scores: 70, 80, 90

Step 1: Find the mean

ȳ = (70 + 80 + 90) / 3 = 80

Step 2: Find each difference from mean

Student 1: 70 - 80 = -10
Student 2: 80 - 80 = 0
Student 3: 90 - 80 = +10

Step 3: Square each difference

Student 1: (-10)² = 100  ← This is a "square"
Student 2: (0)² = 0       ← This is a "square"
Student 3: (+10)² = 100   ← This is a "square"

Step 4: Sum all the squares

Total Sum of Squares = 100 + 0 + 100 = 200
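
The same arithmetic in a couple of lines of NumPy, just to confirm the hand calculation:

import numpy as np

y = np.array([70.0, 80.0, 90.0])
print(y.mean())                        # 80.0
print(np.sum((y - y.mean()) ** 2))     # 200.0, as computed by hand above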

Why Do We Square?

Problem Without Squaring:

Differences: -10, 0, +10
Sum: -10 + 0 + 10 = 0  ← Cancels out!

The negative and positive differences cancel each other, suggesting no variance when there clearly is!

Solution With Squaring:

Squared differences: 100, 0, 100
Sum: 100 + 0 + 100 = 200  ← Shows actual spread!

Visual Representation

Imagine each squared difference as an actual square:

Student 1 (70):          Student 3 (90):
┌──────────┐             ┌──────────┐
│          │             │          │
│   100    │ 10×10       │   100    │ 10×10
│          │             │          │
└──────────┘             └──────────┘
     ↑                          ↑
  Area = (-10)²              Area = 10²

Student 2 (80):
• (no square, difference = 0)

The "Total Sum of Squares" is literally the sum of all these square areas!

Why "Squares" Instead of Absolute Values?

We could use absolute values: |y - ȳ|

But squaring has advantages:

  1. Mathematical: Derivatives are easier (important for optimization)
  2. Statistical: Links to variance and standard deviation
  3. Penalizes outliers: Large errors get extra weight
    • Difference of 2: squared = 4
    • Difference of 10: squared = 100 (25× more penalty!)

In Context of R²

SS_tot = Σ(y - ȳ)²     = Total squared distances from mean
SS_res = Σ(y - ŷ)²     = Total squared distances from predictions
SS_explained = SS_tot - SS_res = Variance explained by model

R² = SS_explained/SS_tot = Proportion of "squares" explained
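
A short check with made-up numbers that the two expressions for R² agree (using this post's definition SS_explained = SS_tot - SS_res):

import numpy as np

y     = np.array([10.0, 12.0, 15.0, 18.0])   # made-up observed values
y_hat = np.array([11.0, 12.5, 14.0, 18.5])   # made-up predictions

ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
ss_explained = ss_tot - ss_res

print(1 - ss_res / ss_tot)      # first form
print(ss_explained / ss_tot)    # second form - the same number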

Real-World Analogy

Think of it like measuring how "wrong" each guess is:

  • Small miss (off by 2): Penalty = 4
  • Medium miss (off by 5): Penalty = 25
  • Big miss (off by 10): Penalty = 100

The "sum of squares" is your total penalty score. The model's job is to minimize this penalty!

The Name's Origin

The term comes from early statistics (early 1900s) when calculations were done by hand. Statisticians would literally:

  1. Calculate differences
  2. Square them (multiply by themselves)
  3. Sum up all these squared values

Hence: "Sum of Squares" = Adding up all the squared differences

So when you hear "Total Sum of Squares," think: "Total amount of squared variation in the data" - it's measuring how spread out your data is from its average! 


Regular R² vs Adjusted R²

R² and Adjusted R² - Side-by-Side Comparison

R² (R-squared)

R² = 1 - (SSres/SStot)

R² = 1 - [Σ(yi - ŷi)²] / [Σ(yi - ȳ)²]

Adjusted R²

R²adj = 1 - [(1 - R²) × (n - 1) / (n - p - 1)]

R²adj = 1 - [(SSres/(n - p - 1)) / (SStot/(n - 1))]

Breaking Down the Components

Component | Symbol | Meaning
SSres | Σ(yi - ŷi)² | Sum of squared residuals (errors)
SStot | Σ(yi - ȳ)² | Total sum of squares from the mean (total variance)
n | n | Number of observations/samples
p | p | Number of predictors/features (excluding the intercept)
yi | yi | Actual value
ŷi | ŷi | Predicted value
ȳ | ȳ | Mean of actual values

Alternative Form - Showing the Relationship

Starting from R²:

R² = 1 - (SSres/SStot)

Adjusted R² modifies this by accounting for degrees of freedom:

R²adj = 1 - [(SSres/SStot) × (n - 1)/(n - p - 1)]

Which can be rewritten as:

R²adj = 1 - [(1 - R²) × (n - 1)/(n - p - 1)]
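
A quick numerical check, with arbitrary values, that the two adjusted-R² forms give the same answer:

ss_res, ss_tot = 40.0, 200.0    # arbitrary sums of squares
n, p = 50, 3                    # arbitrary sample size and predictor count

r2 = 1 - ss_res / ss_tot

adj_from_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_from_ss = 1 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))

print(adj_from_r2, adj_from_ss)   # identical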

Key Mathematical Differences

Penalty Term

The adjustment factor is: (n - 1)/(n - p - 1)

  • When p = 0 (no predictors): R²adj = R²
  • As p increases: The denominator (n - p - 1) decreases, making the penalty larger
  • The penalty becomes more severe with smaller sample sizes

Numerical Example

Let's say:

  • R² = 0.80
  • n = 100 samples
  • p = 5 predictors

Calculating R²:

R² = 0.80 (given)

Calculating Adjusted R²:

R²adj = 1 - [(1 - 0.80) × (100 - 1)/(100 - 5 - 1)]
R²adj = 1 - [0.20 × 99/94]
R²adj = 1 - [0.20 × 1.0532]
R²adj = 1 - 0.2106
R²adj = 0.7894

Why Adjusted R² < R²

From the formulas, we can see:

  • The factor (n - 1)/(n - p - 1) is always > 1 when p > 0
  • This multiplies the error term (1 - R²)
  • Therefore, adjusted R² always penalizes for additional predictors

When to Use Each

Use R² when:

  • Comparing models with the same number of predictors
  • Working with simple models
  • You want the raw explanatory power

Use Adjusted R² when:

  • Comparing models with different numbers of predictors
  • Concerned about overfitting
  • Need to account for model complexity
  • Feature selection decisions

The fundamental difference is that Adjusted R² includes a penalty for model complexity, making it more suitable for model selection in neural networks and machine learning where we often have many parameters.

In other words,

Regular R² (Coefficient of Determination)

The regular R² is the simpler, original formula:

R² = 1 - (SS_res / SS_tot)

Or equivalently:

R² = (SS_explained / SS_tot)

Where:

  • SS_res = Σ(y - ŷ)² = Sum of squared residuals (errors)
  • SS_tot = Σ(y - ȳ)² = Total sum of squares
  • SS_explained = SS_tot - SS_res = Variance explained by model

The Key Difference

Regular R² has a fundamental flaw: it ALWAYS increases when you add more variables, even if they're completely useless!

Example Showing the Problem:

Let's say you're predicting house prices:

Model | Variables | Regular R² | Adjusted R²
Model 1 | Square footage only | 0.70 | 0.69
Model 2 | + Number of bedrooms | 0.75 | 0.73
Model 3 | + Zip code | 0.80 | 0.77
Model 4 | + Owner's birthday | 0.81 | 0.76 ⬇️
Model 5 | + Favorite color | 0.82 | 0.74 ⬇️

Notice:

  • Regular R² keeps going up (0.70 → 0.82) even with nonsense variables
  • Adjusted R² starts decreasing (0.77 → 0.74) when we add useless variables!

Why Regular R² Always Increases

Mathematically, adding any variable (even random noise) gives the model more "flexibility" to fit the training data:

  • More parameters = more ways to reduce SS_res
  • SS_tot stays the same
  • Therefore, R² = 1 - (SS_res/SS_tot) can only stay the same or increase (and in practice it almost always increases)

The Adjusted R² Solution

Adjusted R² adds a penalty for each additional variable:

Adjusted R² = 1 - [(1-R²)(n-1)/(n-k-1)]

Breaking this down:

  • (1-R²) = Unexplained variance proportion
  • (n-1)/(n-k-1) = Penalty factor that increases with more predictors (k)
  • As k increases, the denominator (n-k-1) gets smaller, making the fraction larger
  • This increases the subtracted term, lowering Adjusted R²

Numerical Example

Dataset: 100 observations (n=100)

Model with 1 predictor (k=1):

  • R² = 0.60
  • Adjusted R² = 1 - [(1-0.60)(99)/(98)]
  • Adjusted R² = 1 - [0.40 × 1.0102]
  • Adjusted R² = 0.596 (barely different)

Model with 10 predictors (k=10):

  • R² = 0.65 (higher!)
  • Adjusted R² = 1 - [(1-0.65)(99)/(89)]
  • Adjusted R² = 1 - [0.35 × 1.112]
  • Adjusted R² = 0.611 (less impressive gain)

Model with 50 predictors (k=50):

  • R² = 0.80 (much higher!)
  • Adjusted R² = 1 - [(1-0.80)(99)/(49)]
  • Adjusted R² = 1 - [0.20 × 2.02]
  • Adjusted R² = 0.596 (actually WORSE than simpler model!)

When to Use Each

Use Regular R²:

  • Comparing models with the same number of predictors
  • Simple linear regression (one predictor)
  • When you want to know raw explanatory power
  • Academic requirement for basic reporting

Use Adjusted R²:

  • Comparing models with different numbers of predictors
  • Multiple regression (several predictors)
  • Model selection (choosing best model)
  • Preventing overfitting

Quick Decision Rule

def choose_r2_metric(num_predictors, comparing_models_with_different_predictor_counts):
    """Quick decision rule for which R² to report."""
    if num_predictors == 1:
        return "regular R²"     # with one predictor the two are nearly identical
    elif comparing_models_with_different_predictor_counts:
        return "adjusted R²"    # must account for model complexity
    else:
        return "report both"    # full transparency

The Bottom Line

  • Regular R²: Pure measure of variance explained, but naive about model complexity
  • Adjusted R²: Smarter measure that balances fit quality against model simplicity

Think of Regular R² as the raw score and Adjusted R² as the grade after a curve adjustment. Adjusted R² essentially asks: "Is this variable improving the model enough to justify making it more complex?" If not, Adjusted R² will decrease even though Regular R² increases!

Extra Reading:

Pearson's r (Pearson correlation coefficient) is a statistical measure that quantifies the linear relationship between two continuous variables. It tells you both the strength and direction of a linear association.

What It Measures

Pearson's r captures:

  • Direction: Whether variables move together (positive) or in opposite directions (negative)
  • Strength: How closely the points follow a straight line
  • Range: Always between -1 and +1

Interpretation

  • r = +1: Perfect positive linear relationship (as X increases, Y increases perfectly)
  • r = 0: No linear relationship (variables are linearly independent)
  • r = -1: Perfect negative linear relationship (as X increases, Y decreases perfectly)
  • |r| > 0.7: Generally considered strong correlation
  • 0.3 < |r| < 0.7: Moderate correlation
  • |r| < 0.3: Weak correlation

Mathematical Formula

r = Σ[(xi - x̄)(yi - ȳ)] / √[Σ(xi - x̄)² × Σ(yi - ȳ)²]

Or in a more intuitive form:

r = covariance(X,Y) / (std_dev(X) × std_dev(Y))

This is essentially the standardized covariance - it measures how variables vary together, normalized by their individual variations.
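
Both forms of the formula can be written in a few lines of NumPy (made-up data; the manual version should match np.corrcoef):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Direct formula: standardized covariance
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
r_manual = num / den

r_numpy = np.corrcoef(x, y)[0, 1]   # NumPy's built-in correlation

print(r_manual, r_numpy)            # same value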

Visual Examples

r ≈ +0.9:  •     • •
          • • • •
        • • •
      • •

r ≈ 0:    • • •
        •   •   •
          • • •
        •   •   •

r ≈ -0.9:  • •
            • • •
              • • • •
                • •

Key Limitations

  1. Only measures LINEAR relationships: Pearson's r can be near zero even with strong non-linear relationships (e.g., parabolic, exponential)

  2. Sensitive to outliers: A single extreme point can dramatically change r

  3. Correlation ≠ Causation: Even r = 0.99 doesn't mean one variable causes the other

  4. Assumes normal distribution: Most reliable when both variables are roughly normally distributed

Relationship to R²

  • In simple linear regression: R² = r²
  • R² tells you the proportion of variance explained
  • r tells you the direction and strength of the linear relationship
  • Example: r = -0.8 means strong negative correlation; R² = 0.64 means 64% of variance explained

Practical Example

Temperature vs Ice Cream Sales:

  • r ≈ +0.85: Strong positive correlation
  • As temperature rises, ice cream sales tend to increase
  • R² ≈ 0.72: Temperature explains about 72% of the variation in ice cream sales

Pearson's r is fundamental in statistics and machine learning, particularly for feature selection, understanding variable relationships, and as a building block for more complex analyses.

Observed Values (y)

  • The actual, true values from your dataset
  • The ground truth labels/targets you're trying to predict
  • What actually happened in reality
  • What you hope your model's output will match
  • Example: The actual house price of $500,000

Predicted Values (ŷ)

  • The actual output from your neural network
  • What the model thinks the value should be based on the input features
  • The result after forward propagation through all layers
  • Example: The neural network's prediction of $485,000 for that house

The Relationship

The whole point of training is to minimize the difference between these two:

  • Loss function measures the difference (e.g., MSE = mean of (y - ŷ)²)
  • Backpropagation adjusts weights to make predicted values closer to observed values
  • Perfect model would have predicted values = observed values (never happens in practice)

During training:

  • You feed inputs → neural network produces predicted values
  • You compare these to observed values from your training set
  • The difference drives the learning process

Observed values are what you're trying to match (hoping to see); predicted values are what your model produces.
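
For example, here is a minimal sketch (made-up arrays standing in for a batch of network outputs and their targets) showing the quantities the loss compares:

import numpy as np

y_observed  = np.array([500_000.0, 320_000.0, 410_000.0])   # ground-truth house prices
y_predicted = np.array([485_000.0, 335_000.0, 400_000.0])   # network outputs

residuals = y_observed - y_predicted
mse    = np.mean(residuals ** 2)   # the loss typically minimized during training
ss_res = np.sum(residuals ** 2)    # the same errors, summed (SS_res in R²)

print(mse, ss_res)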

Here are 5 interview questions about R² (R-squared) in neural networks and general machine learning:

Questions

1. Fundamental Understanding: What is R² (coefficient of determination) and how is it calculated? What does an R² value of 0.7 mean in practical terms?

2. Interpretation Challenges: Why can R² sometimes be negative when evaluating a model on test data? What does this indicate about model performance?

3. Neural Networks vs Linear Models: When using R² as a metric for neural network regression tasks, what are some key differences or considerations compared to using it with linear regression models?

4. Limitations and Alternatives: What are the main limitations of using R² as the sole evaluation metric for regression problems? What complementary metrics would you recommend using alongside R²?

5. Practical Scenario: You've trained a neural network for a regression task and obtained an R² of 0.95 on training data but 0.3 on validation data. What might be happening and how would you address it?


Answers

1. Fundamental Understanding: R² measures the proportion of variance in the dependent variable that's predictable from the independent variables. It's calculated as R² = 1 - (SS_res / SS_tot), where SS_res is the sum of squared residuals (errors) and SS_tot is the total sum of squares from the mean. An R² of 0.7 means the model explains 70% of the variance in the target variable, with the remaining 30% unexplained.

2. Interpretation Challenges: R² can be negative on test data when the model performs worse than a horizontal line at the mean of the test set. This happens when SS_res > SS_tot, typically indicating the model is making predictions that are systematically far from the actual values - often due to overfitting on training data or distribution shift between train and test sets.

3. Neural Networks vs Linear Models: Key considerations include: (a) Neural networks can capture non-linear relationships, potentially achieving higher R² than linear models on complex data; (b) R² in neural networks is more prone to overfitting due to high model capacity; (c) The interpretation is less straightforward since neural networks don't provide simple coefficient interpretations; (d) R² should be monitored on validation sets during training to detect overfitting early.

4. Limitations and Alternatives: Limitations: R² doesn't indicate whether predictions are biased, can be artificially inflated by adding parameters, doesn't show if the model violates assumptions, and can be misleading for non-linear relationships. Complementary metrics: MAE (Mean Absolute Error), RMSE (Root Mean Squared Error), MAPE (Mean Absolute Percentage Error), residual plots, and prediction interval coverage.

5. Practical Scenario: This is classic overfitting. The model has memorized the training data but fails to generalize. Solutions include: (a) Add regularization (L1/L2, dropout); (b) Reduce model complexity (fewer layers/neurons); (c) Increase training data or use data augmentation; (d) Implement early stopping based on validation R²; (e) Use cross-validation to better assess generalization; (f) Check for data leakage or distribution differences between train/validation sets.

