Today's Topics:
a. Regression Algorithms
b. Outliers - Explained in Simple Terms
c. Common Regression Metrics Explained
d. Overfitting and Underfitting
Regression Algorithms
Regression algorithms are a category of machine learning methods used to predict a continuous numerical value. Linear regression is a simple, powerful, and interpretable algorithm for this type of problem.
Quick Example: These are the scores of students vs. the hours they spent studying. Looking at this dataset of student scores and their corresponding study hours, can we determine what score someone might achieve after studying for a given number of hours? Example: From the graph, we can estimate that 4 hours of daily study would result in a score near 80. It is a simple example, but for more complex tasks the underlying concept will be similar. If you understand this graph, you will understand this blog.
Simple Linear Regression
Simple Linear Regression models the relationship between a single independent variable and a single dependent variable. The relationship is represented by a straight line, described by the formula Y = β₀ + β₁X + ϵ. The goal of the model is to find the best-fit line for the data.
The model has a few key components:
The same simple linear regression formula appears in several other forms; they are shown here so you won't be confused by the different notations you will see in further reading on the web.
High school math uses the notation a and b, or m and c:
Y = a + bX or Y = aX + b or y = mx + c
Machine learning books use w's (weights) instead:
Y = w₀ + w₁X
Statistics books use β's (betas):
Y = β₀ + β₁X
Let's break down each part using Y = β₀ + β₁X + ϵ
[ϵ (epsilon) represents the error term; β₀ is the constant, also called the y-intercept or bias.]
Dependent Variable (Y): This is the dependent variable—the value we are trying to predict (e.g., test score).
Independent Variable (X): This is the independent variable—the value we are using to make the prediction (e.g., hours studied).
The Slope or Weight (β₁): This is the slope of the line. It tells us how much Y is expected to change for every one-unit increase in X; in other words, it controls how much X contributes to Y. For instance, if β₁ is 5, it means for every extra hour of study, the test score is predicted to go up by 5 points.
The Intercept (β₀): This is the Y-intercept, the predicted value of Y when X is zero. In our example, it would be the predicted test score for a student who studied for 0 hours.
Error or ϵ (epsilon): This is the error term. It represents the difference between the actual observed value of Y and the value predicted by the line. No model is perfect, and this term accounts for the random variation or "noise" in the data.
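To see these pieces in one place, here is a minimal sketch in Python that fits a simple linear regression to a small, made-up study-hours dataset (the numbers are invented for illustration) and reads off β₀ and β₁:

```python
import numpy as np

# Made-up study-hours vs. test-score data, for illustration only
hours = np.array([1, 2, 3, 4, 5, 6])
scores = np.array([55, 62, 71, 79, 86, 94])

# np.polyfit with degree 1 returns the slope (beta_1) and intercept (beta_0)
beta1, beta0 = np.polyfit(hours, scores, deg=1)
print(f"Intercept (beta_0): {beta0:.2f}")   # predicted score at 0 hours
print(f"Slope (beta_1):     {beta1:.2f}")   # extra points per extra hour of study

# Predict the score for 4 hours of study: Y = beta_0 + beta_1 * X
print(f"Predicted score for 4 hours: {beta0 + beta1 * 4:.1f}")   # roughly 78-79
```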
Assumptions for Simple Linear Regression
For a simple linear regression model to be effective, it assumes:
Linear Relationship: A linear relationship exists between the variables.
Homoscedasticity: The variance of the residuals (errors) is constant across all levels of the independent variable.
Normality of Errors: The errors should be normally distributed around zero.
Uncorrelated Errors: There is no systematic pattern in the errors, meaning they are independent of each other.
No Multicollinearity: The independent variables should show little or no multicollinearity, meaning they are not highly correlated with each other (this matters once there is more than one predictor).
A linear relationship is a fundamental assumption in linear regression. It means that the dependent variable and each independent variable have a straight-line relationship. This is crucial because the linear regression model uses a straight line to make predictions.
This assumption implies that for every one-unit increase in the independent variable, the dependent variable changes by a constant amount. If this assumption is violated and the data shows a curved pattern, a simple straight-line model won't fit the data accurately.
For example, a correct application of linear regression would be on a dataset where the points roughly form a straight line. An incorrect application would be on data that forms a parabolic or exponential curve.
Visualizing the Relationship
This graph shows data with a linear relationship. A straight line is the most appropriate model to fit the trend of these data points.
This graph shows data with a non-linear relationship. A straight line would be a poor fit, and a linear regression model would not be an adequate choice.
Linear Relationship
Non-Linear Relationship
Homoscedasticity is a key assumption in linear regression, stating that the variance of the residuals (errors) is constant across all levels of the independent variable. This means the scatter of data points around the regression line is consistent and even throughout the range of the independent variable.
To visualize this, a residual plot is used, which graphs the residuals (the difference between the actual and predicted values) against the independent variable.
Homoscedasticity vs. Heteroscedasticity
Homoscedasticity: The residual plot shows a random, even scatter of points with no discernible pattern, forming a consistent band around the zero line. This indicates that the model's predictive accuracy is uniform across all data points.
Heteroscedasticity: The residual plot shows a clear pattern, such as a cone shape where the residuals get larger as the independent variable increases. This indicates that the model is less accurate for certain ranges of the independent variable.
The error term in a linear regression model should not have any signal and should resemble pure noise. When the errors are homoscedastic, the model is considered more reliable because this core assumption is met.
Normality of Errors is a key assumption in linear regression, stating that the errors (also called residuals) of the model are normally distributed. This means that when you plot the frequency of the errors, the distribution should form a bell curve centered around zero.
Visualizing Normality of Errors
The normal distribution of errors implies a few things:
Errors are concentrated around zero. This is a good sign, as it means most of the model's predictions are close to the actual values.
Large errors are rare. Both large positive and large negative errors should be infrequent.
Errors are symmetric. The frequency of positive errors should be similar to the frequency of negative errors.
If the errors are not normally distributed, it can indicate that the model is missing some important information or that the assumptions of linear regression have been violated. For example, if the error distribution is skewed or has multiple peaks, the model may not be adequate.
This graph shows the ideal scenario: the errors are concentrated around zero and follow a normal (bell-shaped) distribution.
Uncorrelated errors, also known as independence of errors, is a key assumption in linear regression. It means that the error for one data point is not related to the error of any other data point
Visualizing Uncorrelated Errors
A residual plot is used to check for uncorrelated errors. This plot shows the residuals on the y-axis and either the independent variable or the observation order on the x-axis.
Uncorrelated Errors: A plot with uncorrelated errors shows a random scatter of points with no discernible pattern or trend. The errors are essentially "pure noise" and don't contain any hidden signal
Correlated Errors: A plot with correlated errors would show a pattern, such as a wave-like or cyclical trend. This indicates that your model is missing some information or a variable that could explain the pattern.
If the error of one observation can predict the error of the next observation, the assumption is violated
No multicollinearity: No or little multicollinearity (All independent variables should correlate with the dependent variable but not with each other.)
The Problem:
Imagine trying to predict ice cream sales using both "temperature in Fahrenheit" and "temperature in Celsius" as inputs. These two variables are perfectly correlated—they're essentially the same information in different units. This creates multicollinearity, where independent variables are highly correlated with each other rather than just with the outcome.
Why It's Bad:
When variables are highly correlated, the model can't figure out which one is actually doing the work. It's like having two people take credit for the same task—you can't tell who really contributed. This makes the coefficients unreliable and unstable; small changes in data can wildly swing which variable appears important.
Real Example:
Predicting house prices using both "number of rooms" and "square footage"—these are highly related since bigger houses tend to have more rooms. The model struggles to separate their individual effects, leading to weird results like suggesting adding rooms decreases price (when really, the square footage variable already captured that information).
How to Detect:
Check correlation between your input variables—if two variables have correlation above 0.8 or 0.9, you likely have multicollinearity. The Variance Inflation Factor (VIF) formally measures this—VIF above 10 indicates serious problems.
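Here is a rough sketch of both checks in Python, using pandas and statsmodels on a small synthetic dataset where a "rooms" column is deliberately tied to a "sqft" column (the column names and numbers are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic features: 'rooms' is deliberately derived from 'sqft'
rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3000, 200)
rooms = sqft / 350 + rng.normal(0, 0.5, 200)   # strongly tied to square footage
age = rng.uniform(0, 50, 200)                  # unrelated to the other two
X = pd.DataFrame({"sqft": sqft, "rooms": rooms, "age": age})

# 1) Pairwise correlations: values above ~0.8-0.9 are a warning sign
print(X.corr().round(2))

# 2) Variance Inflation Factor: values above ~5-10 suggest multicollinearity
X_const = sm.add_constant(X)   # VIF is usually computed with an intercept column
for i, col in enumerate(X_const.columns[1:], start=1):
    print(col, round(variance_inflation_factor(X_const.values, i), 1))
```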
The Solution:
Remove one of the correlated variables (keep temperature in Fahrenheit OR Celsius, not both), combine them into a single feature (create a "house size" variable instead of rooms and square footage separately), or use regularization techniques like Ridge regression that handle correlated features better.
Bottom Line:
Your independent variables should be independent from each other, not just related to your outcome—otherwise, your model gets confused about what's really causing the effect.
Multiple Linear Regression
Multiple Linear Regression is used when you have more than one input variable to predict a single outcome. The formula is an extension of the simple linear regression equation to include multiple variables. The equation is represented as: Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ϵ.
In this model, each coefficient represents the change in the dependent variable for a one-unit change in its corresponding independent variable, while holding all other variables constant.
Challenges with Multiple Linear Regression
As the number of input variables increases, certain challenges arise:
Curse of Dimensionality: With more dimensions, data becomes sparser, making it harder to get accurate model estimates.
R-squared Adequacy: In multiple regression, R² is not the best metric because it will always increase when you add more inputs, even if they are irrelevant.
Adjusted R-squared: A better alternative to R-squared, as it accounts for the number of predictors and penalizes the inclusion of irrelevant ones.
Multicollinearity: This occurs when two or more independent variables are highly correlated with each other. It can make it difficult to interpret coefficients and can lead to inflated standard errors.
Addressing Multicollinearity
Variance Inflation Factor (VIF): A measure used to detect multicollinearity. A VIF value greater than 5 suggests potential problems, while a value over 10 indicates high multicollinearity.
Solutions: You can address multicollinearity by removing highly correlated variables, combining them, or using regularization techniques like Ridge (L2) or Lasso (L1) regression. These methods help to prevent model weights from becoming too large and reduce the impact of correlated features.
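As a sketch of the regularization route, the snippet below fits Ridge (L2) and Lasso (L1) on a small synthetic dataset with two deliberately correlated features; the data and alpha values are invented for illustration, not tuned recommendations:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic correlated features: square footage and a noisy room count
rng = np.random.default_rng(1)
sqft = rng.uniform(800, 3000, 200)
rooms = sqft / 350 + rng.normal(0, 0.5, 200)
price = 100 * sqft + rng.normal(0, 20_000, 200)   # price driven mostly by size
X = np.column_stack([sqft, rooms])

# Regularization keeps the weights small and stable even though the
# two inputs are highly correlated with each other.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, price)
lasso = make_pipeline(StandardScaler(), Lasso(alpha=100.0)).fit(X, price)
print("Ridge coefficients:", ridge[-1].coef_.round(0))
print("Lasso coefficients:", lasso[-1].coef_.round(0))
```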
Outliers - Explained in Simple Terms
An outlier is a data point that's significantly different from all the others—it's the "weird one" that doesn't fit the pattern.
On a graph, outliers show up as points lying far outside the general pattern of a scatter plot, or as distinct points beyond the "whiskers" of a box plot. Visualizing them this way helps flag data that may result from special causes or significant deviations from the expected trend, so they can be investigated further.
In a Classroom:
- Most students are 11-13 years old
- One student is 16 (got held back)
- That 16-year-old is an outlier
Test Scores:
- Most scores: 75, 82, 79, 85, 77
- One score: 23
- That 23 is an outlier (maybe they were sick?)
House Prices in Your Neighborhood:
- Most houses: $300k-400k
- Regular pattern of similar homes
- One mansion: $2 million
- The mansion is an outlier
Visual Example
Normal pattern: • • • • • • • •
With outlier:   • • • • • • • •                    • (way out here!)
Why Outliers Matter
They can mess up your predictions:
- Average height in your class: 5 feet
- Add one 6'5" basketball player
- Now average jumps to 5'2" (doesn't represent anyone well!)
In Our House Price Model:
- 10 normal houses: Your model learns the pattern
- 1 celebrity mansion for $10 million
- Model gets confused: "Should all big houses cost millions?"
Types of Outliers
Natural outliers: Real but rare
- Michael Jordan's height in a normal population
- Einstein's IQ score
- A 100-year-old tree in your yard
Error outliers: Mistakes in data
- Someone typed $3,000,000 instead of $300,000
- Scale showed 1,500 pounds (broken!)
- Test score of 150% (impossible!)
How to Spot Outliers
The "Wow, that's weird!" test: If you look at data and think "that can't be right!" - it's probably an outlier
Statistical rules:
- More than 3 standard deviations from average
- Outside the "whiskers" on a box plot
- Makes your RMSE much bigger than MAE
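A quick sketch of the first two rules in Python, reusing the test-score numbers from above (note that on a tiny sample the outlier itself inflates the standard deviation, so the 3-standard-deviation rule is more reliable on larger datasets):

```python
import numpy as np

scores = np.array([75, 82, 79, 85, 77, 23])   # the 23 looks suspicious

# Rule 1: flag points far from the mean in standard-deviation units (z-score)
z = (scores - scores.mean()) / scores.std()
print("z-scores:", z.round(2))   # the 23 stands out, though with only 6 points
                                 # it also inflates the std, muting its own z-score

# Rule 2: flag points outside the box-plot whiskers (1.5 x IQR beyond the quartiles)
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("Outliers by the IQR rule:", scores[(scores < low) | (scores > high)])   # [23]
```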
What to Do with Outliers
- Check if it's an error - Fix it if wrong
- Keep if real and important - That mansion is really there
- Remove if it's messing up patterns - One mansion shouldn't ruin predictions for normal houses
- Use special techniques - Some algorithms handle outliers better
The Birthday Party Example
You're tracking party attendance:
- Party 1: 15 kids
- Party 2: 12 kids
- Party 3: 18 kids
- Party 4: 200 kids (celebrity's kid!)
That 200 is an outlier - including it would make you think all parties need supplies for 61 kids on average!
Bottom Line: Outliers are the "oddballs" in your data that don't follow the normal pattern - sometimes they're important, sometimes they're errors, but they always need special attention!
Regression Loss Functions
Common Regression Metrics Explained with Real Examples for Evaluating the Model
Common metrics for evaluating a regression model's performance include:
Mean Absolute Error (MAE): The average of the absolute differences between actual and predicted values. It's easy to understand and not very sensitive to outliers.
Mean Squared Error (MSE): The average of the squared differences between actual and predicted values. Squaring penalizes larger errors more heavily, making MSE sensitive to outliers.
Root Mean Square Error (RMSE): The square root of the MSE. It keeps the heavier penalty on large errors but brings the result back to the original units.
R-squared (R²): This measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It's a standardized metric, typically between 0 and 1.
1. Mean Absolute Error (MAE)
Mean Absolute Error (MAE) is a metric used to measure the accuracy of a regression model. It calculates the average of the absolute differences between the actual values and the values predicted by the model.
Think of it as the average "magnitude" of the errors, without caring whether the prediction was too high or too low. A smaller MAE value indicates a better-fitting model. 📏
The Formula
The formula for Mean Absolute Error is:
MAE = (1/n) Σ |yᵢ - ŷᵢ|
Let's break down each part:
n: The total number of data points in your dataset.
∑: The summation symbol, which means you add everything that comes after it.
yᵢ: The actual value for the i-th data point.
ŷᵢ (pronounced "y-hat"): The predicted value for the i-th data point.
|yᵢ - ŷᵢ|: The absolute value of the difference between the actual and predicted value. This is the "error" for a single prediction, and taking the absolute value ensures it's always positive.
How to Calculate It (Step-by-Step)
Let's use a simple example. Imagine a model that predicts house prices.
| Actual Price (yᵢ) | Predicted Price (ŷᵢ) |
|---|---|
| $250,000 | $240,000 |
| $300,000 | $320,000 |
| $420,000 | $425,000 |
Calculate the Error for Each Prediction: Find the difference between the actual and predicted price for each house.
Error 1: 250,000 - 240,000 = 10,000
Error 2: 300,000 - 320,000 = -20,000
Error 3: 420,000 - 425,000 = -5,000
Take the Absolute Value of Each Error: This removes the negative signs, giving 10,000, 20,000, and 5,000.
Sum the Absolute Errors: 10,000 + 20,000 + 5,000 = 35,000.
Divide by the Number of Data Points: In this case, we have 3 data points, so MAE = 35,000 / 3 ≈ 11,667.
Interpretation: On average, the model's price predictions are off by about $11,667.
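The same three-house calculation takes only a few lines of Python; this is just the arithmetic above, written out:

```python
actual    = [250_000, 300_000, 420_000]
predicted = [240_000, 320_000, 425_000]

# Mean Absolute Error: average of |actual - predicted|
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]   # [10000, 20000, 5000]
mae = sum(abs_errors) / len(abs_errors)
print(f"MAE = ${mae:,.0f}")   # about $11,667
```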
Key Characteristics of MAE
Easy to Interpret: The result is in the same units as the target variable (e.g., dollars, temperature, etc.), making it very straightforward to understand.
Robust to Outliers: Because it doesn't square the errors, a single large error (an outlier) will have a less significant impact on the total MAE compared to other metrics like Mean Squared Error (MSE). Each error contributes to the total in direct proportion to its magnitude.
Another example:
Setting the Context: House Price Prediction Model
Let's say you built a model to predict house prices in your neighborhood. You test it on 5 houses where you know the actual selling prices, and the prediction errors come out to $10k, $30k, $20k, $10k, and $10k.
Now let's calculate each metric using these errors:
Average of absolute errors
- Ignore the signs: $10k, $30k, $20k, $10k, $10k
- MAE = ($10k + $30k + $20k + $10k + $10k) ÷ 5 = $16,000
- What it means: "On average, I'm off by $16,000 per house"
- Advantage: Easy to explain to clients
Calculating MAE - It's Just Finding the Average!
Step 1: Ignore whether you guessed too high or too low
- Just look at how far off you were: $10,000, $30,000, $20,000, $10,000, $10,000
Step 2: Add them up
- $10,000 + $30,000 + $20,000 + $10,000 + $10,000 = $80,000 total error
Step 3: Divide by how many guesses you made
- $80,000 ÷ 5 houses = $16,000
MAE = $16,000
What This Means
"On average, your guesses are $16,000 away from the real price"
It's like saying:
- If you guess your friend will make 7 free throws but they make 4, you're off by 3
- If you guess 5 and they make 8, you're off by 3
- Your average error is 3 shots
Why MAE is Cool
- It's fair: One really bad guess doesn't ruin everything
- It's simple: Just "how far off am I on average?"
- It makes sense: Everyone understands what "off by $16,000" means
If your MAE is small, you're a good guesser. If it's big, you need more practice!
2. Mean Squared Error (MSE)
Mean Squared Error (MSE) is one of the most common metrics used to measure the performance of a regression model. It calculates the average of the squared differences between the model's predictions and the actual values.
By squaring the errors, MSE places a much heavier penalty on larger errors. This means the model is encouraged to avoid making significant mistakes, as even one very wrong prediction can dramatically increase the MSE.
The Formula
The formula for Mean Squared Error is:
MSE = (1/n) Σ (yᵢ - ŷᵢ)²
Let's break it down:
n: The total number of data points.
∑: The summation symbol, telling us to add up the results for all data points.
yᵢ: The actual value for the i-th data point.
ŷᵢ: The predicted value for the i-th data point.
(yᵢ - ŷᵢ)²: The squared difference between the actual and predicted value. This is the key step that penalizes larger errors more severely.
How to Calculate It (Step-by-Step)
Let's use the same house price example to see how MSE is calculated.
| Actual Price (yᵢ) | Predicted Price (ŷᵢ) |
|---|---|
| $250,000 | $240,000 |
| $300,000 | $320,000 |
| $420,000 | $425,000 |
Calculate the Error for Each Prediction: Find the difference between actual and predicted prices.
Error 1: 250,000 - 240,000 = 10,000
Error 2: 300,000 - 320,000 = -20,000
Error 3: 420,000 - 425,000 = -5,000
Square Each Error: This is the crucial step. It makes all errors positive and heavily penalizes the larger ones: 100,000,000; 400,000,000; 25,000,000.
Sum the Squared Errors: 100,000,000 + 400,000,000 + 25,000,000 = 525,000,000.
Divide by the Number of Data Points: We have 3 data points, so MSE = 525,000,000 / 3 = 175,000,000.
Interpretation: The MSE is 175,000,000. Notice how the single error of 20,000 had a much larger impact on the final MSE compared to its impact on MAE.
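Again, the same three-house arithmetic in a few lines of Python:

```python
actual    = [250_000, 300_000, 420_000]
predicted = [240_000, 320_000, 425_000]

# Mean Squared Error: average of (actual - predicted)^2
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)
print(f"MSE = {mse:,.0f} (dollars squared)")   # 175,000,000
```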
Key Characteristics of MSE
Sensitive to Outliers: Because errors are squared, MSE is highly sensitive to outliers. A model that makes a few very large errors will have a much higher MSE than a model that makes many small errors.
Units are Squared: The result is in squared units (e.g., dollars squared), which is not directly interpretable in the context of the original data. This is a major drawback. To address this, people often take the square root of the MSE to get the Root Mean Squared Error (RMSE), which returns the error metric to the original units.
Useful for Optimization: The MSE function is smooth and differentiable, which makes it easier for optimization algorithms (like gradient descent) to find the model parameters that minimize the error.
Let's say you built a model to predict house prices in your neighborhood. You test it on 5 houses where you know the actual selling prices:
Now let's calculate each metric using these errors:
Calculation: Average of squared errors
- Square each: 100, 900, 400, 100, 100 (in millions)
- MSE = (100 + 900 + 400 + 100 + 100) ÷ 5 = 320 million dollars²
- What it means: Hard to interpret (what's a "squared dollar"?)
- Why use it: That $30k error contributes 900/1600 = 56% of total error!
Why Square the Numbers?
Think of it like this:
- Being off by $10,000 = Small mistake
- Being off by $30,000 = BIG mistake
When you square (multiply by itself):
- Small mistakes stay relatively small
- Big mistakes become HUGE
The Math
- Add up all the "punishment points": 100 + 900 + 400 + 100 + 100 = 1,600 million
- Divide by 5 houses: 1,600 ÷ 5 = 320 million
MSE = 320 million "squared dollars"
The Video Game Analogy
Imagine a game where:
- Missing by 1 point = lose 1 life
- Missing by 2 points = lose 4 lives
- Missing by 3 points = lose 9 lives
- Missing by 10 points = lose 100 lives!
MSE is harsh on big mistakes!
Why This is Weird but Useful
The weird part:
- What's a "squared dollar"? (Nobody knows!)
- The number (320 million) seems meaningless
The useful part:
- That one $30,000 mistake contributes 900 out of 1,600 total points (56% of all error!)
- It screams: "FIX YOUR BIG MISTAKES FIRST!"
Real-Life Example
If you're baking cookies:
- 1 minute overcooked = slightly crispy (small error)
- 10 minutes overcooked = burnt black (HUGE error)
MSE says the burnt batch is way worse than 10 slightly crispy batches!
MSE vs MAE
- MAE: "You're off by $16,000 on average" (treats all errors equally)
- MSE: "That $30,000 mistake is ruining your score!" (punishes big errors hard)
Computers love MSE because the math works better, but humans prefer MAE because we understand what "$16,000 error" means!
3. Root Mean Square Error (RMSE)
Root Mean Squared Error (RMSE) is a very popular metric for evaluating regression models. It's the square root of the Mean Squared Error (MSE). The main reason for using RMSE is to convert the error metric back into the same units as the original target variable, which makes it much easier to interpret.
Essentially, RMSE represents the standard deviation of the prediction errors (residuals). It tells you how concentrated the data is around the line of best fit. A smaller RMSE indicates a better model.
The Formula
The formula for Root Mean Squared Error is simply the square root of the MSE formula:
RMSE = √[(1/n) Σ (yᵢ - ŷᵢ)²]
The components are the same as in MSE:
n: The total number of data points.
∑: The summation symbol.
yᵢ: The actual value.
ŷᵢ: The predicted value.
(yᵢ - ŷᵢ)²: The squared difference between the actual and predicted value.
How to Calculate It (Step-by-Step)
Calculating RMSE is just one extra step after calculating MSE. Let's use our house price example where we already found the MSE.
Calculate the MSE: From the previous example, we calculated the MSE to be 175,000,000 (dollars squared).
Take the Square Root of the MSE: This is the final step to get the RMSE: RMSE = √175,000,000 ≈ 13,229.
Interpretation: The RMSE is approximately $13,229. Since this is in the same unit as our target (dollars), we can interpret this result as: the typical or standard deviation of the model's prediction error is about $13,229.
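And the same three-house example in Python, with the extra square-root step at the end:

```python
import math

actual    = [250_000, 300_000, 420_000]
predicted = [240_000, 320_000, 425_000]

mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
rmse = math.sqrt(mse)          # square root brings the units back to dollars
print(f"RMSE = ${rmse:,.0f}")  # about $13,229
```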
Key Characteristics of RMSE
Interpretable Units: This is the biggest advantage over MSE. Knowing the model's typical error is $13,229 is much more useful than knowing the squared error is 175,000,000.
Sensitive to Outliers: Just like MSE, RMSE is sensitive to outliers because the errors are squared before being averaged. A few large errors can disproportionately increase the RMSE.
Comparison with MAE: RMSE will always be greater than or equal to MAE. The greater the difference between them, the greater the variance in the individual errors in your sample. This indicates that your model is making a few very large errors.
Another example - Let's say you built a model to predict house prices in your neighborhood. You test it on 5 houses where you know the actual selling prices:
Now let's calculate each metric using these errors:
Calculation: Square root of MSE
- RMSE = √320 million = $17,889
- What it means: "Typical error is about $17,889"
- Key insight: RMSE ($17,889) > MAE ($16,000) tells us we have some bigger errors pulling it up
The Magic of the Square Root
RMSE = √MSE = √320 million = about $17,889
It's like unwrapping a present:
- MSE wrapped our error in confusing "squared" paper
- RMSE unwraps it back into regular dollars
- Which brings us back to the house price example
The Report Card Analogy
Imagine your test scores:
- Most tests: 85, 87, 84, 86 (pretty consistent)
- One bad test: 60 (oops!)
MAE says: "Average of 5 points off" (treats the 60 like any other score)
RMSE says: "Typically 7 points off" (that 60 pulls it up more)
Why RMSE is Higher Than MAE
Notice: RMSE ($17,889) > MAE ($16,000)
This tells us: "You have some bigger mistakes hiding in there!"
- If all mistakes were exactly $16,000, RMSE would equal MAE
- Since RMSE is higher, that $30,000 mistake is pulling it up
The Weather Forecast Example
Your weather app predicts temperature:
- Monday: Off by 2°F
- Tuesday: Off by 3°F
- Wednesday: Off by 2°F
- Thursday: Off by 10°F (big miss!)
- Friday: Off by 3°F
MAE: "We're off by 4 degrees on average" RMSE: "We're typically off by 5 degrees"
RMSE is saying: "Watch out, we sometimes mess up badly!"
RMSE in Simple Words
Think of it as the "typical mistake" you make:
- Not the average (that's MAE)
- But what you'd expect on a normal guess
- Big mistakes make it worse
The Birthday Party Analogy
You're guessing how many kids will come to parties:
Party 1: Guessed 20, came 22 (off by 2)
Party 2: Guessed 15, came 14 (off by 1)
Party 3: Guessed 25, came 35 (off by 10!)
- MAE: "Off by 4.3 kids on average"
- RMSE: "Typically off by 6 kids"
RMSE warns you: "Sometimes you're way off!"
Why Use RMSE?
- Back to normal units (dollars, not squared dollars)
- Still punishes big mistakes (like MSE)
- Easy to explain ("typically off by $17,889")
- Alerts you to outliers (when it's much bigger than MAE)
It's the Goldilocks metric - not too simple (MAE), not too weird (MSE), but just right!
4. R-squared (R²)
R-squared (R²), the Coefficient of Determination
R-squared (R²) is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). In simpler terms, it tells you how well your model's predictions fit the actual data.
It provides a score between 0 and 1, which is often expressed as a percentage.
An R² of 1 (or 100%) means the model perfectly explains all the variability in the data.
An R² of 0 means the model explains none of the variability.
The Formula
To understand the R-squared formula, you first need to understand two key components:
Sum of Squares of Residuals (SS_res): This is the sum of the squared errors of your model's predictions. It represents the variation that your model cannot explain.
SS_res = Σ (yᵢ - ŷᵢ)² (where yᵢ is the actual value and ŷᵢ is the predicted value)
Total Sum of Squares (SS_tot): This is the sum of the squared differences between each actual data point and the mean of all data points. It represents the total variation in the data. If you were to just guess the average value every time, this would be your error.
SS_tot = Σ (yᵢ - ȳ)² (where ȳ is the mean of the actual values)
The R-squared formula combines these two values:
R² = 1 - (SS_res / SS_tot)
Essentially, it calculates the fraction of total variance that your model fails to explain (SS_res / SS_tot) and subtracts that from 1. The result is the fraction of variance that your model does explain.
How to Interpret It
Interpretation: An R-squared value of 0.819 means that 81.9% of the variation in house prices can be explained by your model (e.g., based on square footage, location, etc.). The remaining 18.1% is due to other factors not included in the model.
Key Characteristics of R-squared
Goodness of Fit: It's an intuitive measure of how well the regression line approximates the actual data points. A higher R² suggests a better fit.
A Major Flaw: R-squared always increases or stays the same when you add more independent variables to your model, even if those variables have no real relationship with the outcome. This can be misleading.
Adjusted R-squared: To overcome this flaw, especially in multiple regression, Adjusted R-squared is used. It penalizes the score for adding variables that don't improve the model's explanatory power.
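Here is a small sketch that computes R² and Adjusted R² from scratch on made-up actual and predicted prices; the values, and the assumed number of predictors (p = 2), are for illustration only:

```python
import numpy as np

# Made-up actual and predicted house prices, for illustration only
y_actual = np.array([250_000, 300_000, 420_000, 380_000, 310_000])
y_pred   = np.array([240_000, 320_000, 425_000, 370_000, 300_000])

ss_res = np.sum((y_actual - y_pred) ** 2)            # variation the model misses
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)   # total variation in the data
r2 = 1 - ss_res / ss_tot
print(f"R-squared: {r2:.3f}")

# Adjusted R-squared penalizes extra predictors (assume p = 2 features here)
n, p = len(y_actual), 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"Adjusted R-squared: {adj_r2:.3f}")
```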
Different calculation - measures explained variation:
- If house prices in your area range from $200k to $600k (huge variation)
- Your model predictions are mostly within $10-20k (small errors)
- R² = 0.92 means "I explain 92% of why house prices differ"
- Scale: 0 = useless model, 1 = perfect predictions
R² is like a report card for your prediction model - it tells you what percentage of the test you got right!
The Dart Board Analogy
Imagine all house prices are scattered like darts on a huge board:
- Cheap houses ($200k) at the bottom
- Expensive houses ($600k) at the top
- Most houses somewhere in the middle
Without any guessing model: The darts are ALL OVER the place! With your model: Most darts cluster near where you predicted!
What R² Actually Measures
"What percentage of the messiness can I explain?"
The Test Score Example
Your class takes a test. Scores range from 50 to 95:
- Smart Susan: 95
- Average Alex: 75
- Struggling Sam: 50
Question: Why do scores vary so much?
Your model says: "I studied their hours of homework!"
- If R² = 0.80: "Homework explains 80% of why scores are different"
- The other 20%? Maybe test anxiety, breakfast, natural talent...
Back to Our House Prices
Total variation: Houses range from $200k to $600k (huge spread!) Your predictions: Mostly within $10-20k of actual price
If R² = 0.92, you're saying:
- "I can explain 92% of why some houses cost more than others"
- Only 8% is still a mystery!
The Scale: 0 to 1 (or 0% to 100%)
The Weather Prediction Game
If you predict tomorrow's temperature:
- R² = 0.20: You're barely better than guessing randomly (like saying "75°F" every day)
- R² = 0.75: You're pretty good! You understand weather patterns
- R² = 0.95: You're almost as good as professional meteorologists!
The Birthday Cake Analogy
Imagine explaining why some birthday cakes taste better:
R² = 0.85 means:
- 85% explained by: amount of sugar, baking time, fresh ingredients
- 15% unexplained: maybe the baker's mood, secret ingredients, luck!
Why R² is Different from MAE/RMSE
- MAE/RMSE: "How
wrong are my guesses?"
- R²: "How
much better am I than just guessing the average?"
It's like:
- MAE
says: "You missed the basketball hoop by 2 feet"
- R²
says: "You make 75% of your shots"
The Video Game Score
Think of it as your accuracy percentage in a game:
- R² = 0.40: You only hit 40% of targets (need practice!)
- R² = 0.90: You're hitting 90% of targets (pro level!)
Red Flags
- R² = 0.10: Your model barely works - like guessing randomly
- R² = 1.00: Too perfect! You might be cheating (overfitting)
- R² = 0.70: Pretty good for real-world predictions!
In Simple Words
R² answers: "On a scale of 0 to 100, how good are my predictions?"
- 0 = Terrible (might as well flip a coin)
- 50 = Okay (half right, half wrong)
- 90 = Amazing (you really understand the pattern!)
It's your model's grade on the "prediction test"!
Weather Forecasting:
- MAE = 3°F → "Off by 3 degrees on average"
- RMSE = 4°F → "Typical error is 4 degrees"
- R² = 0.75 → "Captures 75% of temperature variation"
Student Grade Prediction:
- MAE = 5 points → "Average error of 5 points"
- RMSE = 7 points → "Typically off by 7 points"
- R² = 0.65 → "Explains 65% of grade differences"
Which Metric to Choose?
- Presenting to boss: Use MAE ("We're off by $16k on average")
- Training models: Use MSE/RMSE (mathematics prefers squares)
- Comparing models: Use R² (standardized 0-1 scale)
- Checking for outliers: Compare RMSE vs MAE (bigger gap = more outliers)
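If you use scikit-learn, each of these metrics is one function call away; the sketch below runs them on made-up actual and predicted values purely for illustration:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual vs. predicted values, for illustration only
y_true = np.array([250_000, 300_000, 420_000, 380_000, 310_000])
y_pred = np.array([240_000, 320_000, 425_000, 370_000, 300_000])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

print(f"MAE:  ${mae:,.0f}")    # easy to explain
print(f"RMSE: ${rmse:,.0f}")   # punishes big misses; compare against MAE
print(f"R^2:  {r2:.2f}")       # standardized 0-1 scale
```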
Overfitting and Underfitting in Linear Regression - Explained Simply
The Three Bears of Model Fitting
Just like Goldilocks found porridge that was too hot, too cold, and just right, regression models can be too simple, too complex, or just right.
Underfitting - The Model That's Too Simple
What it is: Your model is too basic to capture the real pattern in the data.
House Price Example:
- Reality: Price depends on size, location, age, condition, schools nearby
- Your model: "All houses cost $300,000" (ignoring everything!)
- Problem: Misses obvious patterns
Visual Example:
Data shows curve: • •
• • •
• •
Your line: _______________
(straight through curved data)
School Grade Analogy: Predicting test scores by saying "everyone gets 75%" - ignores that studying actually matters!
Signs of Underfitting:
- Low R² on training data (can't even fit what it sees)
- High errors everywhere
- Predictions barely better than guessing average
Overfitting - The Model That's Too Perfect
What it is: Your model memorizes the training data instead of learning the pattern.
House Price Example:
- Your model memorizes: "Blue house = $321,456, Red house = $267,892"
- New green house appears: Model panics! "I've never seen green!"
- Problem: Can't handle new situations
The Test Memorization Analogy:
- You memorize answers: "Question 1 = B, Question 2 = C"
- Teacher changes question order: You fail!
- You memorized answers, not how to solve problems
Visual Example:
Training: • Perfect fit through
• Every single point
• (wiggly line)
•
New data: ○ But misses all
○ the new points!
○
Just Right - Good Fit
What it is: Model captures the true pattern without memorizing noise.
Characteristics:
- Good R² on training data
- Similar performance on new data
- Smooth, sensible predictions
The Birthday Party Planning Example
Underfitting: "Every party needs 20 pizzas" (ignoring guest count)
Overfitting: "Tommy's party with 12 kids needed 8 pizzas at 3:30pm on a Saturday when it was 72°F and his mom wore blue, so EXACTLY those conditions = 8 pizzas"
Good fit: "About 1 pizza per 3 kids, plus 2 extra"
Real-World Consequences
Underfitting in Medical Diagnosis:
- Model: "Everyone probably has the flu"
- Misses serious conditions
- Dangerous oversimplification
Overfitting in Stock Prediction:
- Model memorizes last year perfectly
- This year: Market changes, model fails
- Loses money on "sure thing" predictions
How to Detect
| Metric | Underfitting | Good Fit | Overfitting |
|---|---|---|---|
| Training R² | Low (0.3) | High (0.85) | Perfect (0.99) |
| Test R² | Low (0.3) | Similar (0.82) | Much lower (0.4) |
| Training Error | High | Low | Nearly zero |
| Test Error | High | Low | High |
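One way to see this table in action is to compare training and test R² for models of different flexibility; the sketch below uses a made-up noisy quadratic dataset, and the polynomial degrees are chosen only for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Made-up noisy quadratic data, for illustration only
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, (20, 1))
y = X[:, 0] ** 2 + rng.normal(0, 1.0, 20)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 2, 15):   # too simple, about right, far too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree {degree:>2}: "
          f"train R^2 = {model.score(X_train, y_train):.2f}, "
          f"test R^2 = {model.score(X_test, y_test):.2f}")
# Expect the degree-1 model to score poorly on both sets (underfitting) and the
# degree-15 model to score near-perfect on training data but worse on test data.
```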
The Study Guide Analogy
Underfitting Student:
- Studies nothing
- "Math is just adding numbers"
- Fails both homework and test
Overfitting Student:
- Memorizes exact homework problems
- Can't solve slightly different test questions
- Aces homework, fails test
Good Student:
- Understands concepts
- Practices various problems
- Does well on both homework and test
Fixing the Problems
Underfitting Solutions:
- Add more features (consider location, not just size)
- Use polynomial terms (allow curves)
- Get better data
Overfitting Solutions:
- Simplify model (remove unnecessary features)
- Get more training data
- Use regularization (penalize complexity)
- Cross-validation (test on held-out data)
The Key Insight
Your model should learn patterns, not memorize examples. It's like learning to ride any bike, not just memorizing how to ride your specific blue bike in your driveway!
How Linear and Non-Linear Regression Algorithms Are Used in Neural Networks [these will be covered in detail in future material]
1) Linear Regression in Neural Networks
Linear regression appears in neural networks in several key ways:
As the Basic Building Block
Every neuron in a neural network starts with a linear operation - it computes a weighted sum of its inputs plus a bias term. This is essentially linear regression:
z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
In Output Layers for Regression Tasks
When solving regression problems (predicting continuous values), the output layer often uses pure linear regression without any activation function. For example, if you're predicting house prices or temperature, the final layer might just output the linear combination directly.
Single-Layer Networks
A neural network with no hidden layers and no activation function is literally just linear regression. This is sometimes called a "linear neural network" or perceptron without activation.
During Gradient Descent
The weight updates during training rely on gradients, which are local linear approximations of the loss function around the current weights.
2) Non-Linear (Complex) Regression in Neural Networks
Non-linear regression is what gives neural networks their power:
Activation Functions Create Non-Linearity
After the linear operation in each neuron, an activation function (like ReLU, sigmoid, or tanh) introduces non-linearity:
output = activation(w₁x₁ + w₂x₂ + ... + wₙxₙ + b)
This transforms the linear regression into non-linear regression.
Deep Networks = Complex Non-Linear Functions
When you stack multiple layers with activation functions, you create increasingly complex non-linear regression models. Each layer builds upon the previous one's non-linear transformations, allowing the network to approximate virtually any continuous function (universal approximation theorem).
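A tiny numpy sketch of this idea: one hidden layer that does a linear step followed by ReLU, then a pure linear output layer for a regression task. The weights and inputs are arbitrary numbers chosen just to show the mechanics:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# One input sample with 3 features (arbitrary values for illustration)
x = np.array([1.0, 2.0, 0.5])

# Hidden layer: linear step (weighted sum + bias), then a non-linear activation
W1 = np.array([[0.2, -0.4, 0.1],
               [0.7,  0.3, -0.5]])
b1 = np.array([0.1, -0.2])
h = relu(W1 @ x + b1)        # without relu(), this would still be purely linear

# Output layer for regression: a plain linear combination, no activation
w2 = np.array([1.5, -0.8])
b2 = 0.3
y_hat = w2 @ h + b2
print("prediction:", round(float(y_hat), 3))
```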
Common Scenarios for Non-Linear Regression:
- Classification problems: Use sigmoid or softmax activations to map linear outputs to probabilities
- Computer vision: Convolutional layers with ReLU create hierarchical non-linear feature extractors
- Natural language processing: Transformer architectures use non-linear attention mechanisms
- Complex pattern recognition: Any task where the relationship between inputs and outputs isn't linear (which is most real-world problems)
Key Insight
The power of neural networks comes from combining both:
- Linear operations provide the mathematical framework for learning through gradient descent
- Non-linear activations allow the network to learn complex patterns
Without activation functions, even a deep neural network would only be able to learn linear relationships (multiple linear transformations still yield a linear transformation). The non-linearity is essential for neural networks to solve complex problems like image recognition, language understanding, or game playing.
In practice, modern neural networks almost always use non-linear regression, except potentially in the final output layer for specific regression tasks. The choice between linear and non-linear depends on your problem - if you're modeling a truly linear relationship, a simple linear output might suffice, but most real-world phenomena require the non-linear capabilities that neural networks provide.
Interview Questions (More later):
Here are 5 interview questions about R² (R-squared) in neural networks and general machine learning:
Questions
1. Fundamental Understanding What is R² (coefficient of determination) and how is it calculated? What does an R² value of 0.7 mean in practical terms?
2. Interpretation Challenges Why can R² sometimes be negative when evaluating a model on test data? What does this indicate about model performance?
3. Neural Networks vs Linear Models When using R² as a metric for neural network regression tasks, what are some key differences or considerations compared to using it with linear regression models?
4. Limitations and Alternatives What are the main limitations of using R² as the sole evaluation metric for regression problems? What complementary metrics would you recommend using alongside R²?
5. Practical Scenario You've trained a neural network for a regression task and obtained an R² of 0.95 on training data but 0.3 on validation data. What might be happening and how would you address it?
Answers
1. Fundamental Understanding R² measures the proportion of variance in the dependent variable that's predictable from the independent variables. It's calculated as: R² = 1 - (SS_res / SS_tot), where SS_res is the sum of squared residuals and SS_tot is the total sum of squares. An R² of 0.7 means the model explains 70% of the variance in the target variable, with the remaining 30% unexplained.
2. Interpretation Challenges R² can be negative on test data when the model performs worse than a horizontal line at the mean of the test set. This happens when SS_res > SS_tot, typically indicating the model is making predictions that are systematically far from the actual values - often due to overfitting on training data or distribution shift between train and test sets.
3. Neural Networks vs Linear Models Key considerations include: (a) Neural networks can capture non-linear relationships, potentially achieving higher R² than linear models on complex data; (b) R² in neural networks is more prone to overfitting due to high model capacity; (c) The interpretation is less straightforward since neural networks don't provide simple coefficient interpretations; (d) R² should be monitored on validation sets during training to detect overfitting early.
4. Limitations and Alternatives Limitations: R² doesn't indicate whether predictions are biased, can be artificially inflated by adding parameters, doesn't show if the model violates assumptions, and can be misleading for non-linear relationships. Complementary metrics: MAE (Mean Absolute Error), RMSE (Root Mean Squared Error), MAPE (Mean Absolute Percentage Error), residual plots, and prediction interval coverage.
5. Practical Scenario This is classic overfitting. The model has memorized training data but fails to generalize. Solutions include: (a) Add regularization (L1/L2, dropout); (b) Reduce model complexity (fewer layers/neurons); (c) Increase training data or use data augmentation; (d) Implement early stopping based on validation R²; (e) Use cross-validation to better assess generalization; (f) Check for data leakage or distribution differences between train/validation sets.
Here is another practice question along with a detailed answer.
The Question
Which of the following is NOT a common way to evaluate the performance of a linear regression model?
Computing the adjusted R-squared
Plotting the residuals against the independent variables.
Performing a correlation test on the coefficients.
Computing the mean square error of the residuals
The Answer
The correct answer is "Performing a correlation test on the coefficients."
Explanation 💡
This is not a common way to evaluate the performance of a linear regression model. Here's a breakdown of why:
Why That's the Correct Answer
While the relationship between coefficients is important, we don't evaluate it with a "correlation test" as a measure of model performance.
The issue of highly correlated coefficients is called multicollinearity. It's a diagnostic problem that can make the model's coefficient estimates unstable and difficult to interpret. We typically check for it using a metric called the Variance Inflation Factor (VIF), not a simple correlation test. More importantly, this is a check on the model's stability and interpretability, not a direct measure of its predictive accuracy or overall fit.
Why the Other Options Are Common Evaluation Methods
Computing the adjusted R-squared: This is a fundamental method. Adjusted R² tells you the percentage of variance in the outcome that the model explains, while also penalizing the model for having too many predictors. It's a key measure of a model's goodness-of-fit.
Plotting the residuals against the independent variables: This is a crucial diagnostic step. A residual plot is used to visually check the assumptions of the linear model, such as linearity and constant variance of errors (homoscedasticity). Patterns in this plot indicate problems with the model's fit.
Computing the mean square error of the residuals: This is one of the most common metrics for performance. The Mean Square Error (MSE), or its square root (RMSE), directly measures the average size of the model's prediction errors. A lower MSE indicates a more accurate model.
Linear regression can model nonlinear relationships, and the key technique to do this is called feature engineering.
Specifically, you can create new, non-linear features from your existing ones. The most common example is Polynomial Regression, where if you have a feature x, you create new features like x², x³, etc.
The model is still considered "linear" because the resulting equation is linear in its coefficients. The algorithm simply treats x and x² as two independent variables to find the best-fitting line, which in this case, is a curve.
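A minimal sketch of this idea with scikit-learn, on made-up data generated from a quadratic curve; PolynomialFeatures builds the x and x² columns, and ordinary linear regression does the rest:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Made-up curved data: y depends on x and x^2 (for illustration only)
rng = np.random.default_rng(7)
x = rng.uniform(-3, 3, (100, 1))
y = 2 * x[:, 0] ** 2 + x[:, 0] + rng.normal(0, 1, 100)

# PolynomialFeatures turns x into [x, x^2]; the model stays linear in its
# coefficients, so plain linear regression can fit the curve.
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())
model.fit(x, y)
print("coefficients:", model[-1].coef_.round(2))   # roughly [1, 2]
print("intercept:   ", round(float(model[-1].intercept_), 2))
```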
True.
That statement is the exact definition of the slope coefficient. It quantifies the relationship between the two variables.
For every one-unit increase in the independent variable (the input, x), the model predicts that the dependent variable (the output, y) will change by the amount of the slope.
For example, in a model predicting Exam Score from Hours Studied, a slope of 5 means that for every additional hour of study, the student's exam score is expected to increase by five points.
What does "inference" mean in AI?
False.
The statement is partially correct but contains a critical error.
Linear regression assumes the errors are normally distributed, not binomially distributed. A binomial distribution is for discrete outcomes, like coin flips, whereas regression errors are continuous.
The second part of the statement is correct: the model does assume that the errors have constant variance, a property known as homoscedasticity. Both of these assumptions are required for valid statistical inference.
More reading:
What is Binomial Probability?
Binomial probability describes the likelihood of getting exactly k successes in n independent trials, where each trial has only two possible outcomes (success/failure) with constant probability.
Key Conditions for Binomial Distribution:
- Fixed number of trials (n)
- Two outcomes per trial (success/failure)
- Independent trials
- Constant probability (p) of success
The Binomial Formula:
P(X = k) = C(n,k) × p^k × (1-p)^(n-k)
Where:
- n = number of trials
- k = number of successes
- p = probability of success
- C(n,k) = combinations = n!/(k!(n-k)!)
Now, the normal distribution is a way of describing how data points are spread out. It's the most common distribution in all of statistics and is often called the "bell curve" because of its shape.
In a normal distribution, most of the data clusters around the average, and the further a value is from the average, the less likely it is to occur.
A perfect real-world example is the height of people. Most people are of average height, which forms the peak of the bell. Very tall and very short people are rare, and they make up the "tails" of the bell on either side.
Key Characteristics 📊
Every normal distribution can be described by just two numbers:
Mean (μ): This is the average of the data and represents the center or peak of the bell curve.
Standard Deviation (σ): This measures the spread of the data.
A small standard deviation means the data is tightly clustered around the mean, resulting in a tall and skinny bell curve.
A large standard deviation means the data is spread out, resulting in a short and wide bell curve.
In a perfect normal distribution, the mean, median, and mode are all the same value.
The Empirical Rule (68-95-99.7 Rule)
This is a very useful rule of thumb that tells you how your data is distributed within the curve:
68% of the data falls within 1 standard deviation of the mean.
95% of the data falls within 2 standard deviations of the mean.
99.7% of the data falls within 3 standard deviations of the mean.
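You can check the 68-95-99.7 rule yourself with a quick simulation; the mean and standard deviation below (100 and 15, IQ-like numbers) are arbitrary choices for illustration:

```python
import numpy as np

# Draw a large sample from a normal distribution and check the empirical rule
rng = np.random.default_rng(0)
data = rng.normal(loc=100, scale=15, size=1_000_000)

mu, sigma = data.mean(), data.std()
for k in (1, 2, 3):
    within = np.mean(np.abs(data - mu) <= k * sigma) * 100
    print(f"within {k} standard deviation(s): {within:.1f}%")
# Output should land close to 68.3%, 95.4%, and 99.7%
```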
Why is it So Important? 🧠
The normal distribution is a cornerstone of statistics for two main reasons:
It's Everywhere in Nature: Many natural phenomena, from blood pressure and IQ scores to measurement errors in experiments, tend to follow a normal distribution.
The Central Limit Theorem: This powerful theorem states that if you take a large number of samples from any distribution (even a weird, non-bell-shaped one) and calculate their averages, the distribution of those averages will form a normal distribution. This allows statisticians to make reliable inferences about populations.
Is this a correct statement: the coefficients of a linear regression model can only be found by inverting a matrix?
No, that statement is not correct.
While the matrix inversion method is one way to find the coefficients, it is not the only way. A more common and often more efficient method, especially for large datasets, is Gradient Descent.
Two Main Approaches 💡
There are two primary strategies for finding the best coefficients in a linear regression model: a direct analytical solution and an iterative optimization approach.
1. The Direct Method: Matrix Inversion
This method, also known as the Normal Equation, uses linear algebra to solve for the optimal coefficients in a single, direct calculation. It's an elegant, analytical solution that gives you the exact answer.
However, the process of inverting a matrix is computationally very expensive, especially for datasets with a large number of features. This makes it impractical for many real-world machine learning problems.
2. The Iterative Method: Gradient Descent
This is an optimization algorithm that finds the best coefficients through an iterative process. It works like this:
It starts with random guesses for the coefficients.
It calculates the model's error (or "cost").
It then "nudges" the coefficients in the direction that will most reduce the error.
It repeats steps 2 and 3 until the error is minimized.
Think of it like walking down a hill blindfolded; you take small steps in the steepest downward direction until you reach the bottom. Gradient Descent is much more scalable and is the foundational optimization algorithm for training not only linear regression models but also complex neural networks.
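Here is a small sketch showing both approaches on the same made-up dataset (true intercept 4, true slope 3); the learning rate and iteration count are illustrative choices, not tuned values:

```python
import numpy as np

# Made-up data: y = 4 + 3x plus noise, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 100)
y = 4 + 3 * x + rng.normal(0, 0.5, 100)
X = np.column_stack([np.ones_like(x), x])   # column of 1s handles the intercept

# 1) Direct method (Normal Equation): beta = (X^T X)^-1 X^T y
beta_direct = np.linalg.inv(X.T @ X) @ X.T @ y
print("normal equation: ", beta_direct.round(2))

# 2) Iterative method (Gradient Descent on the MSE cost)
beta = np.zeros(2)            # start from an initial guess
lr = 0.1                      # learning rate (step size)
for _ in range(2000):
    gradient = (2 / len(y)) * X.T @ (X @ beta - y)
    beta -= lr * gradient     # nudge the coefficients downhill
print("gradient descent:", beta.round(2))
# Both should land close to [4, 3]
```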