
Classification Metrics - Confusion Matrix, Precision, Recall to ROC Curves


Topics:

A. Classification Metrics

B. Class Imbalance


A. Classification Metrics

Precision and recall are key metrics used to evaluate a machine learning model's performance, calculated using a confusion matrix. Precision measures the ratio of correctly predicted positive observations to the total number of positive predictions, answering "Of all the times the model predicted 'yes,' how often was it correct?". Recall measures the ratio of correctly predicted positive observations to all actual positive observations, answering "Of all the actual positive cases, how many did the model find?".  

Let's cover these topics:

  1. "The Building Blocks: Understanding TP, TN, FP, and FN"
    • Start with the foundation
    • Use real examples (email spam, medical tests)
  2. "The Confusion Matrix: Your Performance Dashboard"
    • Visual representation of the building blocks
    • How to read and interpret it
  3. "Accuracy: The Misleading Metric"
    • Why everyone starts here
    • Why it often fails (the 99% accuracy trap)
  4. "Precision: When False Alarms Are Costly"
    • "Of all my positive predictions, how many were correct?"
    • When to optimize for precision
  5. "Recall (Sensitivity): When Missing Cases Is Dangerous"
    • "Of all actual positives, how many did I catch?"
    • When to optimize for recall
  6. "F1 Score: The Balanced Compromise"
    • Harmonic mean explained simply
    • When F1 is and isn't appropriate
  7. "ROC Curve & AUC: The Big Picture"
    • Performance across all thresholds
    • What AUC really means
  8. "Precision-Recall Curve: The Better Choice for Imbalanced Data"
    • When to use instead of ROC
    • Real-world applications

Important:

  1. F1 Score

    • Harmonic mean of precision and recall
    • When and why to use it
    • F-beta score variations
  2. Accuracy (and its limitations)

    • Why accuracy can be misleading
    • The accuracy paradox with imbalanced datasets
  3. True/False Positives and Negatives (TP, TN, FP, FN)

    • Clear definitions with examples
    • How they build the confusion matrix
  4. ROC Curve and AUC

    • Receiver Operating Characteristic curve
    • Area Under the Curve
    • Visual interpretation
  5. Precision-Recall Curve

    • When it's better than ROC
    • Especially for imbalanced datasets

Important Supporting Topics:

  1. Class Imbalance Problem

    • Why it matters for metrics
    • Which metrics are robust to imbalance
  2. Threshold Tuning

    • Moving beyond default 0.5
    • Trading off precision vs recall
  3. Multi-class Classification Metrics

    • Macro vs Micro vs Weighted averaging
    • One-vs-Rest approach
  4. Specificity and Sensitivity

    • Medical/diagnostic terminology
    • How sensitivity maps to recall and specificity to the false positive rate
  5. Real-World Examples

    • Medical diagnosis (cancer detection)
    • Spam filtering
    • Fraud detection
    • Each showing different metric priorities

Advanced But Useful:

  1. Matthews Correlation Coefficient (MCC)

    • Balanced measure even for imbalanced classes
  2. Cohen's Kappa

    • Agreement beyond chance
  3. Cost-Sensitive Learning

    • When false positives/negatives have different costs

Visual Elements to Include:

  • Confusion matrix heatmaps
  • Precision-Recall trade-off graphs
  • ROC curve comparisons
  • Interactive threshold slider (if possible)

Common Pitfalls Section:

  • Why 99% accuracy might be terrible
  • When to optimize for precision vs recall
  • The base rate fallacy


B. Class Imbalance

What is Class Imbalance?

Class imbalance occurs when one class has significantly more samples than another in your dataset.

Example:

  • Credit card fraud: 99.8% legitimate transactions, 0.2% fraud
  • Disease detection: 95% healthy patients, 5% diseased
  • Email classification: 90% normal emails, 10% spam

Why is it a Problem?

The Accuracy Trap:

Dataset: 990 normal transactions, 10 fraudulent
Model predicts: "Everything is normal"
Accuracy: 99%! 
But... caught 0 frauds 

The model looks great (99% accurate) but is completely useless - it never catches fraud!
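
The same trap can be reproduced in a few lines. This is a minimal sketch with made-up labels (990 normal, 10 fraudulent), assuming scikit-learn is available:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 990 normal (0) and 10 fraudulent (1) transactions
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that simply predicts "normal" for everything
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.99 -> looks great
print(recall_score(y_true, y_pred))    # 0.0  -> catches zero frauds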

Problems Caused by Imbalance:

  1. Model bias - Algorithms favor the majority class
  2. Poor minority class detection - Rare events get ignored
  3. Misleading metrics - High accuracy hides poor performance
  4. Learning difficulty - Not enough minority examples to learn patterns

Techniques to Handle Class Imbalance

1. Resampling Techniques

A. Oversampling (Increase Minority Class)

  • Random Oversampling: Duplicate minority samples
  • SMOTE: Generate synthetic samples (see detailed explanation below)
  • ADASYN: Adaptive synthetic sampling

B. Undersampling (Reduce Majority Class)

  • Random Undersampling: Remove majority samples randomly
  • Tomek Links: Remove borderline majority samples
  • Edited Nearest Neighbors: Remove noisy samples

C. Combination Methods

  • SMOTEENN: SMOTE + Edited Nearest Neighbors
  • SMOTETomek: SMOTE + Tomek Links

2. Algorithm-Level Approaches

Class Weight Adjustment:

# In sklearn
from sklearn.linear_model import LogisticRegression

# class_weight='balanced' automatically adjusts weights inversely
# proportional to class frequencies in the training data
model = LogisticRegression(class_weight='balanced')

Cost-Sensitive Learning:

  • Assign higher penalty for misclassifying minority class
  • Custom loss functions
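
As a rough sketch, most scikit-learn estimators let you express these costs directly. The 10:1 weighting below is a made-up illustration, not a recommended value:

from sklearn.linear_model import LogisticRegression

# Hypothetical costs: a minority-class (label 1) mistake is treated
# as 10x more expensive than a majority-class (label 0) mistake
model = LogisticRegression(class_weight={0: 1, 1: 10})

# Per-sample costs are also possible at training time, e.g.:
# model.fit(X_train, y_train, sample_weight=cost_per_sample)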

3. Ensemble Methods

  • BalancedRandomForest: Balances each bootstrap sample
  • EasyEnsemble: Multiple undersampled subsets
  • RUSBoost: Boosting with undersampling
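
For illustration, imbalanced-learn ships these as ready-made estimators. A minimal sketch using BalancedRandomForestClassifier (X_train and y_train are assumed to exist):

from imblearn.ensemble import BalancedRandomForestClassifier

# Each tree is fit on a bootstrap sample in which the majority
# class has been randomly undersampled to match the minority class
clf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)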

4. Metric Selection

Instead of accuracy, use:

  • Precision, Recall, F1-Score
  • Area Under Precision-Recall Curve
  • Matthews Correlation Coefficient
  • Balanced Accuracy
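
All of these are one-liners in scikit-learn. A sketch assuming y_test (true labels), y_pred (hard predictions), and y_scores (predicted positive-class probabilities) already exist:

from sklearn.metrics import (f1_score, matthews_corrcoef,
                             balanced_accuracy_score, average_precision_score)

print(f1_score(y_test, y_pred))                   # precision/recall balance
print(matthews_corrcoef(y_test, y_pred))          # MCC, robust to imbalance
print(balanced_accuracy_score(y_test, y_pred))    # mean of per-class recall
print(average_precision_score(y_test, y_scores))  # area under the PR curve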

5. Threshold Moving

  • Adjust decision threshold (not always 0.5)
  • Optimize for business objectives
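
A minimal sketch of threshold moving, assuming a fitted probabilistic classifier named model and test features X_test:

# Probability of the positive (minority) class
probs = model.predict_proba(X_test)[:, 1]

# Lowering the threshold below the default 0.5 flags more positives,
# which typically raises recall at the cost of precision
threshold = 0.3
y_pred = (probs >= threshold).astype(int)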

SMOTE (Synthetic Minority Over-sampling Technique)

What is SMOTE?

SMOTE creates synthetic samples of the minority class rather than just duplicating existing ones.

How SMOTE Works:

Step-by-step Process:

  1. Select a minority sample

    Point A: [age=25, income=30K]
    
  2. Find k nearest neighbors (typically k=5)

    Neighbor B: [age=27, income=32K]
    Neighbor C: [age=23, income=28K]
    
  3. Randomly choose one neighbor

    Selected: Neighbor B
    
  4. Create synthetic sample along the line between them

    Random factor λ = 0.7
    Synthetic = A + λ × (B - A)
    New point: [age=26.4, income=31.4K]
    
  5. Repeat until desired balance
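
The interpolation step above is just a weighted average of two feature vectors. A tiny NumPy sketch reproducing the worked example (λ is fixed at 0.7 here; real SMOTE draws it uniformly from [0, 1)):

import numpy as np

A = np.array([25.0, 30_000.0])  # selected minority sample [age, income]
B = np.array([27.0, 32_000.0])  # one of its k nearest minority neighbors

lam = 0.7                       # random factor λ
synthetic = A + lam * (B - A)

print(synthetic)                # [26.4, 31400.] -> the new synthetic point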

Visual Example:

Before SMOTE:           After SMOTE:
○ ○ ○ ○ ○              ○ ○ ○ ○ ○
○ ○ ○ ○ ○              ○ ○ ○ ○ ○
  ●                    ● ◆ ●
    ●                  ◆ ● ◆
                       ◆   ◆

○ = Majority class
● = Original minority
◆ = Synthetic samples

SMOTE Code Example:

from imblearn.over_sampling import SMOTE

# X_train, y_train: original imbalanced training data
# (e.g., 990 normal transactions, 10 fraudulent)

# Apply SMOTE to the training set only (never to the test set)
smote = SMOTE(sampling_strategy='auto', k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Now balanced: 990 normal, 990 fraud (synthetic samples included)

SMOTE Advantages:

  • Creates diverse synthetic samples
  • Reduces overfitting vs simple duplication
  • Helps decision boundary

SMOTE Limitations:

  • Can create noisy samples if minority samples are outliers
  • Doesn't consider majority class distribution
  • May not work well with high-dimensional data

SMOTE Variants:

  1. BorderlineSMOTE: Focus on borderline minority samples
  2. SVMSMOTE: Uses SVM to find support vectors
  3. KMeansSMOTE: Clusters before applying SMOTE
  4. SMOTE-NC: Handles categorical features

Best Practices for Imbalanced Data

Recommended Workflow:

  1. Start with proper metrics (not accuracy)
  2. Try algorithm-level solutions first (class weights)
  3. Experiment with resampling if needed
  4. Consider ensemble methods for best results
  5. Always validate on original distribution

Common Mistakes to Avoid:

  1. Don't apply SMOTE before the split - Apply it only to the training data (see the pipeline sketch after this list)
  2. Don't evaluate on balanced data - Test on original distribution
  3. Don't ignore domain knowledge - Sometimes imbalance reflects reality
  4. Don't balance unnecessarily - Slight imbalance (60:40) often okay
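
For mistake 1, the usual safeguard is to put the sampler inside a pipeline so it only ever sees training folds. A sketch using imbalanced-learn's Pipeline (X and y are assumed to be the full feature matrix and labels):

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# SMOTE is applied inside each training fold only; every validation
# fold keeps the original, imbalanced class distribution
pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('clf', LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5, scoring='f1')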

Quick Decision Guide:

Imbalance Ratio      Recommended Approach
< 1:3                Often no action needed
1:3 to 1:10          Class weights or simple resampling
1:10 to 1:100        SMOTE + ensemble methods
> 1:100              Anomaly detection approach

Remember: The goal isn't perfect balance, but better minority class detection while maintaining overall performance!

C. Understanding Binary Classification Outcomes - False Positives, False Negatives, and More

  • True Positive: The model predicts Positive and the reality is Positive. It correctly identified the specific condition or target you wanted.

  • False Positive (Type I Error): The model predicts Positive, but reality is Negative. A "False Alarm" where it incorrectly flags something harmless as the target.

  • True Negative: The model predicts Negative and the reality is Negative. It correctly recognized that the target condition was not present.

  • False Negative (Type II Error): The model predicts Negative, but reality is Positive. A "Miss" where the model completely failed to catch the target condition.

Think of it Like a Metal Detector at School:

Your school has a metal detector at the entrance to catch weapons. Every backpack gets scanned:

  • Positive (1) = Weapon detected (the dangerous thing we're looking for)
  • Negative (0) = Safe backpack (normal school supplies)

The Four Possible Outcomes:

1. TRUE POSITIVE - "Good Catch!" 

  • Detector said: "Weapon!"
  • Reality: Kid HAD a knife
  • Result: Dangerous item stopped, school stays safe ✓

2. TRUE NEGATIVE - "Clear to Go" 

  • Detector said: "All clear"
  • Reality: Just books and lunch
  • Result: Student walks through normally ✓

3. FALSE POSITIVE - "False Alarm!" 

  • Detector said: "Weapon!"
  • Reality: Metal ruler in geometry set
  • Result: Embarrassing bag search, late to class

4. FALSE NEGATIVE - "Totally Missed It" 

  • Detector said: "All clear"
  • Reality: Ceramic knife went through
  • Result: Dangerous item got into school

The Easy Memory Trick:

Second word = What the detector beeped:

  • Positive = BEEP! (Alert!)
  • Negative = Silence (All clear)

First word = Was it right?

  • True = Correct call ✓
  • False = Wrong call ✗

Why Different Mistakes Matter:

Medical Test for Strep Throat:

  • False Negative = Send sick kid to school (infects everyone) 🤒
  • False Positive = Take antibiotics unnecessarily (not ideal but safer)

Face ID on Your Phone:

  • False Negative = Won't unlock for YOU (annoying!)
  • False Positive = Unlocks for stranger (security breach!)

The Confusion Matrix:

It's just a 2×2 box showing these four outcomes - like a report card for your model showing where it gets "confused" between real threats and false alarms!

The goal? Maximize the "True" ones and minimize the "False" ones - but sometimes one type of mistake is WAY worse than the other!

D. Why Your Model's 99% Accuracy Might Be Lying to You: Understanding Precision, Recall, Confusion Matrix and F1-Score


Confusion Matrix: A 2×2 grid showing four possible outcomes when predicting binary (yes/no) results:

  • True Positive (TP): Correctly predicted the positive class
  • True Negative (TN): Correctly predicted the negative class
  • False Positive (FP): Wrongly predicted positive (false alarm)
  • False Negative (FN): Wrongly predicted negative (missed it)

Accuracy: (TP + TN) / Total predictions - the percentage you got right overall.

Precision: TP / (TP + FP) - Of everything you called positive, what percentage was actually positive? Answers: "How trustworthy are my positive predictions?"

Recall (Sensitivity): TP / (TP + FN) - Of all actual positives, what percentage did you catch? Answers: "Did I find all the important cases?"
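
These four formulas map directly onto scikit-learn. A small sketch with made-up labels (8 predictions) to show where each number comes from:

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Hypothetical true labels and predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                   # 3 3 1 1
print(accuracy_score(y_true, y_pred))   # (TP + TN) / total = 6/8 = 0.75
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4 = 0.75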

The 99% Accuracy Trap:

Imagine detecting credit card fraud where only 1% of transactions are fraudulent. A model that predicts "not fraud" for everything achieves 99% accuracy but catches zero fraud - completely useless!

This model has:

  • Accuracy: 99% ✓ (looks amazing!)
  • Precision: Undefined (never predicts fraud)
  • Recall: 0% ✗ (catches no fraud)

The lesson: In imbalanced datasets, accuracy hides failure. Precision tells you about false alarms, while recall reveals what you're missing. Always check all three metrics - especially when one class is rare but important.

What is F1 Score?

F1 Score is the harmonic mean of Precision and Recall:

Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)

Example Data:


  • Suppose a small medical test gives these counts: TP = 6, TN = 7, FP = 3, FN = 4 (20 predictions in total)
  • Accuracy: (6 + 7) / 20 = 0.65 (65%)
  • Precision: 6 / 9 ≈ 0.67 (67%)
  • Recall: 6 / 10 = 0.60 (60%)

F1 Score calculation: 2 × (0.67 × 0.60) / (0.67 + 0.60) ≈ 0.80 / 1.27 ≈ 0.63
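
The same numbers, computed directly from the four counts (a quick sanity-check sketch):

tp, tn, fp, fn = 6, 7, 3, 4

accuracy  = (tp + tn) / (tp + tn + fp + fn)          # 13/20 = 0.65
precision = tp / (tp + fp)                           # 6/9  ≈ 0.67
recall    = tp / (tp + fn)                           # 6/10 = 0.60
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.63

print(accuracy, precision, recall, f1)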

Tips for color coding (not needed but helps):

  1. Color coding Green for correct (TP/TN), Red for errors (FP/FN)
  2. Add intuitive explanations:
    • Precision (67%): "When we say someone has the disease, we're right 2 out of 3 times"
    • Recall (60%): "We catch 6 out of 10 people who actually have the disease"
    • F1 (63%): "Overall balance between precision and recall"
  3. Real-world interpretation:
    • 4 sick patients were sent home (FN) - dangerous!
    • 3 healthy patients were told they're sick (FP) - stressful but safer

The 65% accuracy looks "okay" but missing 40% of sick patients (recall=60%) could be life-threatening in real medical scenarios.

Significance of F1 Score?

F1 Score is the harmonic mean of Precision and Recall:

Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)

Why Not Just Average Them?

The harmonic mean punishes extreme values. Consider two models:

Model A:

  • Precision: 100%, Recall: 10%
  • Simple Average: (100 + 10) / 2 = 55%
  • F1 Score: 2 × (1.0 × 0.1) / (1.0 + 0.1) = 18%

Model B:

  • Precision: 60%, Recall: 60%
  • Simple Average: 60%
  • F1 Score: 60%

Model A catches almost nothing despite perfect precision - F1 reveals this weakness!
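
The two models above in a few lines of Python, to make the penalty concrete:

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print((1.0 + 0.1) / 2)  # Model A, simple average: 0.55 (flattering)
print(f1(1.0, 0.1))     # Model A, F1: ~0.18 (exposes the weak recall)
print(f1(0.6, 0.6))     # Model B, F1: 0.60 (average and F1 agree)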

When to Use F1:

F1 is ideal when:

  • You need balance between precision and recall
  • False positives and false negatives are equally bad
  • You have imbalanced classes

Real Example: In spam detection:

  • High Precision only = Few false alarms but miss lots of spam
  • High Recall only = Catch all spam but many false alarms
  • High F1 = Good balance - catches most spam with minimal false alarms

Think of F1 as: "How well-rounded is my model?" A score of 0.8+ means strong performance in BOTH precision and recall, not just one.

F1-Score harmonically balances precision and recall into a single metric. When you can't afford to optimize just one, F1 provides the sweet spot. It penalizes extreme imbalances - a model with perfect precision but terrible recall still gets a poor F1.

F1 Score Range

The Range: 0 to 1 (or 0% to 100%)

  • F1 = 0: Worst possible - either precision or recall (or both) is zero
  • F1 = 1: Perfect score - both precision AND recall are perfect (100%)

Interpreting F1 Scores:

F1 Score     Interpretation     Real-World Meaning
0.0 - 0.3    Poor               Model is failing badly
0.3 - 0.5    Below Average      Needs significant improvement
0.5 - 0.7    Average            Acceptable for some use cases
0.7 - 0.8    Good               Solid performance
0.8 - 0.9    Very Good          Strong model
0.9 - 1.0    Excellent          Outstanding (rare in practice)

Key Points:

  1. F1 is always pulled toward the lower of precision and recall

    • If Precision = 90% and Recall = 60%, F1 is only 72% - well below the simple average of 75%
  2. F1 = 0 happens when:

    • Model predicts all negative (Precision undefined, Recall = 0)
    • Model predicts all positive for negative-only data (Precision = 0)
  3. F1 = 1 requires:

    • Precision = 100% (no false positives)
    • Recall = 100% (no false negatives)
    • Practically impossible in real-world problems

Typical good scores: Most production models achieve F1 scores between 0.6-0.85 depending on the problem difficulty.

All Three Metrics [Recall, Precision, F1 Score] Have the Same "Best" Value: 1 (or 100%)

Best F1 Score = 1

  • Means both precision and recall are perfect
  • Extremely rare in practice

Best Recall = 1

  • You caught ALL positive cases (no false negatives)
  • Example: Found all 100 cancer patients out of 100

Best Precision = 1

  • ALL your positive predictions were correct (no false positives)
  • Example: Every time you said "cancer," you were right

But Here's the Important Reality:

Getting all three to 1 is nearly impossible because:

  • Perfect Recall (1.0) often means being overly aggressive - calling many things positive to catch everything, which hurts precision
  • Perfect Precision (1.0) often means being overly conservative - only calling the super obvious cases positive, which hurts recall
  • Perfect F1 (1.0) requires BOTH to be perfect simultaneously

Real-World "Good" Scores:

  • Recall: 0.8-0.9 is excellent
  • Precision: 0.8-0.9 is excellent
  • F1: 0.7-0.85 is very good

The Trade-off:

Usually, you optimize for one based on your use case:

  • Medical screening: Maximize recall (catch all diseases)
  • Spam filtering: Balance both (F1)
  • Legal document classification: Maximize precision (avoid false accusations)

So yes, mathematically 1 is best for all three, but practically, you rarely achieve it!

ROC Curves and AUC Explained

What an ROC Plot Shows:

ROC (Receiver Operating Characteristic) Curve plots:

  • X-axis: False Positive Rate (1 - Specificity) = FP/(FP+TN)
  • Y-axis: True Positive Rate (Recall/Sensitivity) = TP/(TP+FN)

Understanding the Lines (an example plot comparing two models):

  1. Diagonal Dotted Line = Random guessing (coin flip)

    • AUC = 0.5
    • Useless model
  2. Blue Curve = "Better model"

    • AUC = 0.9216
    • Excellent performance
  3. Orange Curve = "Worse model"

    • AUC = 0.9062
    • Still very good, but slightly worse

What AUC (Area Under Curve) Means:

  • AUC = 1.0: Perfect classifier
  • AUC = 0.9-1.0: Excellent (both models here!)
  • AUC = 0.8-0.9: Good
  • AUC = 0.7-0.8: Acceptable
  • AUC = 0.5: No better than random
  • AUC < 0.5: Worse than random (but flip predictions!)

Key Insights:

  1. The curves show ALL possible thresholds - not just 0.5

    • Each point = different threshold setting
    • Moving right = lower threshold (more positive predictions)
  2. Closer to top-left corner = Better

    • Top-left = 100% TPR, 0% FPR (perfect)
    • The blue curve reaches higher faster
  3. Why AUC Matters:

    • Single number comparing models
    • Threshold-independent (tests all cutoffs)
    • Works for imbalanced datasets

Practical meaning: The blue model (0.9216) correctly ranks a random positive example higher than a random negative example 92.16% of the time!
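
A sketch of how such a plot is typically produced with scikit-learn and matplotlib, assuming y_test (true labels) and y_scores (predicted positive-class probabilities) already exist:

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Compute TPR/FPR at every possible threshold, plus the single AUC number
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
auc = roc_auc_score(y_test, y_scores)

plt.plot(fpr, tpr, label=f'model (AUC = {auc:.4f})')
plt.plot([0, 1], [0, 1], linestyle='--', label='random guessing (AUC = 0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.legend()
plt.show()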

