
Classification Metrics - Confusion Matrix, Precision, Recall to ROC Curves


Topics:

A. Classification Metrics

B. Class Imbalance


A. Classification Metrics

Precision and recall are key metrics used to evaluate a machine learning model's performance, calculated using a confusion matrix. Precision measures the ratio of correctly predicted positive observations to the total number of positive predictions, answering "Of all the times the model predicted 'yes,' how often was it correct?". Recall measures the ratio of correctly predicted positive observations to all actual positive observations, answering "Of all the actual positive cases, how many did the model find?".  

Let's cover these topics:

  1. "The Building Blocks: Understanding TP, TN, FP, and FN"
    • Start with the foundation
    • Use real examples (email spam, medical tests)
  2. "The Confusion Matrix: Your Performance Dashboard"
    • Visual representation of the building blocks
    • How to read and interpret it
  3. "Accuracy: The Misleading Metric"
    • Why everyone starts here
    • Why it often fails (the 99% accuracy trap)
  4. "Precision: When False Alarms Are Costly"
    • "Of all my positive predictions, how many were correct?"
    • When to optimize for precision
  5. "Recall (Sensitivity): When Missing Cases Is Dangerous"
    • "Of all actual positives, how many did I catch?"
    • When to optimize for recall
  6. "F1 Score: The Balanced Compromise"
    • Harmonic mean explained simply
    • When F1 is and isn't appropriate
  7. "ROC Curve & AUC: The Big Picture"
    • Performance across all thresholds
    • What AUC really means
  8. "Precision-Recall Curve: The Better Choice for Imbalanced Data"
    • When to use instead of ROC
    • Real-world applications

Important:

  1. F1 Score

    • Harmonic mean of precision and recall
    • When and why to use it
    • F-beta score variations
  2. Accuracy (and its limitations)

    • Why accuracy can be misleading
    • The accuracy paradox with imbalanced datasets
  3. True/False Positives and Negatives (TP, TN, FP, FN)

    • Clear definitions with examples
    • How they build the confusion matrix
  4. ROC Curve and AUC

    • Receiver Operating Characteristic curve
    • Area Under the Curve
    • Visual interpretation
  5. Precision-Recall Curve

    • When it's better than ROC
    • Especially for imbalanced datasets

Important Supporting Topics:

  1. Class Imbalance Problem

    • Why it matters for metrics
    • Which metrics are robust to imbalance
  2. Threshold Tuning

    • Moving beyond default 0.5
    • Trading off precision vs recall
  3. Multi-class Classification Metrics

    • Macro vs Micro vs Weighted averaging
    • One-vs-Rest approach
  4. Specificity and Sensitivity

    • Medical/diagnostic terminology
    • Relationship to recall and precision
  5. Real-World Examples

    • Medical diagnosis (cancer detection)
    • Spam filtering
    • Fraud detection
    • Each showing different metric priorities

Advanced But Useful:

  1. Matthews Correlation Coefficient (MCC)

    • Balanced measure even for imbalanced classes
  2. Cohen's Kappa

    • Agreement beyond chance
  3. Cost-Sensitive Learning

    • When false positives/negatives have different costs

Visual Elements to Include:

  • Confusion matrix heatmaps
  • Precision-Recall trade-off graphs
  • ROC curve comparisons
  • Interactive threshold slider (if possible)

Common Pitfalls Section:

  • Why 99% accuracy might be terrible
  • When to optimize for precision vs recall
  • The base rate fallacy


B. Class Imbalance

What is Class Imbalance?

Class imbalance occurs when one class has significantly more samples than another in your dataset.

Example:

  • Credit card fraud: 99.8% legitimate transactions, 0.2% fraud
  • Disease detection: 95% healthy patients, 5% diseased
  • Email classification: 90% normal emails, 10% spam

Why is it a Problem?

The Accuracy Trap:

Dataset: 990 normal transactions, 10 fraudulent
Model predicts: "Everything is normal"
Accuracy: 99%! 
But... caught 0 frauds 

The model looks great (99% accurate) but is completely useless - it never catches fraud!
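To make the trap concrete, here is a minimal sketch in plain Python that rebuilds the 990/10 example and scores the "everything is normal" model. The label encoding (1 = fraud, 0 = normal) is an assumption made for illustration.

```python
# Labels: 1 = fraud, 0 = normal (the 990/10 split from the example above).
y_true = [0] * 990 + [1] * 10

# A "model" that always predicts normal.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

frauds_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = frauds_caught / sum(y_true)

print(f"accuracy     = {accuracy:.2%}")  # 99.00%
print(f"fraud recall = {recall:.2%}")    # 0.00%
```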

Problems Caused by Imbalance:

  1. Model bias - Algorithms favor the majority class
  2. Poor minority class detection - Rare events get ignored
  3. Misleading metrics - High accuracy hides poor performance
  4. Learning difficulty - Not enough minority examples to learn patterns

Techniques to Handle Class Imbalance

1. Resampling Techniques

A. Oversampling (Increase Minority Class)

  • Random Oversampling: Duplicate minority samples
  • SMOTE: Generate synthetic samples (see detailed explanation below)
  • ADASYN: Adaptive synthetic sampling

B. Undersampling (Reduce Majority Class)

  • Random Undersampling: Remove majority samples randomly
  • Tomek Links: Remove borderline majority samples
  • Edited Nearest Neighbors: Remove noisy samples

C. Combination Methods

  • SMOTEENN: SMOTE + Edited Nearest Neighbors
  • SMOTETomek: SMOTE + Tomek Links

2. Algorithm-Level Approaches

Class Weight Adjustment:

# In sklearn
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight='balanced')
# Automatically adjusts weights inversely proportional to class frequencies
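
For intuition, the 'balanced' setting follows the formula given in the scikit-learn documentation: n_samples / (n_classes * class_count). The helper below is a hypothetical stand-in written in plain Python to show the arithmetic, not the library's own code.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Same 990 normal / 10 fraud split as the accuracy-trap example.
y = [0] * 990 + [1] * 10
weights = balanced_class_weights(y)
print(weights)  # the fraud class (1) gets weight 50.0, the normal class about 0.505
```

With these weights, one misclassified fraud case costs the model roughly as much as 99 misclassified normal cases.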

Cost-Sensitive Learning:

  • Assign higher penalty for misclassifying minority class
  • Custom loss functions

3. Ensemble Methods

  • BalancedRandomForest: Balances each bootstrap sample
  • EasyEnsemble: Multiple undersampled subsets
  • RUSBoost: Boosting with undersampling

4. Metric Selection

Instead of accuracy, use:

  • Precision, Recall, F1-Score
  • Area Under Precision-Recall Curve
  • Matthews Correlation Coefficient
  • Balanced Accuracy
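
As a rough sketch of why these metrics are harder to fool, the snippet below scores the "everything is normal" model with Balanced Accuracy and MCC. The function names are hypothetical helpers, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
import math

def balanced_accuracy(tp, tn, fp, fn):
    """Mean of recall on the positive class and recall on the negative class."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return (tpr + tnr) / 2

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# "Everything is normal" on 990 normal / 10 fraud: TP=0, TN=990, FP=0, FN=10
always_normal = dict(tp=0, tn=990, fp=0, fn=10)
print(balanced_accuracy(**always_normal))  # 0.5
print(mcc(**always_normal))                # 0.0
```

Accuracy says 99%, while both of these report a model that is no better than chance.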

5. Threshold Moving

  • Adjust decision threshold (not always 0.5)
  • Optimize for business objectives

SMOTE (Synthetic Minority Over-sampling Technique)

What is SMOTE?

SMOTE creates synthetic samples of the minority class rather than just duplicating existing ones.

How SMOTE Works:

Step-by-step Process:

  1. Select a minority sample

    Point A: [age=25, income=30K]
    
  2. Find k nearest neighbors (typically k=5)

    Neighbor B: [age=27, income=32K]
    Neighbor C: [age=23, income=28K]
    
  3. Randomly choose one neighbor

    Selected: Neighbor B
    
  4. Create synthetic sample along the line between them

    Random factor λ = 0.7
    Synthetic = A + λ × (B - A)
    New point: [age=26.4, income=31.4K]
    
  5. Repeat until desired balance
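
Step 4 is just linear interpolation. The sketch below reproduces the worked numbers in plain Python; `smote_point` is a hypothetical helper for illustration, and the real imblearn implementation also handles the neighbor search and sampling for you.

```python
import random

def smote_point(sample, neighbor, lam):
    """Interpolate: sample + lam * (neighbor - sample), feature by feature."""
    return [a + lam * (b - a) for a, b in zip(sample, neighbor)]

point_a = [25.0, 30.0]      # [age, income in thousands]
neighbor_b = [27.0, 32.0]

# Reproduce the worked example with lambda = 0.7
synthetic = smote_point(point_a, neighbor_b, 0.7)
print([round(v, 1) for v in synthetic])  # [26.4, 31.4]

# In practice lambda is drawn at random from [0, 1] for each new sample
rng = random.Random(0)
another = smote_point(point_a, neighbor_b, rng.random())
```

Because lambda stays between 0 and 1, every synthetic point lands on the segment between the two real samples.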

Visual Example:

Before SMOTE:           After SMOTE:
○ ○ ○ ○ ○              ○ ○ ○ ○ ○
○ ○ ○ ○ ○              ○ ○ ○ ○ ○
  ●                    ● ◆ ●
    ●                  ◆ ● ◆
                       ◆   ◆

○ = Majority class
● = Original minority
◆ = Synthetic samples

SMOTE Code Example:

from imblearn.over_sampling import SMOTE

# X_train, y_train: the original imbalanced training data (990 normal, 10 fraud),
# assumed to be defined earlier

# Apply SMOTE
smote = SMOTE(sampling_strategy='auto', k_neighbors=5)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Now balanced: 990 normal, 990 fraud (synthetic included)

SMOTE Advantages:

  • Creates diverse synthetic samples
  • Reduces overfitting vs simple duplication
  • Helps decision boundary

SMOTE Limitations:

  • Can create noisy samples if minority samples are outliers
  • Doesn't consider majority class distribution
  • May not work well with high-dimensional data

SMOTE Variants:

  1. BorderlineSMOTE: Focus on borderline minority samples
  2. SVMSMOTE: Uses SVM to find support vectors
  3. KMeansSMOTE: Clusters before applying SMOTE
  4. SMOTE-NC: Handles categorical features

Best Practices for Imbalanced Data

Recommended Workflow:

  1. Start with proper metrics (not accuracy)
  2. Try algorithm-level solutions first (class weights)
  3. Experiment with resampling if needed
  4. Consider ensemble methods for best results
  5. Always validate on original distribution

Common Mistakes to Avoid:

  1. Don't apply SMOTE before split - Apply only to training data
  2. Don't evaluate on balanced data - Test on original distribution
  3. Don't ignore domain knowledge - Sometimes imbalance reflects reality
  4. Don't balance unnecessarily - Slight imbalance (60:40) often okay
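
Mistakes 1 and 2 come down to the order of operations: split first, then resample the training part only. Below is a minimal sketch with toy data. It uses random oversampling (plain duplication) in place of SMOTE to stay dependency-free, and the 80/20 split and variable names are assumptions for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Toy rows of (feature, label): 990 normal (0) and 10 fraud (1).
normal = [(rng.random(), 0) for _ in range(990)]
fraud = [(rng.random(), 1) for _ in range(10)]

def split_80_20(rows):
    """Return (train, test) with the first 80% of rows in train."""
    cut = len(rows) * 4 // 5
    return rows[:cut], rows[cut:]

# 1) Split first, class by class, so the test set keeps the original ratio.
train_normal, test_normal = split_80_20(normal)
train_fraud, test_fraud = split_80_20(fraud)

# 2) Oversample the minority class in the TRAINING split only.
n_extra = len(train_normal) - len(train_fraud)
extra = [rng.choice(train_fraud) for _ in range(n_extra)]
train = train_normal + train_fraud + extra
test = test_normal + test_fraud  # untouched by resampling

print(len(train_normal), len(train_fraud) + len(extra))  # 792 792
print(len(test_normal), len(test_fraud))                 # 198 2
```

Every duplicated row comes from the training split, so nothing from the test set leaks into training.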

Quick Decision Guide:

Imbalance Ratio | Recommended Approach
< 1:3           | Often no action needed
1:3 to 1:10     | Class weights or simple resampling
1:10 to 1:100   | SMOTE + ensemble methods
> 1:100         | Anomaly detection approach

Remember: The goal isn't perfect balance, but better minority class detection while maintaining overall performance!

L. Understanding Binary Classification Outcomes - False Negative/False Positive/Etc.

  • True Positive: The model predicts Positive and the reality is Positive. It correctly identified the specific condition or target you wanted.

  • False Positive (Type I Error): The model predicts Positive, but reality is Negative. A "False Alarm" where it incorrectly flags something harmless as the target.

  • True Negative: The model predicts Negative and the reality is Negative. It correctly recognized that the target condition was not present.

  • False Negative (Type II Error): The model predicts Negative, but reality is Positive. A "Miss" where the model completely failed to catch the target condition.

Think of it Like a Metal Detector at School:

Your school has a metal detector at the entrance to catch weapons. Every backpack gets scanned:

  • Positive (1) = Weapon detected (the dangerous thing we're looking for)
  • Negative (0) = Safe backpack (normal school supplies)

The Four Possible Outcomes:

1. TRUE POSITIVE - "Good Catch!" 

  • Detector said: "Weapon!"
  • Reality: Kid HAD a knife
  • Result: Dangerous item stopped, school stays safe ✓

2. TRUE NEGATIVE - "Clear to Go" 

  • Detector said: "All clear"
  • Reality: Just books and lunch
  • Result: Student walks through normally ✓

3. FALSE POSITIVE - "False Alarm!" 

  • Detector said: "Weapon!"
  • Reality: Metal ruler in geometry set
  • Result: Embarrassing bag search, late to class

4. FALSE NEGATIVE - "Totally Missed It" 

  • Detector said: "All clear"
  • Reality: Ceramic knife went through
  • Result: Dangerous item got into school

The Easy Memory Trick:

Second word = What the detector beeped:

  • Positive = BEEP! (Alert!)
  • Negative = Silence (All clear)

First word = Was it right?

  • True = Correct call ✓
  • False = Wrong call ✗

Why Different Mistakes Matter:

Medical Test for Strep Throat:

  • False Negative = Send sick kid to school (infects everyone) 🤒
  • False Positive = Take antibiotics unnecessarily (not ideal but safer)

Face ID on Your Phone:

  • False Negative = Won't unlock for YOU (annoying!)
  • False Positive = Unlocks for stranger (security breach!)

The Confusion Matrix:

It's just a 2×2 box showing these four outcomes - like a report card for your model showing where it gets "confused" between real threats and false alarms!

The goal? Maximize the "True" ones and minimize the "False" ones - but sometimes one type of mistake is WAY worse than the other!

L2. Why Your Model's 99% Accuracy Might Be Lying to You: Understanding Precision, Recall, Confusion Matrix and F1-Score 


Confusion Matrix: A 2×2 grid showing four possible outcomes when predicting binary (yes/no) results:

  • True Positive (TP): Correctly predicted the positive class
  • True Negative (TN): Correctly predicted the negative class
  • False Positive (FP): Wrongly predicted positive (false alarm)
  • False Negative (FN): Wrongly predicted negative (missed it)
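
A minimal sketch of how the four cells are counted, assuming labels are encoded as 1 (positive) and 0 (negative). The sample labels are made up for illustration, and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)

# Print the 2x2 grid: rows are the actual class, columns the predicted class.
print("            pred 1  pred 0")
print(f"actual 1    {tp:>6}  {fn:>6}")
print(f"actual 0    {fp:>6}  {tn:>6}")
```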

Accuracy: (TP + TN) / Total predictions - the percentage you got right overall.

Precision: TP / (TP + FP) - Of everything you called positive, what percentage was actually positive? Answers: "How trustworthy are my positive predictions?"

Recall (Sensitivity): TP / (TP + FN) - Of all actual positives, what percentage did you catch? Answers: "Did I find all the important cases?"

The 99% Accuracy Trap:

Imagine detecting credit card fraud where only 1% of transactions are fraudulent. A model that predicts "not fraud" for everything achieves 99% accuracy but catches zero fraud - completely useless!

This model has:

  • Accuracy: 99% ✓ (looks amazing!)
  • Precision: Undefined (never predicts fraud)
  • Recall: 0% ✗ (catches no fraud)

The lesson: In imbalanced datasets, accuracy hides failure. Precision tells you about false alarms, while recall reveals what you're missing. Always check all three metrics - especially when one class is rare but important.

What is F1 Score?

F1 Score is the harmonic mean of Precision and Recall:

Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)

Example Data:


  • Every prediction is classified as TP, TN, FP, or FN
  • Counts: TP = 6, TN = 7, FP = 3, FN = 4
  • Accuracy: (6 + 7) / 20 = 13/20 = 0.65 (65%)
  • Precision: 6 / (6 + 3) = 6/9 = 0.67 (67%)
  • Recall: 6 / (6 + 4) = 6/10 = 0.60 (60%)

F1 Score calculation: 2 × 0.402 / 1.27 = 0.804 / 1.27 = 0.633
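
The same numbers can be checked in a few lines of Python. Working from the unrounded precision gives F1 = 12/19 ≈ 0.632; the 0.633 above comes from rounding precision to 0.67 before multiplying.

```python
tp, tn, fp, fn = 6, 7, 3, 4

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.650 precision=0.667 recall=0.600 f1=0.632
```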

Tips for color coding (not needed but helps):

  1. Color coding Green for correct (TP/TN), Red for errors (FP/FN)
  2. Add intuitive explanations:
    • Precision (67%): "When we say someone has the disease, we're right 2 out of 3 times"
    • Recall (60%): "We catch 6 out of 10 people who actually have the disease"
    • F1 (63%): "Overall balance between precision and recall"
  3. Real-world interpretation:
    • 4 sick patients were sent home (FN) - dangerous!
    • 3 healthy patients were told they're sick (FP) - stressful but safer

The 65% accuracy looks "okay" but missing 40% of sick patients (recall=60%) could be life-threatening in real medical scenarios.

Significance of F1 Score?

Why Not Just Average Them?

The harmonic mean punishes extreme values. Consider two models:

Model A:

  • Precision: 100%, Recall: 10%
  • Simple Average: (100 + 10) / 2 = 55%
  • F1 Score: 2 × (1.0 × 0.1) / (1.0 + 0.1) = 18%

Model B:

  • Precision: 60%, Recall: 60%
  • Simple Average: 60%
  • F1 Score: 60%

Model A catches almost nothing despite perfect precision - F1 reveals this weakness!
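
Here is a small sketch comparing the two averages for Model A and Model B. `f1_from_pr` is a hypothetical helper name chosen for this example.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

for name, p, r in [("Model A", 1.0, 0.1), ("Model B", 0.6, 0.6)]:
    print(f"{name}: mean={(p + r) / 2:.2f}  F1={f1_from_pr(p, r):.2f}")
# Model A: mean=0.55  F1=0.18
# Model B: mean=0.60  F1=0.60
```

The simple average rates the two models as nearly equal, while F1 separates them clearly.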

When to Use F1:

F1 is ideal when:

  • You need balance between precision and recall
  • False positives and false negatives are equally bad
  • You have imbalanced classes

Real Example: In spam detection:

  • High Precision only = Few false alarms but miss lots of spam
  • High Recall only = Catch all spam but many false alarms
  • High F1 = Good balance - catches most spam with minimal false alarms

Think of F1 as: "How well-rounded is my model?" A score of 0.8+ means strong performance in BOTH precision and recall, not just one.

F1-Score harmonically balances precision and recall into a single metric. When you can't afford to optimize just one, F1 provides the sweet spot. It penalizes extreme imbalances - a model with perfect precision but terrible recall still gets a poor F1.

F1 Score Range

The Range: 0 to 1 (or 0% to 100%)

  • F1 = 0: Worst possible - either precision or recall (or both) is zero
  • F1 = 1: Perfect score - both precision AND recall are perfect (100%)

Interpreting F1 Scores:

F1 Score  | Interpretation | Real-World Meaning
0.0 - 0.3 | Poor           | Model is failing badly
0.3 - 0.5 | Below Average  | Needs significant improvement
0.5 - 0.7 | Average        | Acceptable for some use cases
0.7 - 0.8 | Good           | Solid performance
0.8 - 0.9 | Very Good      | Strong model
0.9 - 1.0 | Excellent      | Outstanding (rare in practice)

Key Points:

  1. F1 always lies between precision and recall, and never exceeds their simple average

    • If Precision = 90% and Recall = 60%, F1 = 72%: above the lower value, but below the 75% simple average
  2. F1 = 0 happens when:

    • Model predicts all negative (Precision undefined, Recall = 0; F1 is reported as 0 by convention)
    • None of the model's positive predictions is correct (TP = 0, so Precision = 0 and Recall = 0)
  3. F1 = 1 requires:

    • Precision = 100% (no false positives)
    • Recall = 100% (no false negatives)
    • Practically impossible in real-world problems

Typical good scores: Most production models achieve F1 scores between 0.6-0.85 depending on the problem difficulty.

All Three Metrics [Recall, Precision, F1 Score] Have the Same "Best" Value: 1 (or 100%)

Best F1 Score = 1

  • Means both precision and recall are perfect
  • Extremely rare in practice

Best Recall = 1

  • You caught ALL positive cases (no false negatives)
  • Example: Found all 100 cancer patients out of 100

Best Precision = 1

  • ALL your positive predictions were correct (no false positives)
  • Example: Every time you said "cancer," you were right

But Here's the Important Reality:

Getting all three to 1 is nearly impossible because:

  • Perfect Recall (1.0) often means being overly aggressive - calling many things positive to catch everything, which hurts precision
  • Perfect Precision (1.0) often means being overly conservative - only calling the super obvious cases positive, which hurts recall
  • Perfect F1 (1.0) requires BOTH to be perfect simultaneously

Real-World "Good" Scores:

  • Recall: 0.8-0.9 is excellent
  • Precision: 0.8-0.9 is excellent
  • F1: 0.7-0.85 is very good

The Trade-off:

Usually, you optimize for one based on your use case:

  • Medical screening: Maximize recall (catch all diseases)
  • Spam filtering: Balance both (F1)
  • Legal document classification: Maximize precision (avoid false accusations)

So yes, mathematically 1 is best for all three, but practically, you rarely achieve it!

ROC Curves and AUC Explained

What This Graph Shows:

ROC (Receiver Operating Characteristic) Curve plots:

  • X-axis: False Positive Rate (1 - Specificity) = FP/(FP+TN)
  • Y-axis: True Positive Rate (Recall/Sensitivity) = TP/(TP+FN)

Understanding the Lines:

  1. Diagonal Dotted Line = Random guessing (coin flip)

    • AUC = 0.5
    • Useless model
  2. Blue Curve = "Better model"

    • AUC = 0.9216
    • Excellent performance
  3. Orange Curve = "Worse model"

    • AUC = 0.9062
    • Still very good, but slightly worse

What AUC (Area Under Curve) Means:

  • AUC = 1.0: Perfect classifier
  • AUC = 0.9-1.0: Excellent (both models here!)
  • AUC = 0.8-0.9: Good
  • AUC = 0.7-0.8: Acceptable
  • AUC = 0.5: No better than random
  • AUC < 0.5: Worse than random (inverting the model's predictions would push AUC above 0.5)

Key Insights:

  1. The curves show ALL possible thresholds - not just 0.5

    • Each point = different threshold setting
    • Moving right = lower threshold (more positive predictions)
  2. Closer to top-left corner = Better

    • Top-left = 100% TPR, 0% FPR (perfect)
    • The blue curve reaches higher faster
  3. Why AUC Matters:

    • Single number comparing models
    • Threshold-independent (tests all cutoffs)
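    • Less distorted by class imbalance than accuracy (though precision-recall curves are often more informative when positives are rare)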

Practical meaning: The blue model (0.9216) correctly ranks a random positive example higher than a random negative example 92.16% of the time!


ROC Curves and AUC

The Core Problem

When you have a binary classifier that outputs probabilities (say, predicting whether an email is spam), you need to choose a threshold to convert those probabilities into yes/no decisions. At threshold 0.5, anything above becomes "positive." But what if you lower it to 0.3? You'll catch more true positives but also more false positives.

The ROC curve captures this entire trade-off in one picture.

Building the ROC Curve

For any threshold, you can compute two rates:

True Positive Rate (Sensitivity/Recall): $$TPR = \frac{TP}{TP + FN}$$ "Of all actual positives, how many did we catch?"

False Positive Rate: $$FPR = \frac{FP}{FP + TN}$$ "Of all actual negatives, how many did we incorrectly flag?"

The ROC curve plots TPR (y-axis) vs FPR (x-axis) as you sweep the threshold from 1.0 down to 0.0.
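
The sweep can be sketched in a few lines of plain Python. The labels and scores below are made up for illustration, and `roc_points` is a hypothetical helper; in practice a library routine such as scikit-learn's `roc_curve` does this for you.

```python
def roc_points(y_true, scores):
    """Return [(FPR, TPR), ...] for thresholds swept from high to low."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Start above every score so the first point is (0, 0).
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for thr in thresholds:
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
for fpr, tpr in roc_points(y_true, scores):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each printed row is one point on the curve, starting at (0, 0) and ending at (1, 1).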

Interpreting the Curve

  • Bottom-left corner (0,0): Threshold = 1.0. You predict nothing as positive. TPR = 0, FPR = 0.
  • Top-right corner (1,1): Threshold = 0.0. You predict everything as positive. TPR = 1, FPR = 1.
  • Top-left corner (0,1): The ideal point—perfect classification.

A diagonal line from (0,0) to (1,1) represents a random classifier (coin flip). Any useful model should bow toward the top-left, away from this diagonal.

AUC: Area Under the Curve

The AUC summarizes the entire ROC curve into a single number between 0 and 1.

Probabilistic interpretation: If you randomly pick one positive and one negative sample, the AUC equals the probability that your model ranks the positive sample higher than the negative one.
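
This interpretation can be computed directly by comparing every positive score with every negative score. A minimal sketch follows, with made-up scores and a hypothetical helper name; counting ties as half a win is a common convention assumed here.

```python
def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(auc_by_ranking(y_true, scores))  # 8 of 9 pairs ranked correctly, about 0.889

# A model that gives every example the same score cannot rank anything.
print(auc_by_ranking([0] * 99 + [1], [0.0] * 100))  # 0.5
```

The pairwise loop is quadratic in the number of samples, so it is meant for intuition rather than large datasets.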

AUC Value | Interpretation
1.0       | Perfect classifier
0.9–1.0   | Excellent
0.8–0.9   | Good
0.7–0.8   | Fair
0.5       | Random guessing
< 0.5     | Worse than random (predictions inverted)

Why Use ROC-AUC?

  1. Threshold-independent: Evaluates the model's ranking ability across all possible operating points.

  2. Class imbalance resilience: Unlike accuracy, ROC-AUC isn't fooled by imbalanced datasets. If 99% of emails are non-spam, a model predicting "not spam" always gets 99% accuracy but AUC = 0.5.

  3. Comparing models: Two models with different optimal thresholds can be directly compared via AUC.

When ROC-AUC Falls Short

  • Severe class imbalance: When negatives vastly outnumber positives, even small FPRs can mean many false alarms in absolute terms. Precision-Recall curves are often better here.

  • Cost-sensitive applications: If the costs of false positives and false negatives differ greatly, you might care more about specific regions of the curve than the overall area.

Quick Example

Imagine a disease screening model:

  • At threshold 0.8: Catches 60% of sick patients (TPR=0.6), flags 5% of healthy ones (FPR=0.05)
  • At threshold 0.3: Catches 95% of sick patients (TPR=0.95), but flags 30% of healthy ones (FPR=0.30)

The ROC curve shows you this entire spectrum, letting clinicians choose based on whether missing a case or unnecessary testing is more costly.
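
One way to make that choice explicit is to attach a cost to each kind of error and compare operating points. The sketch below uses the two thresholds from the example. The 5% prevalence and the 50:1 cost ratio are hypothetical numbers chosen only for illustration.

```python
def expected_cost(tpr, fpr, prevalence, cost_fn, cost_fp):
    """Average cost per screened patient at one operating point."""
    missed = prevalence * (1 - tpr)        # false negatives per patient
    false_alarms = (1 - prevalence) * fpr  # false positives per patient
    return missed * cost_fn + false_alarms * cost_fp

# threshold -> (TPR, FPR), taken from the example above
operating_points = {0.8: (0.60, 0.05), 0.3: (0.95, 0.30)}

# Hypothetical: 5% of patients are sick, a missed case costs 50x a false alarm
costs = {
    thr: expected_cost(tpr, fpr, prevalence=0.05, cost_fn=50.0, cost_fp=1.0)
    for thr, (tpr, fpr) in operating_points.items()
}
for thr, cost in costs.items():
    print(f"threshold {thr}: expected cost {cost:.4f}")
```

Under these assumed costs the lower threshold comes out cheaper; making false alarms more expensive would tilt the choice back toward 0.8.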





ROC AUC (Receiver Operating Characteristic Area Under the Curve) is a key metric in machine learning for evaluating binary classification models, showing how well a model distinguishes between classes (like spam/not spam, disease/no disease) across all possible thresholds. The ROC curve plots the True Positive Rate (Sensitivity) against the False Positive Rate (1-Specificity) at various cutoffs, while the AUC is the single number summarizing this curve, representing the probability that the model ranks a random positive example higher than a random negative one (0.5=random, 1.0=perfect).


How it works

  • ROC Curve: A graph that visualizes a model's performance by plotting Sensitivity (True Positive Rate, TPR) on the y-axis against 1-Specificity (False Positive Rate, FPR) on the x-axis, at different classification thresholds.
  • AUC (Area Under the Curve): The area under the ROC curve, giving a single value between 0 and 1.
    • AUC = 1.0: A perfect classifier.
    • AUC = 0.5: No better than random guessing.
    • AUC > 0.7: Generally considered acceptable.

What it tells you

  • Discriminative Power: It quantifies how well the model can tell the difference between positive and negative classes.
  • Threshold Independence: It provides a single score that summarizes performance across all possible thresholds, making it useful for comparing models without picking a specific cutoff.
  • Probability: It can be interpreted as the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative one.

Why it's important

  • It's a standard metric in fields like medical diagnosis and fraud detection because it balances sensitivity and specificity, providing a comprehensive view of a model's accuracy.





See: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc#exercise_check_your_understanding

Also see: https://www.youtube.com/watch?v=4jRBRDbJemM

Another Blog: https://milindai.blogspot.com/2025/12/roc-and-auc-explained.html

The ROC curve is a visual representation of model performance across all thresholds. The long version of the name, receiver operating characteristic, is a holdover from WWII radar detection.

The ROC curve is drawn by calculating the true positive rate (TPR) and false positive rate (FPR) at every possible threshold (in practice, at selected intervals), then graphing TPR over FPR. A perfect model, which at some threshold has a TPR of 1.0 and a FPR of 0.0, can be represented by either a point at (0, 1) if all other thresholds are ignored, or by the following:

Figure 1. ROC and AUC of a hypothetical perfect model: TPR (y-axis) against FPR (x-axis), with a line from (0,1) to (1,1).

The area under the ROC curve (AUC) represents the probability that the model, if given a randomly chosen positive and negative example, will rank the positive higher than the negative.

The perfect model above, containing a square with sides of length 1, has an area under the curve (AUC) of 1.0. This means there is a 100% probability that the model will correctly rank a randomly chosen positive example higher than a randomly chosen negative example. In other words, looking at the spread of data points below, AUC gives the probability that the model will place a randomly chosen square to the right of a randomly chosen circle, independent of where the threshold is set.

Figure 2. A spread of predictions for a binary classification model with AUC = 1.0, where all positive examples are ranked to the right of negative examples. AUC is the chance a randomly chosen square is positioned to the right of a randomly chosen circle.

In more concrete terms, a spam classifier with AUC of 1.0 always assigns a random spam email a higher probability of being spam than a random legitimate email. The actual classification of each email depends on the threshold that you choose.

For a binary classifier, a model that does exactly as well as random guesses or coin flips has a ROC that is a diagonal line from (0,0) to (1,1). The AUC is 0.5, representing a 50% probability of correctly ranking a random positive and negative example.

In the spam classifier example, a spam classifier with AUC of 0.5 assigns a random spam email a higher probability of being spam than a random legitimate email only half the time.

Figure 3. ROC and AUC of completely random guesses: TPR (y-axis) against FPR (x-axis), a diagonal line from (0,0) to (1,1).

AUC is a useful measure for comparing the performance of two different models, as long as the dataset is roughly balanced. The model with greater area under the curve is generally the better one.

Figure 4. ROC and AUC of two hypothetical models: (a) AUC = 0.65 and (b) AUC = 0.93. The curve on the right, with a greater AUC, represents the better of the two models.

The points on a ROC curve closest to (0,1) represent a range of the best-performing thresholds for the given model. As discussed in the Thresholds, Confusion matrix, and Choice of metric and tradeoffs sections, the threshold you choose depends on which metric is most important to the specific use case. Consider the points A, B, and C in the following diagram, each representing a threshold:

Figure 5. A ROC curve of AUC = 0.84 with three points on the convex part of the curve closest to (0,1), labeled A, B, C in order, each representing a threshold.

If false positives (false alarms) are highly costly, it may make sense to choose a threshold that gives a lower FPR, like the one at point A, even if TPR is reduced. Conversely, if false positives are cheap and false negatives (missed true positives) highly costly, the threshold for point C, which maximizes TPR, may be preferable. If the costs are roughly equivalent, point B may offer the best balance between TPR and FPR.

Here is the ROC curve for the data we have seen before:

See:

https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc

Receiver-operating characteristic curve (ROC)

Area under the ROC curve (AUC) 

Exercise: Check your understanding

In practice, ROC curves are much less regular than the illustrations given above. Which of the following models, represented by their ROC curve and AUC, has the best performance?

  • ROC curve that zig-zags up and to the right from (0,0) to (1,1). The curve has an AUC of 0.623.
  • ROC curve that arcs upward and then rightward from (0,0) to (1,1). The curve has an AUC of 0.77.
  • ROC curve that arcs rightward and then upward from (0,0) to (1,1). The curve has an AUC of 0.31.
  • ROC curve that is approximately a straight line from (0,0) to (1,1), with a few zig-zags. The curve has an AUC of 0.508.

Which of the following models performs worse than chance?

  • ROC curve that arcs rightward and then upward from (0,0) to (1,1). The curve has an AUC of 0.32.
  • ROC curve that is a diagonal straight line from (0,0) to (1,1). The curve has an AUC of 0.5.
  • ROC curve that is approximately a straight line from (0,0) to (1,1), with a few zig-zags. The curve has an AUC of 0.508.
  • ROC curve that is composed of two perpendicular lines: a vertical line from (0,0) to (0,1) and a horizontal line from (0,1) to (1,1). This curve has an AUC of 1.0.

Imagine a situation where it's better to allow some spam to reach the inbox than to send a business-critical email to the spam folder. You've trained a spam classifier for this situation where the positive class is spam and the negative class is not-spam. Which of the following points on the ROC curve for your classifier is preferable?

A ROC curve of AUC=0.84 showing three points on the convex part of the curve that are close to (0,1). Point A is at approximately (0.25, 0.75). Point B is at approximately (0.30, 0.90), and is the point that maximizes TPR while minimizing FPR. Point C is at approximately (0.4, 0.95).

  • Point A
  • Point B
  • Point C

