Topics:
A. Classification Metrics
B. Class Imbalance
A. Classification Metrics
Precision and recall are key metrics used to evaluate a machine learning model's performance, calculated using a confusion matrix. Precision measures the ratio of correctly predicted positive observations to the total number of positive predictions, answering "Of all the times the model predicted 'yes,' how often was it correct?". Recall measures the ratio of correctly predicted positive observations to all actual positive observations, answering "Of all the actual positive cases, how many did the model find?".
Let's cover these topics:
- "The Building Blocks: Understanding TP, TN, FP, and FN"
- Start with the foundation
- Use real examples (email spam, medical tests)
- "The Confusion Matrix: Your Performance Dashboard"
- Visual representation of the building blocks
- How to read and interpret it
- "Accuracy: The Misleading Metric"
- Why everyone starts here
- Why it often fails (the 99% accuracy trap)
- "Precision: When False Alarms Are Costly"
- "Of all my positive predictions, how many were correct?"
- When to optimize for precision
- "Recall (Sensitivity): When Missing Cases Is Dangerous"
- "Of all actual positives, how many did I catch?"
- When to optimize for recall
- "F1 Score: The Balanced Compromise"
- Harmonic mean explained simply
- When F1 is and isn't appropriate
- "ROC Curve & AUC: The Big Picture"
- Performance across all thresholds
- What AUC really means
- "Precision-Recall Curve: The Better Choice for Imbalanced Data"
- When to use instead of ROC
- Real-world applications
Important:
F1 Score
- Harmonic mean of precision and recall
- When and why to use it
- F-beta score variations
Accuracy (and its limitations)
- Why accuracy can be misleading
- The accuracy paradox with imbalanced datasets
True/False Positives and Negatives (TP, TN, FP, FN)
- Clear definitions with examples
- How they build the confusion matrix
ROC Curve and AUC
- Receiver Operating Characteristic curve
- Area Under the Curve
- Visual interpretation
Precision-Recall Curve
- When it's better than ROC
- Especially for imbalanced datasets
Important Supporting Topics:
Class Imbalance Problem
- Why it matters for metrics
- Which metrics are robust to imbalance
Threshold Tuning
- Moving beyond default 0.5
- Trading off precision vs recall
Multi-class Classification Metrics
- Macro vs Micro vs Weighted averaging
- One-vs-Rest approach
Specificity and Sensitivity
- Medical/diagnostic terminology
- Relationship to recall and precision
Real-World Examples
- Medical diagnosis (cancer detection)
- Spam filtering
- Fraud detection
- Each showing different metric priorities
Advanced But Useful:
Matthews Correlation Coefficient (MCC)
- Balanced measure even for imbalanced classes
Cohen's Kappa
- Agreement beyond chance
Cost-Sensitive Learning
- When false positives/negatives have different costs
Visual Elements to Include:
- Confusion matrix heatmaps
- Precision-Recall trade-off graphs
- ROC curve comparisons
- Interactive threshold slider (if possible)
Common Pitfalls Section:
- Why 99% accuracy might be terrible
- When to optimize for precision vs recall
- The base rate fallacy
B. Class Imbalance
What is Class Imbalance?
Class imbalance occurs when one class has significantly more samples than another in your dataset.
Example:
- Credit card fraud: 99.8% legitimate transactions, 0.2% fraud
- Disease detection: 95% healthy patients, 5% diseased
- Email classification: 90% normal emails, 10% spam
Why is it a Problem?
The Accuracy Trap:
Dataset: 990 normal transactions, 10 fraudulent
Model predicts: "Everything is normal"
Accuracy: 99%!
But... caught 0 frauds
The model looks great (99% accurate) but is completely useless - it never catches fraud!
Problems Caused by Imbalance:
- Model bias - Algorithms favor the majority class
- Poor minority class detection - Rare events get ignored
- Misleading metrics - High accuracy hides poor performance
- Learning difficulty - Not enough minority examples to learn patterns
Techniques to Handle Class Imbalance
1. Resampling Techniques
A. Oversampling (Increase Minority Class)
- Random Oversampling: Duplicate minority samples
- SMOTE: Generate synthetic samples (see detailed explanation below)
- ADASYN: Adaptive synthetic sampling
B. Undersampling (Reduce Majority Class)
- Random Undersampling: Remove majority samples randomly
- Tomek Links: Remove borderline majority samples
- Edited Nearest Neighbors: Remove noisy samples
C. Combination Methods
- SMOTEENN: SMOTE + Edited Nearest Neighbors
- SMOTETomek: SMOTE + Tomek Links
2. Algorithm-Level Approaches
Class Weight Adjustment:
```python
from sklearn.linear_model import LogisticRegression

# class_weight='balanced' automatically adjusts weights inversely
# proportional to class frequencies in the training data
model = LogisticRegression(class_weight='balanced')
```
Cost-Sensitive Learning:
- Assign higher penalty for misclassifying minority class
- Custom loss functions
3. Ensemble Methods
- BalancedRandomForest: Balances each bootstrap sample
- EasyEnsemble: Multiple undersampled subsets
- RUSBoost: Boosting with undersampling
4. Metric Selection
Instead of accuracy, use:
- Precision, Recall, F1-Score
- Area Under Precision-Recall Curve
- Matthews Correlation Coefficient
- Balanced Accuracy
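These alternatives are all available in scikit-learn. A minimal sketch (toy data invented for illustration) showing how accuracy flatters an always-negative model while the other metrics expose it:

```python
# Toy illustration (invented data): 99 negatives, 1 positive, and a model
# that always predicts the majority class.
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, matthews_corrcoef)

y_true = [0] * 99 + [1]      # 1% positive class
y_pred = [0] * 100           # "predict everything as negative"

print(accuracy_score(y_true, y_pred))              # 0.99, looks great
print(f1_score(y_true, y_pred, zero_division=0))   # 0.0, reveals the failure
print(balanced_accuracy_score(y_true, y_pred))     # 0.5, no better than chance
print(matthews_corrcoef(y_true, y_pred))           # 0.0, no correlation at all
```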
5. Threshold Moving
- Adjust decision threshold (not always 0.5)
- Optimize for business objectives
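A hedged sketch of threshold moving, on a synthetic dataset with an assumed 90/10 class split: sweep the probability cutoff and watch precision and recall trade off.

```python
# Sketch of threshold moving on a synthetic imbalanced dataset.
# The 0.9/0.1 class split and the candidate thresholds are assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
proba = model.predict_proba(X)[:, 1]     # probability of the positive class

for threshold in (0.5, 0.3, 0.1):        # default cutoff, then two lower ones
    y_pred = (proba >= threshold).astype(int)
    print(threshold,
          round(precision_score(y, y_pred), 2),
          round(recall_score(y, y_pred), 2))
# Lowering the threshold predicts "positive" more often: recall rises
# while precision tends to fall.
```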
SMOTE (Synthetic Minority Over-sampling Technique)
What is SMOTE?
SMOTE creates synthetic samples of the minority class rather than just duplicating existing ones.
How SMOTE Works:
Step-by-step Process:
1. Select a minority sample
   - Point A: [age=25, income=30K]
2. Find its k nearest minority neighbors (typically k=5)
   - Neighbor B: [age=27, income=32K]
   - Neighbor C: [age=23, income=28K]
3. Randomly choose one neighbor
   - Selected: Neighbor B
4. Create a synthetic sample along the line between them
   - Random factor λ = 0.7
   - Synthetic = A + λ × (B − A)
   - New point: [age=26.4, income=31.4K]
5. Repeat until the desired balance is reached
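The interpolation step can be sketched in a few lines of NumPy, using the numbers from the example above (this is only the synthetic-point formula, not the full SMOTE algorithm):

```python
# The synthetic-point formula only (not the full SMOTE algorithm),
# using the numbers from the example above.
import numpy as np

A = np.array([25.0, 30_000.0])   # minority sample: [age, income]
B = np.array([27.0, 32_000.0])   # one of its k nearest minority neighbors

lam = 0.7                        # random factor, drawn uniformly from [0, 1]
synthetic = A + lam * (B - A)
print(synthetic)                 # age 26.4, income 31400 (i.e. 31.4K)
```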
Visual Example:
Before SMOTE: After SMOTE:
○ ○ ○ ○ ○ ○ ○ ○ ○ ○
○ ○ ○ ○ ○ ○ ○ ○ ○ ○
● ● ◆ ●
● ◆ ● ◆
◆ ◆
○ = Majority class
● = Original minority
◆ = Synthetic samples
SMOTE Code Example:
```python
from imblearn.over_sampling import SMOTE

# X_train, y_train: original imbalanced data (990 normal, 10 fraud)
smote = SMOTE(sampling_strategy='auto', k_neighbors=5)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
# Now balanced: 990 normal, 990 fraud (synthetic samples included)
```
SMOTE Advantages:
- Creates diverse synthetic samples
- Reduces overfitting vs simple duplication
- Helps decision boundary
SMOTE Limitations:
- Can create noisy samples if minority samples are outliers
- Doesn't consider majority class distribution
- May not work well with high-dimensional data
SMOTE Variants:
- BorderlineSMOTE: Focus on borderline minority samples
- SVMSMOTE: Uses SVM to find support vectors
- KMeansSMOTE: Clusters before applying SMOTE
- SMOTE-NC: Handles categorical features
Best Practices for Imbalanced Data
Recommended Workflow:
- Start with proper metrics (not accuracy)
- Try algorithm-level solutions first (class weights)
- Experiment with resampling if needed
- Consider ensemble methods for best results
- Always validate on original distribution
Common Mistakes to Avoid:
- Don't apply SMOTE before split - Apply only to training data
- Don't evaluate on balanced data - Test on original distribution
- Don't ignore domain knowledge - Sometimes imbalance reflects reality
- Don't balance unnecessarily - Slight imbalance (60:40) often okay
Quick Decision Guide:
| Imbalance Ratio | Recommended Approach |
|---|---|
| < 1:3 | Often no action needed |
| 1:3 to 1:10 | Class weights or simple resampling |
| 1:10 to 1:100 | SMOTE + ensemble methods |
| > 1:100 | Anomaly detection approach |
Remember: The goal isn't perfect balance, but better minority class detection while maintaining overall performance!
C. Understanding Binary Classification Outcomes: False Negatives, False Positives, and More
True Positive: The model predicts Positive and the reality is Positive. It correctly identified the specific condition or target you wanted.
False Positive (Type I Error): The model predicts Positive, but reality is Negative. A "False Alarm" where it incorrectly flags something harmless as the target.
True Negative: The model predicts Negative and the reality is Negative. It correctly recognized that the target condition was not present.
False Negative (Type II Error): The model predicts Negative, but reality is Positive. A "Miss" where the model completely failed to catch the target condition.
Think of it Like a Metal Detector at School:
Your school has a metal detector at the entrance to catch weapons. Every backpack gets scanned:
- Positive (1) = Weapon detected (the dangerous thing we're looking for)
- Negative (0) = Safe backpack (normal school supplies)
The Four Possible Outcomes:
1. TRUE POSITIVE - "Good Catch!"
- Detector said: "Weapon!"
- Reality: Kid HAD a knife
- Result: Dangerous item stopped, school stays safe ✓
2. TRUE NEGATIVE - "Clear to Go"
- Detector said: "All clear"
- Reality: Just books and lunch
- Result: Student walks through normally ✓
3. FALSE POSITIVE - "False Alarm!"
- Detector said: "Weapon!"
- Reality: Metal ruler in geometry set
- Result: Embarrassing bag search, late to class
4. FALSE NEGATIVE - "Totally Missed It"
- Detector said: "All clear"
- Reality: Ceramic knife went through
- Result: Dangerous item got into school
The Easy Memory Trick:
Second word = What the detector beeped:
- Positive = BEEP! (Alert!)
- Negative = Silence (All clear)
First word = Was it right?
- True = Correct call ✓
- False = Wrong call ✗
Why Different Mistakes Matter:
Medical Test for Strep Throat:
- False Negative = Send sick kid to school (infects everyone) 🤒
- False Positive = Take antibiotics unnecessarily (not ideal but safer)
Face ID on Your Phone:
- False Negative = Won't unlock for YOU (annoying!)
- False Positive = Unlocks for stranger (security breach!)
The Confusion Matrix:
It's just a 2×2 box showing these four outcomes - like a report card for your model showing where it gets "confused" between real threats and false alarms!
The goal? Maximize the "True" ones and minimize the "False" ones - but sometimes one type of mistake is WAY worse than the other!
Confusion Matrix: A 2×2 grid showing four possible outcomes when predicting binary (yes/no) results:
- True Positive (TP): Correctly predicted the positive class
- True Negative (TN): Correctly predicted the negative class
- False Positive (FP): Wrongly predicted positive (false alarm)
- False Negative (FN): Wrongly predicted negative (missed it)
Accuracy: (TP + TN) / Total predictions - the percentage you got right overall.
Precision: TP / (TP + FP) - Of everything you called positive, what percentage was actually positive? Answers: "How trustworthy are my positive predictions?"
Recall (Sensitivity): TP / (TP + FN) - Of all actual positives, what percentage did you catch? Answers: "Did I find all the important cases?"
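The three definitions above can be checked in a few lines; the tiny label arrays here are made up for illustration:

```python
# Tiny invented label set; counts and metrics computed two ways.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)           # 3 3 1 1
print(tp / (tp + fp))           # precision = 0.75
print(tp / (tp + fn))           # recall    = 0.75
assert precision_score(y_true, y_pred) == tp / (tp + fp)
assert recall_score(y_true, y_pred) == tp / (tp + fn)
```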
The 99% Accuracy Trap:
Imagine detecting credit card fraud where only 1% of transactions are fraudulent. A model that predicts "not fraud" for everything achieves 99% accuracy but catches zero fraud - completely useless!
This model has:
- Accuracy: 99% ✓ (looks amazing!)
- Precision: Undefined (never predicts fraud)
- Recall: 0% ✗ (catches no fraud)
The lesson: In imbalanced datasets, accuracy hides failure. Precision tells you about false alarms, while recall reveals what you're missing. Always check all three metrics - especially when one class is rare but important.
What is F1 Score?
F1 Score is the harmonic mean of Precision and Recall:
Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
Example Data:
- All 20 TP, TN, FP, FN classifications are shown
- Counts: TP=6, TN=7, FP=3, FN=4
- Accuracy: (6 + 7) / 20 = 13/20 = 0.65 (65%)
- Precision: 6 / (6 + 3) = 6/9 ≈ 0.67 (67%)
- Recall: 6 / (6 + 4) = 6/10 = 0.60 (60%)
F1 Score: 2 × (0.67 × 0.60) / (0.67 + 0.60) = 0.80 / 1.27 ≈ 0.63
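The arithmetic above, reproduced in plain Python from the four counts:

```python
# Reproducing the worked example from the four counts.
tp, tn, fp, fn = 6, 7, 3, 4

accuracy  = (tp + tn) / (tp + tn + fp + fn)     # 13/20 = 0.65
precision = tp / (tp + fp)                      # 6/9  ≈ 0.667
recall    = tp / (tp + fn)                      # 6/10 = 0.60
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.65 0.67 0.6 0.63
```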
Tips for color coding (optional but helpful):
- Use green for correct outcomes (TP/TN) and red for errors (FP/FN)
- Add intuitive explanations:
- Precision (67%): "When we say someone has the disease, we're right 2 out of 3 times"
- Recall (60%): "We catch 6 out of 10 people who actually have the disease"
- F1 (63%): "Overall balance between precision and recall"
- Real-world interpretation:
- 4 sick patients were sent home (FN) - dangerous!
- 3 healthy patients were told they're sick (FP) - stressful but safer
The 65% accuracy looks "okay" but missing 40% of sick patients (recall=60%) could be life-threatening in real medical scenarios.
Significance of F1 Score?
F1 Score is the harmonic mean of Precision and Recall:
Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
Why Not Just Average Them?
The harmonic mean punishes extreme values. Consider two models:
Model A:
- Precision: 100%, Recall: 10%
- Simple Average: (100 + 10) / 2 = 55%
- F1 Score: 2 × (1.0 × 0.1) / (1.0 + 0.1) = 18%
Model B:
- Precision: 60%, Recall: 60%
- Simple Average: 60%
- F1 Score: 60%
Model A catches almost nothing despite perfect precision - F1 reveals this weakness!
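The comparison above in a couple of lines (the two models are the hypothetical ones just described):

```python
# Simple average vs harmonic mean for the two hypothetical models.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Model A: perfect precision, terrible recall
print(round((1.0 + 0.1) / 2, 2), round(f1(1.0, 0.1), 2))   # 0.55 0.18
# Model B: balanced
print(round((0.6 + 0.6) / 2, 2), round(f1(0.6, 0.6), 2))   # 0.6 0.6
```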
When to Use F1:
F1 is ideal when:
- You need balance between precision and recall
- False positives and false negatives are equally bad
- You have imbalanced classes
Real Example: In spam detection:
- High Precision only = Few false alarms but miss lots of spam
- High Recall only = Catch all spam but many false alarms
- High F1 = Good balance - catches most spam with minimal false alarms
Think of F1 as: "How well-rounded is my model?" A score of 0.8+ means strong performance in BOTH precision and recall, not just one.
F1-Score harmonically balances precision and recall into a single metric. When you can't afford to optimize just one, F1 provides the sweet spot. It penalizes extreme imbalances - a model with perfect precision but terrible recall still gets a poor F1.
F1 Score Range
The Range: 0 to 1 (or 0% to 100%)
- F1 = 0: Worst possible - either precision or recall (or both) is zero
- F1 = 1: Perfect score - both precision AND recall are perfect (100%)
Interpreting F1 Scores:
| F1 Score | Interpretation | Real-World Meaning |
|---|---|---|
| 0.0 - 0.3 | Poor | Model is failing badly |
| 0.3 - 0.5 | Below Average | Needs significant improvement |
| 0.5 - 0.7 | Average | Acceptable for some use cases |
| 0.7 - 0.8 | Good | Solid performance |
| 0.8 - 0.9 | Very Good | Strong model |
| 0.9 - 1.0 | Excellent | Outstanding (rare in practice) |
Key Points:
F1 always lies between precision and recall, pulled toward the lower value
- If Precision = 90% and Recall = 60%, F1 = 2 × 0.9 × 0.6 / 1.5 = 72%, below the simple average of 75%
F1 = 0 happens when:
- Model predicts all negative (Precision undefined, Recall = 0)
- Model predicts all positive for negative-only data (Precision = 0)
F1 = 1 requires:
- Precision = 100% (no false positives)
- Recall = 100% (no false negatives)
- Practically impossible in real-world problems
Typical good scores: Most production models achieve F1 scores between 0.6-0.85 depending on the problem difficulty.
All Three Metrics [Recall, Precision, F1 Score] Have the Same "Best" Value: 1 (or 100%)
Best F1 Score = 1
- Means both precision and recall are perfect
- Extremely rare in practice
Best Recall = 1
- You caught ALL positive cases (no false negatives)
- Example: Found all 100 cancer patients out of 100
Best Precision = 1
- ALL your positive predictions were correct (no false positives)
- Example: Every time you said "cancer," you were right
But Here's the Important Reality:
Getting all three to 1 is nearly impossible because:
- Perfect Recall (1.0) often means being overly aggressive - calling many things positive to catch everything, which hurts precision
- Perfect Precision (1.0) often means being overly conservative - only calling the super obvious cases positive, which hurts recall
- Perfect F1 (1.0) requires BOTH to be perfect simultaneously
Real-World "Good" Scores:
- Recall: 0.8-0.9 is excellent
- Precision: 0.8-0.9 is excellent
- F1: 0.7-0.85 is very good
The Trade-off:
Usually, you optimize for one based on your use case:
- Medical screening: Maximize recall (catch all diseases)
- Spam filtering: Balance both (F1)
- Legal document classification: Maximize precision (avoid false accusations)
So yes, mathematically 1 is best for all three, but practically, you rarely achieve it!
ROC Curves and AUC Explained
What This Graph Shows:
ROC (Receiver Operating Characteristic) Curve plots:
- X-axis: False Positive Rate (1 - Specificity) = FP/(FP+TN)
- Y-axis: True Positive Rate (Recall/Sensitivity) = TP/(TP+FN)
Understanding the Lines:
- Diagonal dotted line = Random guessing (coin flip)
  - AUC = 0.5
  - Useless model
- Blue curve = "Better model"
  - AUC = 0.9216
  - Excellent performance
- Orange curve = "Worse model"
  - AUC = 0.9062
  - Still very good, but slightly worse
What AUC (Area Under Curve) Means:
- AUC = 1.0: Perfect classifier
- AUC = 0.9-1.0: Excellent (both models here!)
- AUC = 0.8-0.9: Good
- AUC = 0.7-0.8: Acceptable
- AUC = 0.5: No better than random
- AUC < 0.5: Worse than random (but flip predictions!)
Key Insights:
- The curves show ALL possible thresholds, not just 0.5
  - Each point = a different threshold setting
  - Moving right = lower threshold (more positive predictions)
- Closer to the top-left corner = better
  - Top-left = 100% TPR, 0% FPR (perfect)
  - The blue curve reaches higher faster
- Why AUC matters:
  - A single number for comparing models
  - Threshold-independent (tests all cutoffs)
  - Works for imbalanced datasets
Practical meaning: The blue model (0.9216) correctly ranks a random positive example higher than a random negative example 92.16% of the time!
ROC Curves and AUC
The Core Problem
When you have a binary classifier that outputs probabilities (say, predicting whether an email is spam), you need to choose a threshold to convert those probabilities into yes/no decisions. At threshold 0.5, anything above becomes "positive." But what if you lower it to 0.3? You'll catch more true positives but also more false positives.
The ROC curve captures this entire trade-off in one picture.
Building the ROC Curve
For any threshold, you can compute two rates:
True Positive Rate (Sensitivity/Recall): $$TPR = \frac{TP}{TP + FN}$$ "Of all actual positives, how many did we catch?"
False Positive Rate: $$FPR = \frac{FP}{FP + TN}$$ "Of all actual negatives, how many did we incorrectly flag?"
The ROC curve plots TPR (y-axis) vs FPR (x-axis) as you sweep the threshold from 1.0 down to 0.0.
Interpreting the Curve
- Bottom-left corner (0,0): Threshold = 1.0. You predict nothing as positive. TPR = 0, FPR = 0.
- Top-right corner (1,1): Threshold = 0.0. You predict everything as positive. TPR = 1, FPR = 1.
- Top-left corner (0,1): The ideal point—perfect classification.
A diagonal line from (0,0) to (1,1) represents a random classifier (coin flip). Any useful model should bow toward the top-left, away from this diagonal.
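The sweep described above can be traced by hand; the labels and scores below are invented for illustration:

```python
# Tracing an ROC curve by hand: for each cutoff, compute TPR and FPR.
# Labels and scores are invented for illustration.
import numpy as np

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.3, 0.6, 0.8, 0.4, 0.7, 0.9, 0.95])

for threshold in (1.01, 0.75, 0.5, 0.2, 0.0):
    y_pred = scores >= threshold
    tpr = y_pred[y_true == 1].mean()   # TP / (TP + FN)
    fpr = y_pred[y_true == 0].mean()   # FP / (FP + TN)
    print(threshold, round(fpr, 2), round(tpr, 2))
# Sweeping the threshold from above 1.0 down to 0.0 walks the curve
# from (0, 0) to (1, 1).
```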
AUC: Area Under the Curve
The AUC summarizes the entire ROC curve into a single number between 0 and 1.
Probabilistic interpretation: If you randomly pick one positive and one negative sample, the AUC equals the probability that your model ranks the positive sample higher than the negative one.
| AUC Value | Interpretation |
|---|---|
| 1.0 | Perfect classifier |
| 0.9–1.0 | Excellent |
| 0.8–0.9 | Good |
| 0.7–0.8 | Fair |
| 0.5 | Random guessing |
| < 0.5 | Worse than random (predictions inverted) |
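The probabilistic interpretation can be verified by brute force over all (positive, negative) pairs; the scores below are made up:

```python
# Brute-force check of the pairwise-ranking interpretation of AUC
# against sklearn, on made-up scores.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1])
scores = np.array([0.2, 0.4, 0.6, 0.5, 0.9])

pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])   # fraction of correctly ranked pairs
print(pairwise)                                   # 5/6 ≈ 0.833
print(roc_auc_score(y_true, scores))              # matches
```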
Why Use ROC-AUC?
- Threshold-independent: Evaluates the model's ranking ability across all possible operating points.
- Class imbalance resilience: Unlike accuracy, ROC-AUC isn't fooled by imbalanced datasets. If 99% of emails are non-spam, a model that always predicts "not spam" gets 99% accuracy but an AUC of 0.5.
- Comparing models: Two models with different optimal thresholds can be directly compared via AUC.
When ROC-AUC Falls Short
- Severe class imbalance: When negatives vastly outnumber positives, even a small FPR can mean many false alarms in absolute terms. Precision-Recall curves are often better here.
- Cost-sensitive applications: If the costs of false positives and false negatives differ greatly, you may care more about specific regions of the curve than the overall area.
Quick Example
Imagine a disease screening model:
- At threshold 0.8: Catches 60% of sick patients (TPR=0.6), flags 5% of healthy ones (FPR=0.05)
- At threshold 0.3: Catches 95% of sick patients (TPR=0.95), but flags 30% of healthy ones (FPR=0.30)
The ROC curve shows you this entire spectrum, letting clinicians choose based on whether missing a case or unnecessary testing is more costly.
- ROC Curve: A graph that visualizes a model's performance by plotting Sensitivity (True Positive Rate, TPR) on the y-axis against 1-Specificity (False Positive Rate, FPR) on the x-axis, at different classification thresholds.
- AUC (Area Under the Curve): The area under the ROC curve, giving a single value between 0 and 1.
- Discriminative Power: It quantifies how well the model can tell the difference between positive and negative classes.
- Threshold Independence: It provides a single score that summarizes performance across all possible thresholds, making it useful for comparing models without picking a specific cutoff.
- Probability: It can be interpreted as the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative one.
- It's a standard metric in fields like medical diagnosis and fraud detection because it balances sensitivity and specificity, providing a comprehensive view of a model's accuracy.
Receiver-operating characteristic curve (ROC)
See: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc#exercise_check_your_understanding
Also see: https://www.youtube.com/watch?v=4jRBRDbJemM
Another Blog: https://milindai.blogspot.com/2025/12/roc-and-auc-explained.html
The ROC curve is a visual representation of model performance across all thresholds. The long version of the name, receiver operating characteristic, is a holdover from WWII radar detection.
The ROC curve is drawn by calculating the true positive rate (TPR) and false positive rate (FPR) at every possible threshold (in practice, at selected intervals), then graphing TPR over FPR. A perfect model, which at some threshold has a TPR of 1.0 and a FPR of 0.0, can be represented by either a point at (0, 1) if all other thresholds are ignored, or by the following:

Area under the curve (AUC)
The area under the ROC curve (AUC) represents the probability that the model, if given a randomly chosen positive and negative example, will rank the positive higher than the negative.
The perfect model above, containing a square with sides of length 1, has an area under the curve (AUC) of 1.0. This means there is a 100% probability that the model will correctly rank a randomly chosen positive example higher than a randomly chosen negative example. In other words, looking at the spread of data points below, AUC gives the probability that the model will place a randomly chosen square to the right of a randomly chosen circle, independent of where the threshold is set.

In more concrete terms, a spam classifier with AUC of 1.0 always assigns a random spam email a higher probability of being spam than a random legitimate email. The actual classification of each email depends on the threshold that you choose.
For a binary classifier, a model that does exactly as well as random guesses or coin flips has a ROC that is a diagonal line from (0,0) to (1,1). The AUC is 0.5, representing a 50% probability of correctly ranking a random positive and negative example.
In the spam classifier example, a spam classifier with AUC of 0.5 assigns a random spam email a higher probability of being spam than a random legitimate email only half the time.

AUC and ROC for choosing model and threshold
AUC is a useful measure for comparing the performance of two different models, as long as the dataset is roughly balanced. The model with greater area under the curve is generally the better one.

The points on a ROC curve closest to (0,1) represent a range of the best-performing thresholds for the given model. As discussed in the Thresholds, Confusion matrix and Choice of metric and tradeoffs sections, the threshold you choose depends on which metric is most important to the specific use case. Consider the points A, B, and C in the following diagram, each representing a threshold:

If false positives (false alarms) are highly costly, it may make sense to choose a threshold that gives a lower FPR, like the one at point A, even if TPR is reduced. Conversely, if false positives are cheap and false negatives (missed true positives) highly costly, the threshold for point C, which maximizes TPR, may be preferable. If the costs are roughly equivalent, point B may offer the best balance between TPR and FPR.
For the ROC curve of the data we have seen before, see:
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
- Receiver-operating characteristic curve (ROC)
- Area under the ROC curve (AUC)
- Exercise: Check your understanding




