ML — Machine Learning
How do we handle data?
The data-handling steps below apply to both families of methods:
- Non-Neural Network Methods: traditional ML algorithms (e.g., SVM, Random Forest, Logistic Regression, KNN, Naive Bayes)
- Neural Network Methods (Deep Learning): multi-layered networks that learn complex patterns
Important Note:
Don't jump straight to Neural Networks or Deep Learning!
Always check first whether the problem can be solved with traditional machine learning models. Neural networks are powerful, but they add complexity, take longer to train, and need more data.
Use Neural Networks when:
- You have large amounts of data
- The problem involves images, audio, video, or text
- Traditional methods aren't giving good results
- The patterns are highly complex and non-linear
Stick with traditional ML when:
- You have limited data
- The problem is relatively simple
- You need interpretability/explainability
- You have limited computational resources
- Faster training and deployment are needed
Rule of thumb: Start simple, add complexity only when necessary.
Here are 5 famous non-neural network machine learning methods:
- Decision Trees / Random Forest: Tree-based models that split data based on feature thresholds. Random Forest combines many trees for better accuracy.
- Support Vector Machines (SVM): Finds the hyperplane that separates the classes with the maximum margin.
- K-Nearest Neighbors (KNN): Classifies a point by majority vote of its K closest training points.
- Naive Bayes: Probabilistic classifier based on Bayes' theorem that assumes the features are conditionally independent given the class.
- Linear/Logistic Regression: Linear models for predicting continuous values (linear) or class probabilities (logistic).
Bonus mentions: Gradient Boosting (XGBoost, LightGBM), K-Means Clustering, Principal Component Analysis (PCA)
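To make one of the methods above concrete, here is a minimal sketch of K-Nearest Neighbors in plain Python. The function name `knn_predict` and the toy data are assumptions made for illustration; in practice a library implementation would normally be used.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two small clusters.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

near_a = knn_predict(points, labels, (1.1, 1.0), k=3)
near_b = knn_predict(points, labels, (5.1, 5.0), k=3)
print(near_a, near_b)
```

There is no training step here: the "model" is just the stored data, which is part of why KNN is a reasonable first baseline on small datasets.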
Handling Data: Train-Validation-Test Split
We split the given data into two main parts (all percentages are of the full dataset):
1. Training + Validation Data (80%)
- a. Training Data (70%): the model learns from this
- b. Validation Data (10%): used for tuning hyperparameters and monitoring overfitting
2. Test Data (20%): completely unseen by the model
- Keep it hidden until final evaluation
- If you accidentally use it during training or tuning, that is a data leak (also called data leakage)
Purpose of Each Split
| Split | Purpose | When Used |
|---|---|---|
| Training | Model learns patterns/weights | During training |
| Validation | Tune hyperparameters, early stopping, model selection | During training (but model doesn't learn from it) |
| Test | Final unbiased evaluation | Only once, at the very end |
Example
Suppose you have 1,000 samples:
- Training: 700 samples → Model learns from this
- Validation: 100 samples → Check if model is overfitting, tune learning rate, number of layers, etc.
- Test: 200 samples → Final accuracy report (touch it only once!)
What is a Data Leak?
A data leak (data leakage) occurs when information from the test set, or from future data the model would not have at prediction time, accidentally influences the training process.
Examples of data leaks:
- Using test data to select features
- Normalizing data using mean/std of the entire dataset (including test)
- Tuning hyperparameters based on test set performance
Why it's bad: Your reported accuracy becomes artificially inflated and won't reflect real-world performance.
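The normalization leak listed above is easy to introduce by accident. The sketch below, using only the standard library and made-up numbers, contrasts statistics computed on the training split with statistics computed on the whole dataset. The helper `standardize` is a hypothetical name.

```python
import statistics

# Made-up feature values; the last two samples play the role of the test set.
values = [2.0, 4.0, 6.0, 8.0, 100.0, 120.0]
train, test = values[:4], values[4:]


def standardize(xs, mean, std):
    """Apply (x - mean) / std with statistics supplied by the caller."""
    return [(x - mean) / std for x in xs]


# Correct: compute the statistics from the training split only...
train_mean = statistics.mean(train)
train_std = statistics.pstdev(train)

# ...and reuse those same statistics for the test split.
train_scaled = standardize(train, train_mean, train_std)
test_scaled = standardize(test, train_mean, train_std)

# Leaky: statistics over train + test let test values shape the features
# that the model is trained on.
leaky_mean = statistics.mean(values)

print(train_mean, leaky_mean)  # 5.0 40.0
```

The two means differ by a wide margin because the test samples are very different from the training samples, which is exactly the information the model should not see before final evaluation.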
Code Example (Python/Scikit-learn)
from sklearn.model_selection import train_test_split
# First split: 80% train+val, 20% test
train_val, test = train_test_split(data, test_size=0.20, random_state=42)
# Second split: 70% train, 10% validation (from the 80%)
train, val = train_test_split(train_val, test_size=0.125, random_state=42)
# Note: 0.125 of 80% = 10% of total
# Now:
# train = 70%
# val = 10%
# test = 20%
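To confirm the arithmetic in the comments above without any library, here is a rough plain-Python sketch of the same two-step split on 1,000 sample indices. `three_way_split` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import random


def three_way_split(items, test_frac=0.20, val_frac_of_rest=0.125, seed=42):
    """Shuffle, carve off the test set first, then the validation set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    # 0.125 of the remaining 80% equals 10% of the total.
    n_val = round(len(rest) * val_frac_of_rest)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))  # 700 100 200
```

Each index lands in exactly one split, so no sample can leak from the test set into training.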
What is Forward Propagation and Backward Propagation?
Forward Propagation vs Backward Propagation
Forward Propagation
Definition: The process of passing input data through the network layer by layer to get an output (prediction). No learning happens here — we just compute the result.
What stays fixed:
- Weights
- Biases
- Hyperparameters
What we do:
- Input → Hidden Layers → Output
- Apply activation functions at each layer
- Get a prediction
Step-by-step process:
Input (X)
↓
[Layer 1] → Z1 = (W1 × X) + B1 → A1 = Activation(Z1)
↓
[Layer 2] → Z2 = (W2 × A1) + B2 → A2 = Activation(Z2)
↓
[Output] → Prediction (ŷ)
↓
Compare with actual (y) → Calculate Loss/Error
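The steps above can be sketched in plain Python for a tiny network with two inputs, two hidden neurons, and one output. The weights, the sigmoid activation, and the squared-error loss are all assumptions chosen for illustration; the original steps do not fix any of them.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(weights, inputs, biases):
    """Compute Z = (W x X) + B for one layer; `weights` has one row per neuron."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


# Arbitrary illustrative parameters (not trained values).
W1, B1 = [[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1]
W2, B2 = [[0.3, -0.6]], [0.05]

X = [1.0, 2.0]
Z1 = dense(W1, X, B1)            # layer 1 pre-activation
A1 = [sigmoid(z) for z in Z1]    # layer 1 activation
Z2 = dense(W2, A1, B2)           # layer 2 pre-activation
A2 = [sigmoid(z) for z in Z2]    # prediction (y-hat)

y = 1.0
loss = (A2[0] - y) ** 2          # squared error, chosen here for simplicity
print(Z1, A2, loss)
```

Nothing in this pass changes `W1`, `B1`, `W2`, or `B2`; they are only read.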
Example:
Suppose you're classifying images of cats vs dogs:
- Input: Pixel values of an image
- Forward pass: Image flows through the network
- Output: Probability [0.8 cat, 0.2 dog]
- Loss: Compare with actual label (say, cat = 1) → Calculate error
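As a worked number for the loss step, the sketch below assumes cross-entropy, a common choice for classification; the example above does not say which loss is used.

```python
import math

# Predicted probabilities from the example above and a one-hot label for "cat".
predicted = {"cat": 0.8, "dog": 0.2}
actual = {"cat": 1.0, "dog": 0.0}

# Cross-entropy: -sum(actual * log(predicted)); only the true class contributes.
loss = -sum(actual[c] * math.log(predicted[c]) for c in predicted)
print(round(loss, 4))  # 0.2231

# A less confident prediction for the true class gives a larger loss.
worse = -math.log(0.6)
```

A perfect prediction of 1.0 for "cat" would give a loss of 0, so 0.2231 reflects a prediction that is right but not fully confident.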
When is forward propagation used?
- During training (to get predictions before backprop)
- During validation (to check performance)
- During testing (final evaluation)
- During inference/production (real-world predictions)
Backward Propagation (Backpropagation)
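Definition: The process of calculating the gradient of the loss with respect to every weight and bias, working backwards from the output layer. The optimizer then uses those gradients to update the weights and biases so the error shrinks. This is where the network actually learns.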
What gets updated:
- Weights
- Biases
What stays fixed:
- Hyperparameters (learning rate, etc. — set before training)
Step-by-step process:
Loss/Error (from forward prop)
↓
Calculate gradient of loss w.r.t output layer weights (∂Loss/∂W)
↓
Propagate error backwards layer by layer (Chain Rule)
↓
Calculate gradients for each layer
↓
Update weights: W_new = W_old - (learning_rate × gradient)
↓
Update biases: B_new = B_old - (learning_rate × gradient)
Key concept — Chain Rule:
Backpropagation uses calculus (chain rule) to figure out how much each weight contributed to the error.
∂Loss/∂W1 = ∂Loss/∂A2 × ∂A2/∂Z2 × ∂Z2/∂A1 × ∂A1/∂Z1 × ∂Z1/∂W1
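The chain-rule product above can be checked numerically. The sketch below assumes a network with a single neuron per layer, sigmoid activations, and a squared-error loss, all chosen only to keep the arithmetic visible. It computes the gradient factor by factor and compares it with a finite-difference estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w1, b1, w2, b2, x, y):
    """Tiny two-layer network, one neuron per layer, squared-error loss."""
    z1 = w1 * x + b1
    a1 = sigmoid(z1)
    z2 = w2 * a1 + b2
    a2 = sigmoid(z2)
    return a1, a2, (a2 - y) ** 2


# Illustrative values only.
w1, b1, w2, b2, x, y = 0.5, 0.1, -0.3, 0.2, 1.5, 1.0
a1, a2, loss = forward(w1, b1, w2, b2, x, y)

# Chain rule, one factor per term in the formula above.
dloss_da2 = 2 * (a2 - y)
da2_dz2 = a2 * (1 - a2)      # derivative of sigmoid
dz2_da1 = w2
da1_dz1 = a1 * (1 - a1)
dz1_dw1 = x
grad_w1 = dloss_da2 * da2_dz2 * dz2_da1 * da1_dz1 * dz1_dw1

# Numerical check: nudge w1 slightly and measure how the loss changes.
eps = 1e-6
loss_plus = forward(w1 + eps, b1, w2, b2, x, y)[2]
loss_minus = forward(w1 - eps, b1, w2, b2, x, y)[2]
numeric_grad_w1 = (loss_plus - loss_minus) / (2 * eps)

# One gradient-descent step using the update rule above.
learning_rate = 0.1
w1_new = w1 - learning_rate * grad_w1
new_loss = forward(w1_new, b1, w2, b2, x, y)[2]
print(grad_w1, numeric_grad_w1, loss, new_loss)
```

The two gradient values should agree to several decimal places, and the loss after the update should be slightly lower than before.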
Example:
Continuing the cat/dog example:
- Forward prop predicted: [0.8 cat, 0.2 dog]
- Actual label: cat (1)
- Loss: Small error (0.8 is close to 1)
- Backprop: Calculate how to adjust weights so next time it predicts 0.85 or higher
When is backpropagation used?
- Only during training
- Never during validation, testing, or inference
Side-by-Side Comparison
| Aspect | Forward Propagation | Backward Propagation |
|---|---|---|
| Direction | Input → Output | Output → Input |
| Purpose | Get prediction | Learn from errors |
| Weights | Used (not changed) | Updated |
| Biases | Used (not changed) | Updated |
| Math | Matrix multiplication + activation | Gradients + chain rule |
| When used | Training, validation, testing, inference | Training only |
Simple Analogy
Forward Propagation: You take an exam and submit your answers. You get a score (prediction). You haven't learned anything yet — just attempted.
Backward Propagation: You review your mistakes, understand where you went wrong, and adjust your understanding (weights) so you do better next time.
One Training Epoch
One iteration is steps 1 to 4 on a single batch; one epoch is a full pass over all batches.
1. Forward Prop → Get prediction
2. Calculate Loss → How wrong were we?
3. Backward Prop → Calculate gradients
4. Update Weights → Adjust to reduce error
5. Repeat for all batches
Code Illustration (PyTorch)
# Assumes model, loss_function, optimizer, inputs, labels, and epochs are already defined
for epoch in range(epochs):
    # Forward Propagation
    predictions = model(inputs)
    loss = loss_function(predictions, labels)
    # Backward Propagation
    optimizer.zero_grad()  # Clear previous gradients
    loss.backward()        # Calculate gradients (backprop)
    optimizer.step()       # Update weights and biases
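The PyTorch loop above hides the gradient math and the batching. As a rough library-free sketch of the same cycle, the code below fits a one-weight linear model with explicit mini-batches. The synthetic data, learning rate, and batch size are assumptions picked for illustration.

```python
import random

# Synthetic data from y = 2x + 1 (made up for illustration).
xs = [i / 10 for i in range(20)]
data = [(x, 2.0 * x + 1.0) for x in xs]

w, b = 0.0, 0.0
learning_rate, batch_size, epochs = 0.1, 5, 200
rng = random.Random(0)


def mean_squared_error(w, b, samples):
    return sum((w * x + b - y) ** 2 for x, y in samples) / len(samples)


initial_loss = mean_squared_error(w, b, data)

for epoch in range(epochs):              # one epoch = one pass over all batches
    rng.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # 1. Forward prop: prediction error for each sample in the batch.
        errors = [(w * x + b) - y for x, y in batch]
        # 2-3. The loss is mean squared error; these are its gradients.
        grad_w = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
        grad_b = sum(2 * e for e in errors) / len(batch)
        # 4. Update step; each pass through this inner loop is one iteration.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

final_loss = mean_squared_error(w, b, data)
print(round(w, 2), round(b, 2), final_loss < initial_loss)
```

With 20 samples and a batch size of 5, each epoch contains 4 iterations, so 200 epochs perform 800 weight updates.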
Dropout in Neural Networks
Definition: Dropout is a regularization technique where we randomly "turn off" (deactivate) a percentage of neurons during training to prevent overfitting.
How it works
During each training iteration:
- Randomly select neurons to deactivate (e.g., 50%)
- These neurons don't participate in forward or backward propagation
- Different neurons are dropped in each iteration
Without Dropout (all neurons active):
O -- O -- O
O -- O -- O
O -- O -- O
With Dropout (50%, X = dropped in this iteration):
O -- X -- O
X -- O -- X
O -- O -- O
Why use Dropout?
| Problem | How Dropout Helps |
|---|---|
| Overfitting | Forces network to not rely on specific neurons |
| Co-adaptation | Prevents neurons from depending too much on each other |
| Generalization | Creates a more robust model that works on unseen data |
Simple Analogy
Imagine a team project where the same 2 people do all the work. If they're absent, the team fails.
Dropout is like randomly making team members "absent" during practice — forcing everyone to learn the work. Now the team is stronger and doesn't depend on specific individuals.
Key Points
- Applied only during training (turned off during validation/testing)
- Common dropout rates: 20%–50%
- Dropout rate = 0.5 means each neuron is dropped independently with probability 0.5, so about 50% of neurons are dropped in each iteration
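The points above can be sketched as a small plain-Python function. It uses the "inverted dropout" convention, in which surviving activations are scaled by 1 / (1 - rate) during training; PyTorch's `nn.Dropout` applies the same scaling. The function name `dropout` and its arguments are assumptions for this illustration.

```python
import random


def dropout(activations, rate, training, rng=random):
    """Inverted dropout: zero each value with probability `rate` while training,
    and scale the survivors by 1 / (1 - rate) so the expected sum is unchanged."""
    if not training or rate == 0.0:
        return list(activations)     # evaluation: every neuron stays active
    keep_prob = 1.0 - rate
    return [
        a / keep_prob if rng.random() < keep_prob else 0.0
        for a in activations
    ]


rng = random.Random(0)
layer_output = [1.0] * 10

train_out = dropout(layer_output, rate=0.5, training=True, rng=rng)
eval_out = dropout(layer_output, rate=0.5, training=False, rng=rng)
print(train_out)
print(eval_out)
```

Because the scaling happens during training, nothing needs to be rescaled at validation or test time: the layer simply passes its input through.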
Code Example (PyTorch)
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),
    nn.Dropout(0.5),  # 50% dropout
    nn.Linear(64, 10),
)
model.train()  # Dropout ON
model.eval()   # Dropout OFF
When to use Dropout?
- Model is overfitting (training accuracy high, validation accuracy low)
- You have a large network with many parameters
- Limited training data
Dropout is ONLY applied during training.
| Phase | Dropout |
|---|---|
| Training | ON ✓ |
| Validation | OFF ✗ |
| Testing | OFF ✗ |
Why?
- Training: Dropout randomly disables neurons to prevent overfitting, which forces the network to learn more robust features.
- Validation/Testing: You want the full model (all neurons active) to make the best possible predictions. No randomness, just evaluation.
Simple analogy:
Think of it like a sports team practicing with a handicap (e.g., playing with fewer players) to get stronger. But during the actual game (validation/test), you use all your players.
20 Interview Quiz Questions — Machine Learning Fundamentals
Data Handling & Splitting
1. What is the typical split ratio for training, validation, and test data?
2. What is the purpose of the validation set, and how does it differ from the test set?
3. What is a "Data Leak" in machine learning, and why is it problematic?
4. Give two examples of how data leakage can accidentally occur.
5. When should you use the test data — and how many times?
Neural Networks vs Traditional ML
6. Name 5 non-neural network machine learning algorithms.
7. Why shouldn't you jump straight to neural networks for every problem?
8. When is it appropriate to use traditional ML methods over deep learning?
9. When should you prefer neural networks over traditional ML?
10. What are the disadvantages of using neural networks compared to traditional ML?
Forward Propagation
11. What is forward propagation, and what is its purpose?
12. During forward propagation, are the weights and biases updated? Why or why not?
13. In which phases is forward propagation used — training, validation, testing, or inference?
14. What is the formula for computing a layer's output in forward propagation?
Backward Propagation
15. What is backpropagation, and what does it accomplish?
16. Which parameters are updated during backpropagation?
17. What mathematical concept is central to backpropagation for calculating gradients?
18. Is backpropagation used during validation or testing? Explain.
19. Write the weight update formula used in backpropagation.
Conceptual
20. Explain the difference between forward propagation and backward propagation using a simple analogy.
Answer Key
| Q | Answer |
|---|---|
| 1 | 70% training, 10% validation, 20% test |
| 2 | Validation is for tuning hyperparameters and monitoring overfitting; test is for final unbiased evaluation |
| 3 | When test/future data influences training, causing artificially inflated accuracy |
| 4 | Using test data for feature selection; normalizing using entire dataset's mean/std |
| 5 | Only once, at the very end for final evaluation |
| 6 | Decision Trees, Random Forest, SVM, KNN, Naive Bayes, Logistic Regression, XGBoost |
| 7 | Neural networks need more data, time, resources, and are less interpretable |
| 8 | Limited data, simple problems, need interpretability, limited compute |
| 9 | Large data, images/audio/text, complex non-linear patterns |
| 10 | Longer training, more data required, less explainability, higher compute cost |
| 11 | Passing input through the network layer by layer to get a prediction |
| 12 | No — weights and biases are only used, not updated |
| 13 | All four: training, validation, testing, and inference |
| 14 | Z = (W × X) + B, then A = Activation(Z) |
| 15 | Calculating gradients and updating weights/biases to minimize error |
| 16 | Weights and biases |
| 17 | Chain rule (calculus) |
| 18 | No — only during training |
| 19 | W_new = W_old - (learning_rate × gradient) |
| 20 | Forward prop = taking an exam; Backward prop = reviewing mistakes and learning from them |