
How to Handle Training Data for ML Models

ML — Machine Learning

How do we handle data?

1. Don't jump straight to Neural Networks or Deep Learning!
2. How to split data into Training, Validation, and Test sets
3. What is a Data Leak?
4. What is Forward Propagation and Backward Propagation?
5. Dropout in Neural Networks

This applies to both categories of methods:

  1. Non-Neural Network Methods — Traditional ML algorithms (e.g., SVM, Random Forest, Logistic Regression, KNN, Naive Bayes)

  2. Neural Network Methods (Deep Learning) — Multi-layered networks that learn complex patterns


Important Note:

Don't jump straight to Neural Networks or Deep Learning!

Always check first if the problem can be solved using traditional machine learning models. Neural networks are powerful but come with added complexity, longer training times, and larger data requirements.

Use Neural Networks when:

  • You have large amounts of data
  • The problem involves images, audio, video, or text
  • Traditional methods aren't giving good results
  • The patterns are highly complex and non-linear

Stick with traditional ML when:

  • You have limited data
  • The problem is relatively simple
  • You need interpretability/explainability
  • You have limited computational resources
  • You need fast training and deployment

Rule of thumb: Start simple, add complexity only when necessary.

Here are 5 famous non-neural network machine learning methods:

  1. Decision Trees / Random Forest — Tree-based models that split data based on feature thresholds. Random Forest combines multiple trees for better accuracy.
  2. Support Vector Machines (SVM) — Finds the optimal hyperplane to separate classes with maximum margin.
  3. K-Nearest Neighbors (KNN) — Classifies based on majority vote of the K closest data points.
  4. Naive Bayes — Probabilistic classifier based on Bayes' theorem, assumes feature independence.
  5. Linear/Logistic Regression — Linear models for predicting continuous values (linear) or class probabilities (logistic).

Bonus mentions: Gradient Boosting (XGBoost, LightGBM), K-Means Clustering, Principal Component Analysis (PCA)
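To make the "start simple" advice concrete, here is a minimal sketch of training one of these traditional models with scikit-learn. The toy iris dataset and the parameter values are placeholders chosen only for illustration:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy dataset, used here only to keep the example self-contained
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A traditional model like Random Forest trains in seconds on a CPU
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))

If a simple baseline like this already meets your accuracy target, there may be no need for a neural network at all.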


Handling Data: Train-Validation-Test Split

We split the given data into two main parts:

1. Training + Validation Data (80%)

  • a. Training Data (70% of the total) — Model learns from this
  • b. Validation Data (10% of the total) — Used for tuning hyperparameters and monitoring overfitting

2. Test Data (20%) — Completely unseen by the model

  • Keep it hidden until final evaluation
  • If you accidentally use it during training or tuning, it's called a Data Leak

Purpose of Each Split

Split      | Purpose                                               | When Used
Training   | Model learns patterns/weights                         | During training
Validation | Tune hyperparameters, early stopping, model selection | During training (but the model doesn't learn from it)
Test       | Final unbiased evaluation                             | Only once, at the very end

Example

Suppose you have 1,000 samples:

  • Training: 700 samples → Model learns from this
  • Validation: 100 samples → Check if model is overfitting, tune learning rate, number of layers, etc.
  • Test: 200 samples → Final accuracy report (touch it only once!)



What is a Data Leak?

A data leak occurs when information from the test set (or future data) accidentally influences the training process.

Examples of data leaks:

  • Using test data to select features
  • Normalizing data using mean/std of the entire dataset (including test)
  • Tuning hyperparameters based on test set performance

Why it's bad: Your reported accuracy becomes artificially inflated and won't reflect real-world performance. (A sketch of how to normalize without leaking appears after the split code below.)


Code Example (Python/Scikit-learn)

from sklearn.model_selection import train_test_split

# First split: 80% train+val, 20% test
train_val, test = train_test_split(data, test_size=0.20, random_state=42)

# Second split: 70% train, 10% validation (from the 80%)
train, val = train_test_split(train_val, test_size=0.125, random_state=42)
# Note: 0.125 of 80% = 10% of total

# Now:
# train = 70%
# val = 10%
# test = 20%
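To avoid the normalization leak described earlier, fit any preprocessing step (scaler, encoder, feature selector) on the training split only, then apply it to the validation and test splits. A minimal sketch, assuming the train/val/test splits above hold numeric features:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Correct: learn the mean/std from training data only
train_scaled = scaler.fit_transform(train)

# Validation and test data are only transformed, never fitted on
val_scaled = scaler.transform(val)
test_scaled = scaler.transform(test)

# Wrong (data leak): calling scaler.fit(data) on the full dataset
# before splitting lets test-set statistics influence training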

What is Forward Propagation and Backward Propagation?



Forward Propagation vs Backward Propagation


Forward Propagation

Definition: The process of passing input data through the network layer by layer to get an output (prediction). No learning happens here — we just compute the result.

What stays fixed:

  • Weights
  • Biases
  • Hyperparameters

What we do:

  • Input → Hidden Layers → Output
  • Apply activation functions at each layer
  • Get a prediction

Step-by-step process:

Input (X)
    ↓
[Layer 1] → Z1 = (W1 × X) + B1 → A1 = Activation(Z1)
    ↓
[Layer 2] → Z2 = (W2 × A1) + B2 → A2 = Activation(Z2)
    ↓
[Output] → Prediction (ŷ)
    ↓
Compare with actual (y) → Calculate Loss/Error
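Here is a minimal NumPy sketch of the two-layer forward pass above. The layer sizes, random weights, and sigmoid activation are made-up choices purely for illustration:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up shapes: 3 input features, 4 hidden units, 1 output
X = np.array([[0.5, 0.1, 0.9]]).T               # input, shape (3, 1)
W1 = np.random.randn(4, 3); B1 = np.zeros((4, 1))
W2 = np.random.randn(1, 4); B2 = np.zeros((1, 1))

# Layer 1: Z1 = (W1 × X) + B1, A1 = Activation(Z1)
Z1 = W1 @ X + B1
A1 = sigmoid(Z1)

# Output layer: Z2 = (W2 × A1) + B2
Z2 = W2 @ A1 + B2
y_hat = sigmoid(Z2)                             # prediction ŷ

loss = (y_hat - 1.0) ** 2                       # compare with actual label y = 1

No weights change anywhere in this pass — it is computation only.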

Example:

Suppose you're classifying images of cats vs dogs:

  • Input: Pixel values of an image
  • Forward pass: Image flows through the network
  • Output: Probability [0.8 cat, 0.2 dog]
  • Loss: Compare with actual label (say, cat = 1) → Calculate error

When is forward propagation used?

  • During training (to get predictions before backprop)
  • During validation (to check performance)
  • During testing (final evaluation)
  • During inference/production (real-world predictions)

Backward Propagation (Backpropagation)

Definition: The process of calculating gradients and updating weights/biases to minimize the error. This is where the network actually learns.

What gets updated:

  • Weights
  • Biases

What stays fixed:

  • Hyperparameters (learning rate, etc. — set before training)

Step-by-step process:

Loss/Error (from forward prop)
    ↓
Calculate gradient of loss w.r.t output layer weights (∂Loss/∂W)
    ↓
Propagate error backwards layer by layer (Chain Rule)
    ↓
Calculate gradients for each layer
    ↓
Update weights: W_new = W_old - (learning_rate × gradient)
    ↓
Update biases: B_new = B_old - (learning_rate × gradient)

Key concept — Chain Rule:

Backpropagation uses calculus (chain rule) to figure out how much each weight contributed to the error.

∂Loss/∂W1 = ∂Loss/∂A2 × ∂A2/∂Z2 × ∂Z2/∂A1 × ∂A1/∂Z1 × ∂Z1/∂W1
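To see the chain rule and the update formulas in action, here is a minimal NumPy sketch of one backward pass for a single sigmoid layer. The shapes, learning rate, and squared-error loss are assumptions for illustration, not a full multi-layer backpropagation:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

learning_rate = 0.1
X = np.array([[0.5, 0.1, 0.9]]).T     # input, shape (3, 1)
y = 1.0                               # actual label
W = np.random.randn(1, 3); B = np.zeros((1, 1))

# Forward pass
Z = W @ X + B
A = sigmoid(Z)                        # prediction
loss = (A - y) ** 2

# Backward pass: chain rule, ∂Loss/∂W = ∂Loss/∂A × ∂A/∂Z × ∂Z/∂W
dLoss_dA = 2 * (A - y)
dA_dZ = A * (1 - A)                   # derivative of sigmoid
grad_W = (dLoss_dA * dA_dZ) @ X.T     # ∂Z/∂W = X
grad_B = dLoss_dA * dA_dZ

# Update: W_new = W_old - (learning_rate × gradient)
W = W - learning_rate * grad_W
B = B - learning_rate * grad_B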

Example:

Continuing the cat/dog example:

  • Forward prop predicted: [0.8 cat, 0.2 dog]
  • Actual label: cat (1)
  • Loss: Small error (0.8 is close to 1)
  • Backprop: Calculate how to adjust weights so next time it predicts 0.85 or higher

When is backpropagation used?

  • Only during training
  • Never during validation, testing, or inference

Side-by-Side Comparison

Aspect     | Forward Propagation                      | Backward Propagation
Direction  | Input → Output                           | Output → Input
Purpose    | Get prediction                           | Learn from errors
Weights    | Used (not changed)                       | Updated
Biases     | Used (not changed)                       | Updated
Math       | Matrix multiplication + activation       | Gradients + chain rule
When used  | Training, validation, testing, inference | Training only

Simple Analogy

Forward Propagation: You take an exam and submit your answers. You get a score (prediction). You haven't learned anything yet — just attempted.

Backward Propagation: You review your mistakes, understand where you went wrong, and adjust your understanding (weights) so you do better next time.


One Training Iteration (repeating it over all batches completes one epoch)

1. Forward Prop  → Get prediction
2. Calculate Loss → How wrong were we?
3. Backward Prop → Calculate gradients
4. Update Weights → Adjust to reduce error
5. Repeat for all batches

Code Illustration (PyTorch)

for epoch in range(epochs):
    # Forward Propagation
    predictions = model(inputs)
    loss = loss_function(predictions, labels)
    
    # Backward Propagation
    optimizer.zero_grad()   # Clear previous gradients
    loss.backward()         # Calculate gradients (backprop)
    optimizer.step()        # Update weights and biases
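The loop above is a simplified full-batch version, assuming inputs, labels, model, loss_function, and optimizer are already defined. With mini-batches (assuming a PyTorch DataLoader named train_loader), one epoch repeats the same steps for every batch:

for epoch in range(epochs):
    for inputs, labels in train_loader:             # one batch at a time
        predictions = model(inputs)                 # 1. Forward Prop
        loss = loss_function(predictions, labels)   # 2. Calculate Loss
        optimizer.zero_grad()
        loss.backward()                             # 3. Backward Prop
        optimizer.step()                            # 4. Update Weights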

Dropout in Neural Networks

Definition: Dropout is a regularization technique where we randomly "turn off" (deactivate) a percentage of neurons during training to prevent overfitting.


How it works

During each training iteration:

  • Randomly select neurons to deactivate (e.g., 50%)
  • These neurons don't participate in forward or backward propagation
  • Different neurons are dropped in each iteration

Without Dropout:          With Dropout (50%):
    
  O — O — O                 O — X — O
  |   |   |                 |       |
  O — O — O       →         X — O — X
  |   |   |                     |    
  O — O — O                 O — O — O

(All active)              (X = dropped)
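A minimal NumPy sketch of what a 50% dropout mask does to one layer's activations during a single training step (the activation values are made up; real frameworks such as PyTorch also rescale the surviving activations by 1/(1 - rate), so-called inverted dropout, shown in the last line):

import numpy as np

dropout_rate = 0.5
activations = np.array([0.8, 0.3, 0.5, 0.9, 0.1, 0.7])   # made-up layer outputs

# Randomly "turn off" roughly 50% of the neurons for this iteration
mask = np.random.rand(activations.size) > dropout_rate
dropped = activations * mask

# Rescale survivors so the expected output stays the same (inverted dropout);
# at validation/test time no mask is applied at all
dropped = dropped / (1 - dropout_rate)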

Why use Dropout?

Problem        | How Dropout Helps
Overfitting    | Forces network to not rely on specific neurons
Co-adaptation  | Prevents neurons from depending too much on each other
Generalization | Creates a more robust model that works on unseen data

Simple Analogy

Imagine a team project where the same 2 people do all the work. If they're absent, the team fails.

Dropout is like randomly making team members "absent" during practice — forcing everyone to learn the work. Now the team is stronger and doesn't depend on specific individuals.


Key Points

  • Applied only during training (turned off during validation/testing)
  • Common dropout rates: 20%–50%
  • Dropout rate = 0.5 means 50% of neurons are randomly dropped each iteration

Code Example (PyTorch)

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),
    nn.Dropout(0.5),    # 50% dropout
    nn.Linear(64, 10)
)

model.train()   # Dropout ON
model.eval()    # Dropout OFF

When to use Dropout?

  • Model is overfitting (training accuracy high, validation accuracy low)
  • You have a large network with many parameters
  • Limited training data

One point that often causes confusion:

Dropout is ONLY applied during training.

Phase      | Dropout
Training   | ON ✓
Validation | OFF ✗
Testing    | OFF ✗

Why?

  • Training: Dropout randomly disables neurons to prevent overfitting — it forces the network to learn more robust features.

  • Validation/Testing: You want the full model (all neurons active) to make the best possible predictions. No randomness — just evaluation.

Simple analogy:

Think of it like a sports team practicing with a handicap (e.g., playing with fewer players) to get stronger. But during the actual game (validation/test), you use all your players.

20 Interview Quiz Questions — Machine Learning Fundamentals


Data Handling & Splitting

1. What is the typical split ratio for training, validation, and test data?

2. What is the purpose of the validation set, and how does it differ from the test set?

3. What is a "Data Leak" in machine learning, and why is it problematic?

4. Give two examples of how data leakage can accidentally occur.

5. When should you use the test data — and how many times?


Neural Networks vs Traditional ML

6. Name 5 non-neural network machine learning algorithms.

7. Why shouldn't you jump straight to neural networks for every problem?

8. When is it appropriate to use traditional ML methods over deep learning?

9. When should you prefer neural networks over traditional ML?

10. What are the disadvantages of using neural networks compared to traditional ML?


Forward Propagation

11. What is forward propagation, and what is its purpose?

12. During forward propagation, are the weights and biases updated? Why or why not?

13. In which phases is forward propagation used — training, validation, testing, or inference?

14. What is the formula for computing a layer's output in forward propagation?


Backward Propagation

15. What is backpropagation, and what does it accomplish?

16. Which parameters are updated during backpropagation?

17. What mathematical concept is central to backpropagation for calculating gradients?

18. Is backpropagation used during validation or testing? Explain.

19. Write the weight update formula used in backpropagation.


Conceptual

20. Explain the difference between forward propagation and backward propagation using a simple analogy.


Answer Key

1. 70% training, 10% validation, 20% test
2. Validation is for tuning hyperparameters and monitoring overfitting; test is for final unbiased evaluation
3. When test/future data influences training, causing artificially inflated accuracy
4. Using test data for feature selection; normalizing using the entire dataset's mean/std
5. Only once, at the very end, for final evaluation
6. Decision Trees, Random Forest, SVM, KNN, Naive Bayes, Logistic Regression, XGBoost
7. Neural networks need more data, time, and resources, and are less interpretable
8. Limited data, simple problems, need for interpretability, limited compute
9. Large data, images/audio/text, complex non-linear patterns
10. Longer training, more data required, less explainability, higher compute cost
11. Passing input through the network layer by layer to get a prediction
12. No — weights and biases are only used, not updated
13. All four: training, validation, testing, and inference
14. Z = (W × X) + B, then A = Activation(Z)
15. Calculating gradients and updating weights/biases to minimize error
16. Weights and biases
17. Chain rule (calculus)
18. No — only during training
19. W_new = W_old - (learning_rate × gradient)
20. Forward prop = taking an exam; Backward prop = reviewing mistakes and learning from them





