
How to Handle Training Data for ML Models

ML — Machine Learning

How do we handle data?

1. Don't jump straight to Neural Networks or Deep Learning!
2. How to split data into Training, Validation, and Test sets
3. What is a Data Leak?
4. What is Forward Propagation and Backward Propagation?
5. Dropout in Neural Networks

This applies to both categories of methods:

  1. Non-Neural Network Methods — Traditional ML algorithms (e.g., SVM, Random Forest, Logistic Regression, KNN, Naive Bayes)

  2. Neural Network Methods (Deep Learning) — Multi-layered networks that learn complex patterns


Important Note:

Don't jump straight to Neural Networks or Deep Learning!

Always check first if the problem can be solved using traditional machine learning models. Neural networks are powerful but come with added complexity, longer training times, and larger data requirements.

Use Neural Networks when:

  • You have large amounts of data
  • The problem involves images, audio, video, or text
  • Traditional methods aren't giving good results
  • The patterns are highly complex and non-linear

Stick with traditional ML when:

  • You have limited data
  • The problem is relatively simple
  • You need interpretability/explainability
  • You have limited computational resources
  • You need fast training and deployment

Rule of thumb: Start simple, add complexity only when necessary.

Here are 5 famous non-neural network machine learning methods:

  1. Decision Trees / Random Forest — Tree-based models that split data based on feature thresholds. Random Forest combines multiple trees for better accuracy.
  2. Support Vector Machines (SVM) — Finds the optimal hyperplane to separate classes with maximum margin.
  3. K-Nearest Neighbors (KNN) — Classifies based on majority vote of the K closest data points.
  4. Naive Bayes — Probabilistic classifier based on Bayes' theorem, assumes feature independence.
  5. Linear/Logistic Regression — Linear models for predicting continuous values (linear) or class probabilities (logistic).

Bonus mentions: Gradient Boosting (XGBoost, LightGBM), K-Means Clustering, Principal Component Analysis (PCA)
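To make the "start simple" advice concrete, here is a minimal sketch of training one of these traditional models with scikit-learn. The toy iris dataset and the parameter values are placeholders chosen only for illustration:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy dataset, used here only to keep the example self-contained
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A traditional model like Random Forest trains in seconds on a CPU
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))

If a simple baseline like this already meets your accuracy target, there may be no need for a neural network at all.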


Handling Data: Train-Validation-Test Split

We split the given data into two main parts:

1. Training + Validation Data (80%)

  • a. Training Data (70% of the total) — Model learns from this
  • b. Validation Data (10% of the total) — Used for tuning hyperparameters and monitoring overfitting

2. Test Data (20%) — Completely unseen by the model

  • Keep it hidden until final evaluation
  • If you accidentally use it during training or tuning, it's called a Data Leak

Purpose of Each Split

Split      | Purpose                                               | When Used
Training   | Model learns patterns/weights                         | During training
Validation | Tune hyperparameters, early stopping, model selection | During training (but the model doesn't learn from it)
Test       | Final unbiased evaluation                             | Only once, at the very end

Example

Suppose you have 1,000 samples:

  • Training: 700 samples → Model learns from this
  • Validation: 100 samples → Check if model is overfitting, tune learning rate, number of layers, etc.
  • Test: 200 samples → Final accuracy report (touch it only once!)



What is a Data Leak?

A data leak occurs when information from the test set (or future data) accidentally influences the training process.

Examples of data leaks:

  • Using test data to select features
  • Normalizing data using mean/std of the entire dataset (including test)
  • Tuning hyperparameters based on test set performance

Why it's bad: Your reported accuracy becomes artificially inflated and won't reflect real-world performance. (A sketch of how to normalize without leaking appears after the split code below.)


Code Example (Python/Scikit-learn)

from sklearn.model_selection import train_test_split

# First split: 80% train+val, 20% test
train_val, test = train_test_split(data, test_size=0.20, random_state=42)

# Second split: 70% train, 10% validation (from the 80%)
train, val = train_test_split(train_val, test_size=0.125, random_state=42)
# Note: 0.125 of 80% = 10% of total

# Now:
# train = 70%
# val = 10%
# test = 20%
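To avoid the normalization leak described earlier, fit any preprocessing step (scaler, encoder, feature selector) on the training split only, then apply it to the validation and test splits. A minimal sketch, assuming the train/val/test splits above hold numeric features:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Correct: learn the mean/std from training data only
train_scaled = scaler.fit_transform(train)

# Validation and test data are only transformed, never fitted on
val_scaled = scaler.transform(val)
test_scaled = scaler.transform(test)

# Wrong (data leak): calling scaler.fit(data) on the full dataset
# before splitting lets test-set statistics influence training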

What is Forward Propagation and Backward Propagation?



Forward Propagation vs Backward Propagation


Forward Propagation

Definition: The process of passing input data through the network layer by layer to get an output (prediction). No learning happens here — we just compute the result.

What stays fixed:

  • Weights
  • Biases
  • Hyperparameters

What we do:

  • Input → Hidden Layers → Output
  • Apply activation functions at each layer
  • Get a prediction

Step-by-step process:

Input (X)
    ↓
[Layer 1] → Z1 = (W1 × X) + B1 → A1 = Activation(Z1)
    ↓
[Layer 2] → Z2 = (W2 × A1) + B2 → A2 = Activation(Z2)
    ↓
[Output] → Prediction (ŷ)
    ↓
Compare with actual (y) → Calculate Loss/Error
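Here is a minimal NumPy sketch of the two-layer forward pass above. The layer sizes, random weights, and sigmoid activation are made-up choices purely for illustration:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up shapes: 3 input features, 4 hidden units, 1 output
X = np.array([[0.5, 0.1, 0.9]]).T               # input, shape (3, 1)
W1 = np.random.randn(4, 3); B1 = np.zeros((4, 1))
W2 = np.random.randn(1, 4); B2 = np.zeros((1, 1))

# Layer 1: Z1 = (W1 × X) + B1, A1 = Activation(Z1)
Z1 = W1 @ X + B1
A1 = sigmoid(Z1)

# Output layer: Z2 = (W2 × A1) + B2
Z2 = W2 @ A1 + B2
y_hat = sigmoid(Z2)                             # prediction ŷ

loss = (y_hat - 1.0) ** 2                       # compare with actual label y = 1

No weights change anywhere in this pass — it is computation only.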

Example:

Suppose you're classifying images of cats vs dogs:

  • Input: Pixel values of an image
  • Forward pass: Image flows through the network
  • Output: Probability [0.8 cat, 0.2 dog]
  • Loss: Compare with actual label (say, cat = 1) → Calculate error

When is forward propagation used?

  • During training (to get predictions before backprop)
  • During validation (to check performance)
  • During testing (final evaluation)
  • During inference/production (real-world predictions)

Backward Propagation (Backpropagation)

Definition: The process of calculating gradients and updating weights/biases to minimize the error. This is where the network actually learns.

What gets updated:

  • Weights
  • Biases

What stays fixed:

  • Hyperparameters (learning rate, etc. — set before training)

Step-by-step process:

Loss/Error (from forward prop)
    ↓
Calculate gradient of loss w.r.t output layer weights (∂Loss/∂W)
    ↓
Propagate error backwards layer by layer (Chain Rule)
    ↓
Calculate gradients for each layer
    ↓
Update weights: W_new = W_old - (learning_rate × gradient)
    ↓
Update biases: B_new = B_old - (learning_rate × gradient)

Key concept — Chain Rule:

Backpropagation uses calculus (chain rule) to figure out how much each weight contributed to the error.

∂Loss/∂W1 = ∂Loss/∂A2 × ∂A2/∂Z2 × ∂Z2/∂A1 × ∂A1/∂Z1 × ∂Z1/∂W1
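To see the chain rule and the update formulas in action, here is a minimal NumPy sketch of one backward pass for a single sigmoid layer. The shapes, learning rate, and squared-error loss are assumptions for illustration, not a full multi-layer backpropagation:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

learning_rate = 0.1
X = np.array([[0.5, 0.1, 0.9]]).T     # input, shape (3, 1)
y = 1.0                               # actual label
W = np.random.randn(1, 3); B = np.zeros((1, 1))

# Forward pass
Z = W @ X + B
A = sigmoid(Z)                        # prediction
loss = (A - y) ** 2

# Backward pass: chain rule, ∂Loss/∂W = ∂Loss/∂A × ∂A/∂Z × ∂Z/∂W
dLoss_dA = 2 * (A - y)
dA_dZ = A * (1 - A)                   # derivative of sigmoid
grad_W = (dLoss_dA * dA_dZ) @ X.T     # ∂Z/∂W = X
grad_B = dLoss_dA * dA_dZ

# Update: W_new = W_old - (learning_rate × gradient)
W = W - learning_rate * grad_W
B = B - learning_rate * grad_B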

Example:

Continuing the cat/dog example:

  • Forward prop predicted: [0.8 cat, 0.2 dog]
  • Actual label: cat (1)
  • Loss: Small error (0.8 is close to 1)
  • Backprop: Calculate how to adjust weights so next time it predicts 0.85 or higher

When is backpropagation used?

  • Only during training
  • Never during validation, testing, or inference

Side-by-Side Comparison

Aspect     | Forward Propagation                      | Backward Propagation
Direction  | Input → Output                           | Output → Input
Purpose    | Get prediction                           | Learn from errors
Weights    | Used (not changed)                       | Updated
Biases     | Used (not changed)                       | Updated
Math       | Matrix multiplication + activation       | Gradients + chain rule
When used  | Training, validation, testing, inference | Training only

Simple Analogy

Forward Propagation: You take an exam and submit your answers. You get a score (prediction). You haven't learned anything yet — just attempted.

Backward Propagation: You review your mistakes, understand where you went wrong, and adjust your understanding (weights) so you do better next time.


One Training Iteration (repeating it over all batches completes one epoch)

1. Forward Prop  → Get prediction
2. Calculate Loss → How wrong were we?
3. Backward Prop → Calculate gradients
4. Update Weights → Adjust to reduce error
5. Repeat for all batches

Code Illustration (PyTorch)

for epoch in range(epochs):
    # Forward Propagation
    predictions = model(inputs)
    loss = loss_function(predictions, labels)
    
    # Backward Propagation
    optimizer.zero_grad()   # Clear previous gradients
    loss.backward()         # Calculate gradients (backprop)
    optimizer.step()        # Update weights and biases
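The loop above is a simplified full-batch version, assuming inputs, labels, model, loss_function, and optimizer are already defined. With mini-batches (assuming a PyTorch DataLoader named train_loader), one epoch repeats the same steps for every batch:

for epoch in range(epochs):
    for inputs, labels in train_loader:             # one batch at a time
        predictions = model(inputs)                 # 1. Forward Prop
        loss = loss_function(predictions, labels)   # 2. Calculate Loss
        optimizer.zero_grad()
        loss.backward()                             # 3. Backward Prop
        optimizer.step()                            # 4. Update Weights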

Dropout in Neural Networks

Definition: Dropout is a regularization technique where we randomly "turn off" (deactivate) a percentage of neurons during training to prevent overfitting.


How it works

During each training iteration:

  • Randomly select neurons to deactivate (e.g., 50%)
  • These neurons don't participate in forward or backward propagation
  • Different neurons are dropped in each iteration

Without Dropout:          With Dropout (50%):
    
  O — O — O                 O — X — O
  |   |   |                 |       |
  O — O — O       →         X — O — X
  |   |   |                     |    
  O — O — O                 O — O — O

(All active)              (X = dropped)
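A minimal NumPy sketch of what a 50% dropout mask does to one layer's activations during a single training step (the activation values are made up; real frameworks such as PyTorch also rescale the surviving activations by 1/(1 - rate), so-called inverted dropout, shown in the last line):

import numpy as np

dropout_rate = 0.5
activations = np.array([0.8, 0.3, 0.5, 0.9, 0.1, 0.7])   # made-up layer outputs

# Randomly "turn off" roughly 50% of the neurons for this iteration
mask = np.random.rand(activations.size) > dropout_rate
dropped = activations * mask

# Rescale survivors so the expected output stays the same (inverted dropout);
# at validation/test time no mask is applied at all
dropped = dropped / (1 - dropout_rate)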

Why use Dropout?

Problem        | How Dropout Helps
Overfitting    | Forces network to not rely on specific neurons
Co-adaptation  | Prevents neurons from depending too much on each other
Generalization | Creates a more robust model that works on unseen data

Simple Analogy

Imagine a team project where the same 2 people do all the work. If they're absent, the team fails.

Dropout is like randomly making team members "absent" during practice — forcing everyone to learn the work. Now the team is stronger and doesn't depend on specific individuals.


Key Points

  • Applied only during training (turned off during validation/testing)
  • Common dropout rates: 20%–50%
  • Dropout rate = 0.5 means 50% of neurons are randomly dropped each iteration

Code Example (PyTorch)

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),
    nn.Dropout(0.5),    # 50% dropout
    nn.Linear(64, 10)
)

model.train()   # Dropout ON
model.eval()    # Dropout OFF

When to use Dropout?

  • Model is overfitting (training accuracy high, validation accuracy low)
  • You have a large network with many parameters
  • Limited training data

One point that often causes confusion:

Dropout is ONLY applied during training.

Phase      | Dropout
Training   | ON ✓
Validation | OFF ✗
Testing    | OFF ✗

Why?

  • Training: Dropout randomly disables neurons to prevent overfitting — it forces the network to learn more robust features.

  • Validation/Testing: You want the full model (all neurons active) to make the best possible predictions. No randomness — just evaluation.

Simple analogy:

Think of it like a sports team practicing with a handicap (e.g., playing with fewer players) to get stronger. But during the actual game (validation/test), you use all your players.

20 Interview Quiz Questions — Machine Learning Fundamentals


Data Handling & Splitting

1. What is the typical split ratio for training, validation, and test data?

2. What is the purpose of the validation set, and how does it differ from the test set?

3. What is a "Data Leak" in machine learning, and why is it problematic?

4. Give two examples of how data leakage can accidentally occur.

5. When should you use the test data — and how many times?


Neural Networks vs Traditional ML

6. Name 5 non-neural network machine learning algorithms.

7. Why shouldn't you jump straight to neural networks for every problem?

8. When is it appropriate to use traditional ML methods over deep learning?

9. When should you prefer neural networks over traditional ML?

10. What are the disadvantages of using neural networks compared to traditional ML?


Forward Propagation

11. What is forward propagation, and what is its purpose?

12. During forward propagation, are the weights and biases updated? Why or why not?

13. In which phases is forward propagation used — training, validation, testing, or inference?

14. What is the formula for computing a layer's output in forward propagation?


Backward Propagation

15. What is backpropagation, and what does it accomplish?

16. Which parameters are updated during backpropagation?

17. What mathematical concept is central to backpropagation for calculating gradients?

18. Is backpropagation used during validation or testing? Explain.

19. Write the weight update formula used in backpropagation.


Conceptual

20. Explain the difference between forward propagation and backward propagation using a simple analogy.


Answer Key

1. 70% training, 10% validation, 20% test
2. Validation is for tuning hyperparameters and monitoring overfitting; test is for final unbiased evaluation
3. When test/future data influences training, causing artificially inflated accuracy
4. Using test data for feature selection; normalizing using the entire dataset's mean/std
5. Only once, at the very end, for final evaluation
6. Decision Trees, Random Forest, SVM, KNN, Naive Bayes, Logistic Regression, XGBoost
7. Neural networks need more data, time, and resources, and are less interpretable
8. Limited data, simple problems, need for interpretability, limited compute
9. Large data, images/audio/text, complex non-linear patterns
10. Longer training, more data required, less explainability, higher compute cost
11. Passing input through the network layer by layer to get a prediction
12. No — weights and biases are only used, not updated
13. All four: training, validation, testing, and inference
14. Z = (W × X) + B, then A = Activation(Z)
15. Calculating gradients and updating weights/biases to minimize error
16. Weights and biases
17. Chain rule (calculus)
18. No — only during training
19. W_new = W_old - (learning_rate × gradient)
20. Forward prop = taking an exam; Backward prop = reviewing mistakes and learning from them





