Random Variables and Distributions - Complete Guide


Random Variables: A random variable is a function that maps outcomes of random experiments to numerical values, not a traditional "variable" but a mathematical transformation. For example, in coin flipping, we might assign heads=1 and tails=0. Two types exist: discrete random variables take countable values (0,1,2,3...) like number of customers or dice rolls, while continuous random variables take any value within intervals (height, time, temperature). Random variables are characterized by their probability distributions, expected value (long-run average), and variance (measure of spread). They provide the mathematical foundation for modeling uncertainty in real-world phenomena, enabling statistical analysis, predictions, and decision-making across virtually all quantitative fields including science, engineering, finance, and machine learning.

Probability Distribution Functions: These mathematical functions describe how probabilities are distributed across possible values of a random variable. Three main types exist: PMF (Probability Mass Function) for discrete variables provides exact probabilities P(X=x) for each value, like P(die=3)=1/6. PDF (Probability Density Function) for continuous variables describes probability density where area under the curve gives probability over intervals; exact point probabilities equal zero. CDF (Cumulative Distribution Function) works for both types, giving F(x)=P(X≤x), the probability X doesn't exceed x. All distribution functions must be non-negative and sum/integrate to one. They characterize random variables completely, enabling calculation of probabilities, means, variances, and supporting statistical inference across science, engineering, finance, and data science applications. Common examples: Binomial (coin flips), Poisson (events over time), Normal (bell curve).

Discrete Distributions: Discrete distributions describe random variables that take countable values (0,1,2,3...) with gaps between possible outcomes. Key distributions include: Bernoulli (single yes/no trial with probability p), Binomial (counting successes in n independent trials), Poisson (events occurring in fixed time/space intervals at average rate λ), Geometric (trials until first success), and Negative Binomial (trials until r-th success). Each uses a Probability Mass Function (PMF) giving exact probabilities P(X=x) for each value, where all probabilities are non-negative and sum to one. Common applications include quality control (defective items), customer arrivals, test questions answered correctly, and disease occurrences. These distributions model real-world scenarios involving counting and enable probability calculations, hypothesis testing, and statistical inference.

Expectation and Variance: Expectation (or expected value) E[X] is the long-run average value of a random variable, calculated as Σx·P(X=x) for discrete or ∫x·f(x)dx for continuous variables. It represents the "center" of the distribution. Variance Var(X)=E[(X-μ)²] measures spread or variability around the mean, indicating how dispersed values are from expectation. Computed as E[X²]-(E[X])², variance is always non-negative. Standard deviation σ=√Var(X) provides spread in original units. Key properties: E[aX+b]=aE[X]+b (linearity), Var(aX+b)=a²Var(X) (scaling), and for independent variables, E[X+Y]=E[X]+E[Y] and Var(X+Y)=Var(X)+Var(Y). These fundamental measures characterize distributions and enable statistical inference, risk assessment, and decision-making across all quantitative fields.

Let me explain this comprehensively, building from the basics to more advanced concepts.


Part 1: What is a Random Variable?

The Intuitive Idea

A random variable is NOT actually a "variable" in the traditional sense - it's a function that assigns numerical values to outcomes of a random experiment.

Think of it like this:

  • You perform a random experiment (flip coins, roll dice, measure height)
  • Different outcomes can happen
  • A random variable assigns a NUMBER to each possible outcome

Formal Definition

A random variable is a function that maps outcomes from a sample space to real numbers.

Random Variable: X : S → ℝ
Where S = Sample Space (all possible outcomes)

Simple Example: Coin Flips

Experiment: Flip a coin twice

Sample Space: S = {HH, HT, TH, TT}

Random Variable X = "Number of heads"

  • X(HH) = 2
  • X(HT) = 1
  • X(TH) = 1
  • X(TT) = 0

The random variable X assigns numbers to outcomes!
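
To make the function idea concrete in code, here is a minimal Python sketch (the name X is just illustrative) that maps each outcome of two coin flips to its number of heads:

```python
# A random variable is a function: outcome -> number.
sample_space = ["HH", "HT", "TH", "TT"]

def X(outcome):
    """X = number of heads in the outcome."""
    return outcome.count("H")

for outcome in sample_space:
    print(f"X({outcome}) = {X(outcome)}")
# Output: X(HH) = 2, X(HT) = 1, X(TH) = 1, X(TT) = 0
```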


Part 2: Types of Random Variables

There are two main types:

1. Discrete Random Variables

Definition: Can take on countable values (you can list them: 0, 1, 2, 3, ...)

Examples:

  • Number of heads in 10 coin flips: {0, 1, 2, ..., 10}
  • Number of customers per hour: {0, 1, 2, 3, ...}
  • Number of defective items in a batch: {0, 1, 2, ..., n}
  • Score on a die roll: {1, 2, 3, 4, 5, 6}

Key characteristic: There are gaps between possible values.

2. Continuous Random Variables

Definition: Can take on any value within an interval (uncountably infinite)

Examples:

  • Height of a person: any value between 0 and 8 feet
  • Time until next phone call: any value ≥ 0
  • Temperature: any real number
  • Weight of a product: any positive real number

Key characteristic: No gaps - every value in an interval is possible.


Part 3: Probability Distributions

A probability distribution describes how probabilities are distributed over the values of a random variable.

For Discrete Random Variables: PMF

PMF = Probability Mass Function

The PMF gives the probability that X equals a specific value:

P(X = x) = probability that random variable X equals value x

Properties:

  1. Non-negative: P(X = x) ≥ 0 for all x
  2. Sums to 1: Σ P(X = x) = 1 (sum over all possible values)

Example: Die Roll

X = outcome of rolling a fair die

x        1     2     3     4     5     6
P(X=x)   1/6   1/6   1/6   1/6   1/6   1/6

For Continuous Random Variables: PDF

PDF = Probability Density Function

For continuous variables, P(X = exact value) = 0, because probability is spread over uncountably many possible values!

Instead, we use the PDF f(x) and calculate probabilities over intervals:

P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx

Properties:

  1. Non-negative: f(x) ≥ 0 for all x
  2. Integrates to 1: ∫₋∞^∞ f(x) dx = 1

Important: f(x) itself is NOT a probability! It's a density. The area under the curve gives probability.

Example: Uniform Distribution on [0, 1]

f(x) = 1  for 0 ≤ x ≤ 1
f(x) = 0  otherwise

P(0.2 ≤ X ≤ 0.5) = ∫ 1 dx over [0.2, 0.5] = 0.5 - 0.2 = 0.3
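
If SciPy is available, the same interval probability can be checked from the CDF, since P(a ≤ X ≤ b) = F(b) - F(a):

```python
from scipy.stats import uniform

# Uniform on [0, 1]: loc is the left endpoint, scale is the width.
U = uniform(loc=0, scale=1)

# P(0.2 <= X <= 0.5) = F(0.5) - F(0.2)
print(U.cdf(0.5) - U.cdf(0.2))  # 0.3
```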


Part 4: Cumulative Distribution Function (CDF)

The CDF works for BOTH discrete and continuous random variables.

Definition:

F(x) = P(X ≤ x)

The CDF gives the probability that X is less than or equal to x.

Properties:

  1. Non-decreasing: If x₁ < x₂, then F(x₁) ≤ F(x₂)
  2. Limits: lim(x→-∞) F(x) = 0, lim(x→∞) F(x) = 1
  3. Right-continuous: F(x) is continuous from the right

For Discrete Variables:

F(x) = Σ P(X = k) for all k ≤ x

For Continuous Variables:

F(x) = ∫₋∞ˣ f(t) dt

And the PDF is the derivative of CDF:

f(x) = dF(x)/dx

Example: Die Roll CDF

x      x < 1   1 ≤ x < 2   2 ≤ x < 3   3 ≤ x < 4   4 ≤ x < 5   5 ≤ x < 6   x ≥ 6
F(x)   0       1/6         2/6         3/6         4/6         5/6         1
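
Because a discrete CDF is just a running sum of the PMF, a short NumPy sketch reproduces this table:

```python
import numpy as np

faces = np.arange(1, 7)      # die faces 1..6
pmf = np.full(6, 1/6)        # fair die: each face has probability 1/6

cdf = np.cumsum(pmf)         # F(x) = P(X <= x): running sum of the PMF
for x, F in zip(faces, cdf):
    print(f"F({x}) = {F:.4f}")   # 0.1667, 0.3333, ..., 1.0000
```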

Part 5: Expected Value (Mean)

The expected value E[X] is the long-run average value of the random variable.

For Discrete Random Variables

E[X] = μ = Σ x · P(X = x)

Sum of (each value × its probability)

Example: Die Roll

E[X] = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6)
     = (1+2+3+4+5+6)/6
     = 21/6
     = 3.5

For Continuous Random Variables

E[X] = μ = ∫₋∞^∞ x · f(x) dx

Properties of Expected Value:

  1. Linearity: E[aX + b] = aE[X] + b
  2. Linearity for sums: E[X + Y] = E[X] + E[Y]
  3. For independent variables: E[XY] = E[X]·E[Y]

Part 6: Variance and Standard Deviation

Variance measures how spread out the distribution is around the mean.

Definition

Var(X) = σ² = E[(X - μ)²]

Alternative formula (easier to compute):

Var(X) = E[X²] - [E[X]]²

For Discrete Random Variables

Var(X) = Σ (x - μ)² · P(X = x)

Or:

Var(X) = [Σ x² · P(X = x)] - μ²

For Continuous Random Variables

Var(X) = ∫₋∞^∞ (x - μ)² · f(x) dx

Standard Deviation

σ = √Var(X)

The standard deviation has the same units as X, making it more interpretable.

Properties of Variance:

  1. Var(aX + b) = a²Var(X) (the constant b shifts values without changing spread; only the scale factor a affects variance!)
  2. For independent X, Y: Var(X + Y) = Var(X) + Var(Y)
  3. Always non-negative: Var(X) ≥ 0

Example: Die Roll

E[X²] = 1²(1/6) + 2²(1/6) + 3²(1/6) + 4²(1/6) + 5²(1/6) + 6²(1/6)
      = (1+4+9+16+25+36)/6
      = 91/6

Var(X) = 91/6 - (3.5)²
       ≈ 15.17 - 12.25
       ≈ 2.92

σ = √2.92 ≈ 1.71
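
A few lines of NumPy (assuming it is installed) verify this hand computation:

```python
import numpy as np

faces = np.arange(1, 7)
pmf = np.full(6, 1/6)

mean = np.sum(faces * pmf)        # E[X] = 3.5
ex2 = np.sum(faces**2 * pmf)      # E[X²] = 91/6 ≈ 15.17
var = ex2 - mean**2               # Var(X) = E[X²] - (E[X])² ≈ 2.92
print(mean, var, np.sqrt(var))    # 3.5  2.9167  1.7078
```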

Part 7: Common Discrete Distributions

1. Bernoulli Distribution

Models: Single yes/no trial

Parameters: p (probability of success)

PMF: P(X = 1) = p, P(X = 0) = 1-p

Mean: E[X] = p

Variance: Var(X) = p(1-p)

2. Binomial Distribution

Models: Number of successes in n independent Bernoulli trials

Parameters: n (trials), p (success probability)

PMF: P(X = k) = C(n,k) · p^k · (1-p)^(n-k)

Mean: E[X] = np

Variance: Var(X) = np(1-p)

3. Poisson Distribution

Models: Number of events in a fixed interval (time/space)

Parameters: λ (average rate)

PMF: P(X = k) = (λ^k · e^(-λ)) / k!

Mean: E[X] = λ

Variance: Var(X) = λ

4. Geometric Distribution

Models: Number of trials until first success

Parameters: p (success probability)

PMF: P(X = k) = (1-p)^(k-1) · p

Mean: E[X] = 1/p

Variance: Var(X) = (1-p)/p²

5. Negative Binomial Distribution

Models: Number of trials until r-th success

Parameters: r (successes), p (success probability)

Mean: E[X] = r/p

Variance: Var(X) = r(1-p)/p²
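
All of these distributions are implemented in scipy.stats; a short sketch with illustrative parameter values shows the PMFs, means, and variances matching the formulas above:

```python
from scipy.stats import binom, poisson, geom

# Binomial(n=10, p=0.3): successes in 10 independent trials
B = binom(n=10, p=0.3)
print(B.pmf(4))              # P(X = 4) = C(10,4)·0.3⁴·0.7⁶ ≈ 0.2001
print(B.mean(), B.var())     # np = 3.0, np(1-p) = 2.1

# Poisson(λ = 2): events per interval (SciPy calls λ "mu")
print(poisson(mu=2).pmf(3))  # λ³·e^(-λ)/3! ≈ 0.1804

# Geometric(p = 0.5): trial number of the first success
print(geom(p=0.5).pmf(3))    # (1-p)²·p = 0.125
```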


Part 8: Common Continuous Distributions

1. Uniform Distribution

Models: Equal probability over an interval [a, b]

PDF: f(x) = 1/(b-a) for a ≤ x ≤ b

Mean: E[X] = (a+b)/2

Variance: Var(X) = (b-a)²/12

2. Exponential Distribution

Models: Time until next event (waiting times)

Parameters: λ (rate parameter)

PDF: f(x) = λe^(-λx) for x ≥ 0

CDF: F(x) = 1 - e^(-λx)

Mean: E[X] = 1/λ

Variance: Var(X) = 1/λ²

Memoryless property: P(X > s+t | X > s) = P(X > t)
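
The memoryless property is easy to check by simulation; a sketch with an illustrative rate λ = 0.5 and thresholds s = 2, t = 3:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.5
x = rng.exponential(scale=1/lam, size=1_000_000)  # NumPy takes scale = 1/λ

s, t = 2.0, 3.0
lhs = np.mean(x[x > s] > s + t)   # estimate of P(X > s+t | X > s)
rhs = np.mean(x > t)              # estimate of P(X > t)
print(lhs, rhs)                   # both ≈ e^(-λt) ≈ 0.2231
```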

3. Normal (Gaussian) Distribution

Models: Many natural phenomena (heights, test scores, errors)

Parameters: μ (mean), σ² (variance)

PDF: f(x) = (1/(σ√(2π))) · e^(-(x-μ)²/(2σ²))

Notation: X ~ N(μ, σ²)

Mean: E[X] = μ

Variance: Var(X) = σ²

68-95-99.7 Rule:

  • 68% of data within μ ± σ
  • 95% of data within μ ± 2σ
  • 99.7% of data within μ ± 3σ

Standard Normal: N(0, 1) - mean 0, variance 1

Z-score transformation: Z = (X - μ)/σ
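
Both the 68-95-99.7 rule and the z-score can be checked with scipy.stats.norm (the μ = 175, σ = 10 below are made-up example values):

```python
from scipy.stats import norm

# P(μ - kσ < X < μ + kσ) is the same for every normal, so use N(0, 1).
for k in (1, 2, 3):
    print(f"within ±{k}σ:", norm.cdf(k) - norm.cdf(-k))
# 0.6827, 0.9545, 0.9973

# Z-score: x = 190 with μ = 175, σ = 10 is 1.5 standard deviations up.
z = (190 - 175) / 10
print(z, norm.cdf(z))   # 1.5, P(X ≤ 190) ≈ 0.9332
```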

4. Gamma Distribution

Models: Sum of exponential random variables

Parameters: α (shape), β (rate)

Mean: E[X] = α/β

Variance: Var(X) = α/β²

5. Beta Distribution

Models: Probabilities and proportions (values between 0 and 1)

Parameters: α, β (shape parameters)

Support: 0 ≤ x ≤ 1


Part 9: Joint Distributions

When you have multiple random variables, you need joint distributions.

Joint PMF (Discrete)

P(X = x, Y = y) = probability that X=x AND Y=y

Properties:

  • ΣΣ P(X = x, Y = y) = 1 (sum over all x and y)

Joint PDF (Continuous)

P((X,Y) ∈ A) = ∬_A f(x,y) dx dy

Properties:

  • ∬ f(x,y) dx dy = 1 (integral over entire plane)

Marginal Distributions

To get the distribution of just X from a joint distribution:

Discrete:

P(X = x) = Σ P(X = x, Y = y) for all y

Continuous:

f_X(x) = ∫ f(x,y) dy

Independence

X and Y are independent if:

Discrete: P(X = x, Y = y) = P(X = x) · P(Y = y) for all x, y

Continuous: f(x,y) = f_X(x) · f_Y(y) for all x, y
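
A small NumPy sketch makes marginals and the independence check concrete (the joint PMF entries are made up for illustration):

```python
import numpy as np

# Joint PMF of X (rows) and Y (columns); all entries sum to 1.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

p_x = joint.sum(axis=1)   # marginal of X: sum over y
p_y = joint.sum(axis=0)   # marginal of Y: sum over x
print(p_x, p_y)           # [0.3 0.7]  [0.4 0.6]

# Independence would require joint[i, j] = p_x[i]·p_y[j] everywhere.
print(np.allclose(joint, np.outer(p_x, p_y)))   # False for this table
```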


Part 10: Covariance and Correlation

Covariance

Measures how two variables vary together:

Cov(X,Y) = E[(X - μ_X)(Y - μ_Y)]
         = E[XY] - E[X]·E[Y]

Properties:

  • Cov(X,X) = Var(X)
  • Cov(X,Y) = Cov(Y,X)
  • If X and Y are independent: Cov(X,Y) = 0
  • Cov(aX + b, cY + d) = ac·Cov(X,Y)

Interpretation:

  • Cov(X,Y) > 0: Positive relationship (both increase together)
  • Cov(X,Y) < 0: Negative relationship (one increases, other decreases)
  • Cov(X,Y) = 0: No linear relationship

Correlation

Pearson correlation coefficient:

ρ(X,Y) = Cov(X,Y) / (σ_X · σ_Y)

Properties:

  • -1 ≤ ρ ≤ 1
  • ρ = 1: Perfect positive linear relationship
  • ρ = -1: Perfect negative linear relationship
  • ρ = 0: No linear relationship

Advantage over covariance: Scale-independent!
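
NumPy estimates both quantities from data. The sketch below uses a synthetic linear relationship to show that rescaling changes covariance but leaves correlation untouched:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)    # y depends linearly on x, plus noise

print(np.cov(x, y)[0, 1])            # sample covariance, ≈ 2
print(np.corrcoef(x, y)[0, 1])       # correlation, ≈ 0.89 (= 2/√5)

# Rescaling x multiplies the covariance but not the correlation.
print(np.cov(10 * x, y)[0, 1])       # ≈ 20
print(np.corrcoef(10 * x, y)[0, 1])  # still ≈ 0.89
```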


Part 11: Transformations of Random Variables

For Single Variable

If Y = g(X), how do we find the distribution of Y?

Method 1: CDF Method

  1. Find F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y)
  2. Differentiate to get f_Y(y) = dF_Y(y)/dy

Method 2: Jacobian Method (for continuous)

If Y = g(X) and g is monotonic with inverse X = h(Y):

f_Y(y) = f_X(h(y)) · |dh(y)/dy|

Example: If X ~ N(0,1) and Y = X²

Then Y follows a Chi-square distribution with 1 degree of freedom.
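
A quick simulation (assuming NumPy and SciPy) supports this: quantiles of squared standard-normal draws track the χ²(1) quantiles:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
y = rng.normal(size=100_000) ** 2     # Y = X² with X ~ N(0, 1)

qs = [0.25, 0.5, 0.75, 0.95]
print(np.quantile(y, qs))             # empirical quantiles of Y
print(chi2(df=1).ppf(qs))             # theoretical χ²(1) quantiles
print(y.mean(), chi2(df=1).mean())    # both ≈ 1 (the mean of χ²(k) is k)
```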

For Multiple Variables

If we have transformations:

  • U = g₁(X,Y)
  • V = g₂(X,Y)

We use the Jacobian determinant:

f_{U,V}(u,v) = f_{X,Y}(x(u,v), y(u,v)) · |J|

Where |J| is the absolute value of the determinant of the Jacobian matrix of the inverse transformation (the partial derivatives of x and y with respect to u and v).


Part 12: Moment Generating Functions (MGF)

The MGF is a powerful tool for characterizing distributions.

Definition:

M_X(t) = E[e^(tX)]

For discrete:

M_X(t) = Σ e^(tx) · P(X = x)

For continuous:

M_X(t) = ∫ e^(tx) · f(x) dx

Why useful?

  1. Uniqueness: Each distribution has a unique MGF
  2. Moments: The n-th derivative at t=0 gives the n-th moment
M_X^(n)(0) = E[X^n]
  3. Sums: If X and Y are independent:
M_{X+Y}(t) = M_X(t) · M_Y(t)

Example: Bernoulli Distribution

X ~ Bernoulli(p)

M_X(t) = E[e^(tX)] = e^(t·0)·(1-p) + e^(t·1)·p
       = (1-p) + pe^t
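
With SymPy, the moment-extraction property can be verified symbolically for this MGF:

```python
import sympy as sp

t, p = sp.symbols('t p')
M = (1 - p) + p * sp.exp(t)        # MGF of Bernoulli(p)

m1 = sp.diff(M, t).subs(t, 0)      # E[X] = M'(0) = p
m2 = sp.diff(M, t, 2).subs(t, 0)   # E[X²] = M''(0) = p
print(m1, m2)
print(sp.simplify(m2 - m1**2))     # Var(X) = p - p² = p(1 - p)
```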

Part 13: Central Limit Theorem (CLT)

One of the most important theorems in statistics!

Statement: If X₁, X₂, ..., X_n are independent, identically distributed random variables with mean μ and variance σ², then as n → ∞:

(X̄ - μ) / (σ/√n) → N(0, 1)

Or equivalently:

X̄ ~ N(μ, σ²/n) approximately

Where X̄ = (X₁ + X₂ + ... + X_n)/n is the sample mean.

In plain English: The average of many random variables (regardless of their original distribution) follows a normal distribution!

Practical implications:

  • Works for ANY distribution (as long as it has finite mean and variance)
  • Larger n = better approximation
  • Rule of thumb: n ≥ 30 is usually sufficient

Example: Roll a die 100 times and take the average. Even though individual rolls are uniform, the average will be approximately normal with mean 3.5 and variance (2.92/100).
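
A simulation of exactly this experiment (10,000 repetitions, a number chosen just for illustration) shows the CLT at work:

```python
import numpy as np

rng = np.random.default_rng(0)
# 10,000 experiments, each averaging 100 fair-die rolls.
rolls = rng.integers(1, 7, size=(10_000, 100))   # upper bound is exclusive
means = rolls.mean(axis=1)

print(means.mean())   # ≈ 3.5, the population mean
print(means.var())    # ≈ 0.0292 ≈ 2.92/100, as predicted
# A histogram of `means` looks bell-shaped even though each roll is uniform.
```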


Part 14: Law of Large Numbers (LLN)

Weak Law of Large Numbers:

As n → ∞, the sample mean X̄ converges in probability to the population mean μ:

P(|X̄ - μ| > ε) → 0 as n → ∞

For any ε > 0, no matter how small.

In plain English: If you repeat an experiment many times, the average result gets closer and closer to the expected value.
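
A running-average simulation of fair coin flips (a minimal sketch) makes the convergence visible:

```python
import numpy as np

rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=100_000)   # fair coin: 0 or 1, so μ = 0.5

# Sample mean after the first n flips, for growing n.
running = np.cumsum(flips) / np.arange(1, len(flips) + 1)
for n in (10, 100, 1_000, 100_000):
    print(n, running[n - 1])               # drifts toward 0.5 as n grows
```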

Difference from CLT:

  • LLN: Sample mean converges to population mean
  • CLT: Sample mean has an approximately normal distribution

Part 15: Practical Applications

1. Quality Control

  • Binomial: Number of defective items in a batch
  • Poisson: Number of defects per unit area

2. Finance

  • Normal: Stock returns
  • Exponential: Time between trades
  • Lognormal: Stock prices

3. Insurance

  • Poisson: Number of claims
  • Exponential/Gamma: Claim amounts

4. Reliability Engineering

  • Exponential: Component lifetimes
  • Weibull: Product failure times

5. Queuing Theory

  • Poisson: Customer arrivals
  • Exponential: Service times

6. Machine Learning

  • Bernoulli: Binary classification
  • Multinomial: Multi-class classification
  • Normal: Regression residuals

Part 16: Summary Table

Distribution   Type         Parameters   Mean       Variance             Use Case
Bernoulli      Discrete     p            p          p(1-p)               Single trial
Binomial       Discrete     n, p         np         np(1-p)              n trials, count successes
Poisson        Discrete     λ            λ          λ                    Events in time/space
Geometric      Discrete     p            1/p        (1-p)/p²             Trials until success
Uniform        Continuous   a, b         (a+b)/2    (b-a)²/12            Equal probability
Exponential    Continuous   λ            1/λ        1/λ²                 Waiting times
Normal         Continuous   μ, σ²        μ          σ²                   Natural phenomena
Gamma          Continuous   α, β         α/β        α/β²                 Sum of exponentials
Beta           Continuous   α, β         α/(α+β)    αβ/((α+β)²(α+β+1))   Proportions

The Bottom Line

Random Variables are functions that assign numbers to random outcomes. They come in two types:

  • Discrete: Countable values (PMF)
  • Continuous: Uncountable values (PDF)

Distributions describe how probabilities are spread over possible values:

  • PMF/PDF: Probability at each point
  • CDF: Cumulative probability
  • Expected Value: Long-run average
  • Variance: Spread around the mean

Key Theorems:

  • CLT: Averages approach normality
  • LLN: Averages converge to expected value

Understanding random variables and distributions is fundamental to statistics, probability, data science, machine learning, and countless real-world applications!

