
Discrete Distributions - Bernoulli, Binomial and Poisson Distribution

Topics:

https://www.youtube.com/watch?v=bmdsROmXgGI [Probability Distributions A-Z! | Normal vs. Standard Normal vs. Poisson vs. Binomial vs. Bernoulli - Prof. Ryan Ahmed]

https://www.youtube.com/watch?v=8vHKCrNGPhY  [Think more rationally with Bayes’ rule | Steven Pinker @ Big Think]

Normal Distribution

https://www.youtube.com/watch?v=3VYupIsbLlY  [Normal Distribution (PDF, CDF, PPF) in 3 Minutes @ 3-Minute Data Science]

PDF (Probability Density Function): Bell-shaped curve showing probability density; area under curve = probability over intervals.

CDF (Cumulative Distribution Function): P(X ≤ x); S-shaped curve showing cumulative probability from left.

PPF (Percent Point Function): Inverse of CDF; given probability p, returns value x where P(X ≤ x) = p. Used for finding percentiles/quantiles.
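A quick way to see PDF, CDF and PPF side by side is a minimal sketch with Python's scipy.stats (assuming SciPy is installed); the values shown are for the standard normal distribution (mean 0, standard deviation 1):

    from scipy.stats import norm

    # PDF: density at x = 0, the peak of the bell curve (about 0.3989)
    print(norm.pdf(0))

    # CDF: P(X <= 0) for the standard normal is exactly 0.5
    print(norm.cdf(0))

    # PPF: inverse of the CDF; the 97.5th percentile is about 1.96
    print(norm.ppf(0.975))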


see https://www.fourmilab.ch/rpkp/experiments/statistics.html

1. Random Variables
2. Distribution Functions
3. Discrete Distributions
4. Bernoulli Distribution
5. Binomial Distribution
6. Poisson Distribution
7. Variance of Discrete Distributions


Background

What is a probability?
Probability is a mathematical measure of how likely an event is to occur, expressed as a number between 0 (impossible) and 1 (certain), or as a percentage between 0% and 100%.

Probability is a way to quantify uncertainty by assigning numerical values between 0 and 1 to represent the likelihood of events occurring.

In mathematics, probability is the ratio of the number of favorable outcomes to the total number of possible outcomes in a sample space.

Probability = Number of favorable outcomes / Total number of possible outcomes

Or mathematically:

P(A) = n(A) / n(S)

Where:

  • P(A) = probability of event A
  • n(A) = number of favorable outcomes for event A
  • n(S) = total number of outcomes in the sample space S

What is an Experiment? Rolling a standard six-sided die.

What is an Event? Getting an even number.

What is a Trial? A single roll of the die.

What is an Outcome? A specific result of the experiment (e.g., rolling a 4).

What is a Sample Space? All possible outcomes of the experiment: {1, 2, 3, 4, 5, 6}
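As a minimal sketch of the ratio formula P(A) = n(A) / n(S), counting outcomes for the "even number" event in plain Python (only the standard library is assumed):

    from fractions import Fraction

    sample_space = {1, 2, 3, 4, 5, 6}                       # n(S) = 6
    event = {x for x in sample_space if x % 2 == 0}          # even numbers: {2, 4, 6}

    p_even = Fraction(len(event), len(sample_space))         # n(A) / n(S)
    print(p_even)                                            # 1/2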

What is conditional probability?

Conditional probability is the probability of event A occurring given that event B has already occurred, written as P(A|B) = P(A∩B)/P(B).

Conditional probability with dice: You roll a single die. What's the probability of rolling an even number given that you rolled greater than 3?

P(even|>3) = 2/3

Why? Numbers greater than 3 are {4, 5, 6}. Among these three outcomes, two are even {4, 6}. So the probability is 2/3, not 1/2 as it would be without the condition.

(Note: in this example we roll the die only once, not twice. The condition "greater than 3" restricts the sample space to {4, 5, 6}; among those three outcomes the even ones are 4 and 6, so the probability is 2/3.)
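The same counting idea verifies the conditional probability; a small sketch that restricts the sample space to outcomes greater than 3:

    from fractions import Fraction

    sample_space = {1, 2, 3, 4, 5, 6}
    given = {x for x in sample_space if x > 3}               # condition B: {4, 5, 6}
    both = {x for x in given if x % 2 == 0}                  # A and B: {4, 6}

    p_even_given_gt3 = Fraction(len(both), len(given))       # P(A|B) = P(A∩B)/P(B)
    print(p_even_given_gt3)                                  # 2/3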


What is the probability of getting a sum of 9 with two dice.

The probability of getting a sum of 9 with two dice is 4/36 = 1/9.

Here's why: There are 36 total possible outcomes (6×6) when rolling two dice.

The favorable outcomes that sum to 9 are:

  • (3,6)
  • (4,5)
  • (5,4)
  • (6,3)

That's 4 favorable outcomes out of 36 total, so P(sum=9) = 4/36 = 1/9 ≈ 0.111, or about 11.1%.
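A brute-force check of this count, enumerating all 36 ordered pairs in plain Python:

    from fractions import Fraction
    from itertools import product

    # All 36 ordered outcomes of rolling two dice
    outcomes = list(product(range(1, 7), repeat=2))
    favorable = [o for o in outcomes if sum(o) == 9]         # (3,6), (4,5), (5,4), (6,3)

    print(favorable)
    print(Fraction(len(favorable), len(outcomes)))           # 1/9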


1. Random Variables
[see https://milindai.blogspot.com/2025/10/random-variables-and-distributions.html]
  1. A random variable is a function that assigns numerical values to outcomes of a random experiment (like mapping "heads" → 1, "tails" → 0).

  2. Two types exist: Discrete (countable values like 0, 1, 2, 3...) and Continuous (any value in an interval, like height or time).

  3. Probability distributions describe them: PMF for discrete (exact probabilities), PDF for continuous (probability densities over intervals).

  4. Key characteristics: Expected Value E[X] = average/mean, Variance Var(X) = spread/variability around the mean.

  5. They model uncertainty: Allow us to mathematically analyze random phenomena (coin flips, customer arrivals, stock prices, test scores) and make predictions.

2. Distribution Functions
[see https://milindai.blogspot.com/2025/10/random-variables-and-distributions.html]

A distribution function describes how probability is spread across possible values of a random variable. The probability density function (PDF) gives the likelihood of specific values for continuous variables, while probability mass function (PMF) does this for discrete variables. The cumulative distribution function (CDF) shows the probability that a variable takes a value less than or equal to a given point.

Most common distributions:

  • Normal (Gaussian): Bell curve, describes many natural phenomena
  • Binomial: Number of successes in n trials
  • Poisson: Count of events in fixed time
  • Exponential: Time between events
  • Uniform: Equal probability across range
  • Beta: Probabilities and proportions

Concepts:

Coin Flips (illustrative figure omitted)

Sample Distribution (illustrative figure omitted)

Distribution functions describe how probabilities are spread across possible values of a random variable.

Three main types:

1. PMF (Probability Mass Function) - for discrete variables: Gives exact probability P(X = x) for each value. Example: P(die = 3) = 1/6.

2. PDF (Probability Density Function) - for continuous variables: Describes probability density; area under curve gives probability over intervals. P(exact value) = 0 for continuous variables.

3. CDF (Cumulative Distribution Function) - for both types: F(x) = P(X ≤ x), the probability that X is at most x. Always non-decreasing, ranges from 0 to 1.

Key properties: All probabilities must be non-negative and sum/integrate to 1.

Purpose: Distribution functions completely characterize a random variable's behavior, allowing us to calculate probabilities, expected values, and variances. They're essential for statistical inference, modeling real-world phenomena, and making data-driven decisions in science, engineering, finance, and machine learning.


3. Discrete Distributions

What are Discrete Distributions?

First, "discrete" just means countable, whole numbers - like 1, 2, 3 apples. Not 2.5 apples (you can't have half an apple in these problems). A distribution just shows how likely different outcomes are.


4. Bernoulli Distribution

The "Yes or No" situation

Imagine you're flipping a coin once. That's it. Just once. Heads = Success ✓, Tails = Failure ✗. Real life examples: Will it rain today? Did you pass the test? Did your favorite team win? The rule: Only TWO possible outcomes, and you only do it ONCE.

What Is It?

The Bernoulli Distribution is the simplest probability distribution. It models a single random experiment that has exactly two possible outcomes: Success (usually coded as 1) and Failure (usually coded as 0). Think of it as the fundamental building block of probability theory - the atomic unit of random events.

The Mathematical Definition

A random variable X follows a Bernoulli distribution if: X ~ Bernoulli(p), where p = probability of success (0 ≤ p ≤ 1) and 1-p (or q) = probability of failure.

The probability mass function (PMF) is:

P(X = 1) = p
P(X = 0) = 1 - p

Or written more formally: P(X = x) = p^x × (1-p)^(1-x) for x ∈ {0, 1}

Key Properties

Expected Value (Mean): E[X] = p. This makes intuitive sense. If you have a 70% chance of success (p = 0.7), on average, you expect 0.7 successes per trial.

Variance: Var(X) = p(1-p). The variance is highest when p = 0.5 (maximum uncertainty) and lowest when p is close to 0 or 1 (high certainty).

Standard Deviation: σ = √[p(1-p)]
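These properties are easy to verify numerically; a minimal sketch using scipy.stats.bernoulli (assuming SciPy is available), with p = 0.7 as in the intuition above:

    from scipy.stats import bernoulli

    p = 0.7
    X = bernoulli(p)

    print(X.pmf(1))    # P(X = 1) = p = 0.7
    print(X.pmf(0))    # P(X = 0) = 1 - p = 0.3
    print(X.mean())    # E[X] = p = 0.7
    print(X.var())     # Var(X) = p(1 - p) = 0.21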








Real-World Examples

Example 1 - Medical Testing: You take a COVID test. Success (X=1): Test is positive, p = 0.05 (5% infection rate in population). Failure (X=0): Test is negative, 1-p = 0.95.

Example 2 - Digital Marketing: A user clicks on your ad. Success (X=1): User clicks, p = 0.03 (3% click-through rate). Failure (X=0): User doesn't click, 1-p = 0.97.

Example 3 - Quality Control: You inspect one product. Success (X=1): Product is defective, p = 0.02. Failure (X=0): Product passes inspection, 1-p = 0.98.

Note: "Success" doesn't mean good - it just means the outcome we're tracking happened.

Important Characteristics

Independence (no memory): Each Bernoulli trial is independent. Past outcomes don't affect future outcomes. If you flip a coin and get heads 5 times in a row, the probability of heads on the 6th flip is STILL 0.5. The coin has no memory.

Binary Nature: There are only two outcomes. You can't have "partially successful" outcomes in a pure Bernoulli trial.

Foundation for Other Distributions: Binomial Distribution = Sum of n independent Bernoulli trials. Geometric Distribution = Number of Bernoulli trials until first success.

When to Use Bernoulli Distribution

Use it when: ✓ You have exactly ONE trial/experiment ✓ There are exactly TWO possible outcomes ✓ The probability of success is known (or can be estimated) ✓ You want to model a single yes/no event

Don't use it when: ✗ You have multiple trials (use Binomial instead) ✗ You have more than two outcomes (use Multinomial instead) ✗ The outcomes are continuous (use Normal, Exponential, etc.)

Connection to Other Concepts

If X₁, X₂, ..., Xₙ are independent Bernoulli(p) random variables, then Y = X₁ + X₂ + ... + Xₙ ~ Binomial(n, p). So a Binomial distribution is just the sum of multiple Bernoulli trials! In probability theory, Bernoulli random variables are often called indicator variables because they "indicate" whether an event occurred (1) or not (0).

Why It Matters

The Bernoulli distribution is fundamental because: (1) Simplicity - It's the simplest non-trivial probability distribution (2) Building Block - More complex distributions are built from it (3) Real Applications - Countless real-world scenarios are binary (pass/fail, yes/no, win/lose) (4) Statistical Inference - Forms the basis for hypothesis testing and A/B testing (5) Machine Learning - Binary classification problems use Bernoulli assumptions.

The Bottom Line: The Bernoulli Distribution is elegant in its simplicity: one trial, two outcomes, probability p. But don't let the simplicity fool you - it's the foundation upon which much of probability theory and statistics is built. Think of it as the "atom" of probability distributions - simple, fundamental, and essential for understanding everything else.


5. Binomial Distribution

The "How many times will I succeed?" situation

Now imagine you flip that coin 10 times instead of once. How many times will you get heads? Maybe 5 times? Maybe 7? Maybe 0? Real life examples: You take 20 free throws in basketball. How many will you make? You answer 50 true/false questions by guessing. How many will you get right? You text 15 friends. How many will reply within an hour?

The rules: You repeat the same thing multiple times (like flipping 10 times). Each try has the same chance of success. Each try is independent (one flip doesn't affect the next). The question it answers: "Out of N tries, how many times will I succeed?"

What Is It?

The Binomial Distribution models the number of successes in a fixed number of independent Bernoulli trials. Think of it this way: Instead of flipping a coin once (Bernoulli), you flip it 10 times and count how many heads you get. That count follows a Binomial Distribution.




The Mathematical Definition

A random variable X follows a Binomial distribution if: X ~ Binomial(n, p) or X ~ B(n, p), where n = number of trials (fixed, predetermined), p = probability of success on each trial (constant across all trials), and X = number of successes (can be 0, 1, 2, ..., n).

The probability mass function (PMF) is:

P(X = k) = C(n,k) × p^k × (1-p)^(n-k)

Where: C(n,k) = "n choose k" = n! / [k!(n-k)!] = number of ways to choose k successes from n trials, p^k = probability of k successes, and (1-p)^(n-k) = probability of (n-k) failures.

Breaking Down the Formula

Scenario: Flip a fair coin 3 times. What's P(exactly 2 heads)? Given n = 3 trials, p = 0.5 (fair coin), k = 2 successes (heads).

Step 1: How many ways can we get 2 heads in 3 flips? HHT, HTH, THH. That's C(3,2) = 3!/(2!×1!) = 3 ways.

Step 2: What's the probability of each specific sequence? P(HHT) = 0.5 × 0.5 × 0.5 = 0.125. Each sequence has the same probability: p^2 × (1-p)^1 = 0.5^2 × 0.5^1 = 0.125.

Step 3: Total probability P(X = 2) = 3 × 0.125 = 0.375 = 37.5%. This matches our formula: P(X = 2) = C(3,2) × 0.5^2 × 0.5^1 = 3 × 0.25 × 0.5 = 0.375 ✓
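The same answer drops out of scipy.stats.binom (SciPy assumed); a one-line check of the 3-flip example:

    from scipy.stats import binom

    # P(exactly 2 heads in 3 flips of a fair coin)
    print(binom.pmf(k=2, n=3, p=0.5))   # 0.375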

The Four Required Conditions (BINS)

For a situation to follow a Binomial Distribution, it MUST satisfy these four conditions: (1) Binary outcomes - Each trial has exactly two outcomes: success or failure. (2) Independent trials - The outcome of one trial doesn't affect another. (3) Number of trials is fixed - You decide beforehand how many times you'll repeat the experiment. (4) Same probability - The probability of success (p) stays constant across all trials.

Example that VIOLATES independence: Drawing cards from a deck WITHOUT replacement - after you draw one card, the probabilities change for the next draw. This is NOT binomial (it's actually hypergeometric). Example that SATISFIES all conditions: Flipping a coin 20 times - each flip is independent, fixed number of trials, same probability each time.

Key Properties

Expected Value (Mean): E[X] = np. This is intuitive! If you flip a coin 100 times (n=100) with p=0.5, you expect 50 heads. Example: Roll a die 60 times, count how many times you get a 6. n = 60, p = 1/6, E[X] = 60 × (1/6) = 10 sixes expected.

Variance: Var(X) = np(1-p). The variance is highest when p = 0.5 and decreases as p approaches 0 or 1.

Standard Deviation: σ = √[np(1-p)]. This tells you how much the number of successes typically varies from the mean.

Mode: The most likely number of successes is around np (the mean).

Real-World Examples

Example 1 - Quality Control: A factory produces light bulbs with a 2% defect rate. You randomly select 100 bulbs. Question: What's the probability that exactly 3 are defective? n = 100 bulbs, p = 0.02 (defect rate), k = 3 defects. P(X = 3) = C(100,3) × 0.02^3 × 0.98^97 = 161,700 × 0.000008 × 0.141 ≈ 0.182 = 18.2%. Expected defects: E[X] = 100 × 0.02 = 2 defective bulbs.

Example 2 - Medical Testing: A drug has a 70% cure rate. You treat 20 patients. Question: What's the probability that at least 15 are cured? n = 20 patients, p = 0.70. We want P(X ≥ 15) = P(X=15) + P(X=16) + P(X=17) + P(X=18) + P(X=19) + P(X=20). You'd need to calculate each term and sum them (or use software/tables). Expected cures: E[X] = 20 × 0.70 = 14 patients.
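For Example 2, the "at least 15" sum is exactly the kind of thing software handles; a sketch with scipy.stats.binom (SciPy assumed), where the survival function sf(k) gives P(X > k), so P(X ≥ 15) = sf(14):

    from scipy.stats import binom

    n, p = 20, 0.70
    # P(X >= 15) = P(X > 14) = 1 - P(X <= 14)
    print(binom.sf(14, n, p))          # roughly 0.42
    print(1 - binom.cdf(14, n, p))     # same value, via the complement rule
    print(binom.mean(n, p))            # expected cures: 14.0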

Example 3 - Basketball Free Throws: You're an 80% free throw shooter. You take 10 shots. Question: What's the probability you make exactly 8? n = 10 shots, p = 0.80, k = 8 makes. P(X = 8) = C(10,8) × 0.80^8 × 0.20^2 = 45 × 0.1678 × 0.04 ≈ 0.302 = 30.2%.

Example 4 - Survey Response: You email 50 customers. Each has a 15% chance of responding. Expected responses: E[X] = 50 × 0.15 = 7.5 ≈ 7-8 responses. Standard deviation: σ = √[50 × 0.15 × 0.85] = √6.375 ≈ 2.52. So you'd typically get 7-8 responses, give or take about 2-3.

The Shape of the Distribution

The binomial distribution's shape depends on n and p: When p = 0.5 (symmetric): The distribution is perfectly symmetric and bell-shaped, especially for large n. When p < 0.5 (right-skewed): More probability mass on the left (fewer successes more likely). When p > 0.5 (left-skewed): More probability mass on the right (more successes more likely). As n increases: The distribution becomes more bell-shaped and approaches a Normal distribution (this is the Central Limit Theorem in action!).

Calculating Probabilities

Exactly k successes: P(X = k) = C(n,k) × p^k × (1-p)^(n-k)

At most k successes: P(X ≤ k) = Σ P(X = i) for i = 0 to k

At least k successes: P(X ≥ k) = 1 - P(X ≤ k-1)

Between a and b successes: P(a ≤ X ≤ b) = Σ P(X = i) for i = a to b

Practical Calculation Example

Scenario: You're taking a 20-question multiple choice test. Each question has 4 options. You guess randomly on all questions. What's the probability you pass (get at least 12 correct)? Given: n = 20 questions, p = 0.25 (1 out of 4 chance), We want P(X ≥ 12). Method 1: Calculate directly (tedious) P(X ≥ 12) = P(X=12) + P(X=13) + ... + P(X=20). Method 2: Use complement rule (slightly easier) P(X ≥ 12) = 1 - P(X ≤ 11). Using software or calculator: P(X ≥ 12) ≈ 0.0009 = 0.09%. Interpretation: You have less than 0.1% chance of passing by pure guessing. Don't rely on luck! Expected score: E[X] = 20 × 0.25 = 5 correct answers.
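The same complement-rule calculation, as a minimal sketch with scipy.stats.binom (SciPy assumed):

    from scipy.stats import binom

    n, p = 20, 0.25
    # P(X >= 12) = 1 - P(X <= 11)
    print(1 - binom.cdf(11, n, p))   # about 0.0009
    print(binom.mean(n, p))          # expected score: 5.0 correct answers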

Connection to Normal Distribution

When n is large and p is not too close to 0 or 1, the Binomial Distribution can be approximated by a Normal Distribution. Rule of thumb: Use Normal approximation when np ≥ 10 and n(1-p) ≥ 10. X ~ B(n, p) ≈ N(μ, σ²), where μ = np and σ² = np(1-p). Example: Flip a coin 100 times. X ~ B(100, 0.5). μ = 100 × 0.5 = 50, σ² = 100 × 0.5 × 0.5 = 25, σ = 5. So X ≈ N(50, 25), meaning you'll typically get between 45-55 heads (within 1 standard deviation).
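A quick numerical comparison of the exact binomial and its normal approximation for the 100-flip example (SciPy assumed; the ±0.5 in the normal CDF calls is a continuity correction):

    from scipy.stats import binom, norm

    n, p = 100, 0.5
    mu, sigma = n * p, (n * p * (1 - p)) ** 0.5   # 50 and 5

    # Exact: P(45 <= X <= 55) under Binomial(100, 0.5)
    exact = binom.cdf(55, n, p) - binom.cdf(44, n, p)

    # Normal approximation with continuity correction
    approx = norm.cdf(55.5, mu, sigma) - norm.cdf(44.5, mu, sigma)

    print(exact, approx)   # the two values agree closely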

Common Pitfalls & Misconceptions

❌ Mistake 1: Confusing n and X - n = total number of trials (fixed), X = number of successes (random variable). ❌ Mistake 2: Not checking independence - If outcomes affect each other, it's NOT binomial! ❌ Mistake 3: Changing probability - If p changes between trials, it's NOT binomial! ❌ Mistake 4: Forgetting the combination term - The C(n,k) matters! Order doesn't matter in binomial, but counting arrangements does.

When to Use Binomial Distribution

✓ Use it when: Fixed number of identical trials, Each trial is independent, Binary outcomes (success/failure), Constant probability of success, You want to know: "How many successes?"

✗ Don't use it when: Trials aren't independent (use Hypergeometric), Probability changes (use different model), Continuous outcomes (use Normal, Exponential, etc.), Counting events over time (use Poisson).

The Bottom Line: The Binomial Distribution is your go-to tool when you're repeating the same yes/no experiment multiple times and counting successes. It's incredibly practical - from quality control to medical trials to sports statistics. Key takeaway: Binomial = "Repeat a Bernoulli trial n times and count how many successes you get." Master this distribution, and you'll have a powerful tool for analyzing countless real-world situations!


6. Poisson Distribution

The "How many random things will happen?" situation

This one's trickier! Imagine you're sitting in a park for one hour, counting how many dogs walk by. You don't know exactly when dogs will come, but you know on average, maybe 5 dogs pass by each hour. Real life examples: How many text messages will you get in the next hour? How many cars will pass your house in 10 minutes? How many goals will be scored in a soccer game? How many customers walk into a store each hour?

The key difference from Binomial: You're NOT doing "10 tries" - events just happen randomly over time. You only know the average rate (like "5 dogs per hour"). The number could be 0, or 3, or 10, or even 20! The question it answers: "How many random events will happen in a specific time period?"

What Is It?

The Poisson Distribution models the number of events occurring in a fixed interval of time or space when these events happen at a known average rate and are independent of each other. Key difference from Binomial: Binomial: You decide to flip 10 coins → "How many heads?" Poisson: Cars randomly pass by → "How many cars in the next hour?" With Binomial, you control n (number of trials). With Poisson, events just happen randomly, and you're counting them.

In the Poisson distribution, lambda (λ) is simply the average number of times an event happens in a specific period of time or space.

Think of it as the "expected" or "normal" number of occurrences.


The Text Message Analogy 📱

Imagine you're counting the text messages you get.

  • The Interval: You decide to track texts for a specific interval, like one hour.

  • The Average: You look at your phone over many hours and find that, on average, you receive 4 texts per hour.

In this scenario, lambda (λ) = 4.

The Poisson distribution then helps you figure out the probability of getting a different number of texts in any given hour. For example, what's the chance you get exactly 2 texts? Or 7 texts? Or even 0 texts? Lambda is the key piece of information you need to answer those questions.


How Lambda Changes the Graph

A plot of the Poisson PMF for several values of lambda shows how changing the average affects the probabilities.

  • Red Line (λ = 1): Imagine a very slow hour where you only expect 1 text on average. The graph is bunched up on the left, showing you'll most likely get 0 or 1 text, and getting 4 is almost impossible.

  • Green Line (λ = 4): This is our example. The peak of the graph is right at 4, meaning 4 texts is the most likely outcome.

  • Blue Line (λ = 10): Now imagine your phone is blowing up, and you expect 10 texts on average. The whole graph shifts to the right. The peak is now at 10, and getting a low number like 2 is very unlikely.


Key Takeaway

So, lambda (λ) is just the average rate of events that you use as a starting point to calculate the probability of seeing other outcomes.

The Mathematical Definition

A random variable X follows a Poisson distribution if: X ~ Poisson(λ) or X ~ Pois(λ), where λ (lambda) = average rate of events per interval (λ > 0) and X = actual number of events that occur (X = 0, 1, 2, 3, ...).

The probability mass function (PMF) is:

P(X = k) = (λ^k × e^(-λ)) / k!

Where: k = number of events (0, 1, 2, 3, ...), e ≈ 2.71828 (Euler's number), and k! = k factorial = k × (k-1) × (k-2) × ... × 1.

Breaking Down the Formula

Scenario: On average, 3 customers enter a store per hour (λ = 3). What's the probability that exactly 5 customers enter in the next hour? P(X = 5) = (3^5 × e^(-3)) / 5! = (243 × 0.0498) / 120 = 12.10 / 120 = 0.101 = 10.1%. What each part means: 3^5 = likelihood of 5 events when average is 3, e^(-3) = normalization factor (ensures all probabilities sum to 1), 5! = accounts for the different orderings of events.
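The same number falls out of scipy.stats.poisson (SciPy assumed), along with the mean-equals-variance property discussed below:

    from scipy.stats import poisson

    lam = 3                                       # average customers per hour
    print(poisson.pmf(5, lam))                    # P(X = 5) ≈ 0.101
    print(poisson.mean(lam), poisson.var(lam))    # both equal λ = 3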

The Three Required Conditions

For a situation to follow a Poisson Distribution, it MUST satisfy these conditions: (1) Events occur independently - One event happening doesn't affect the probability of another event. Example: Customers entering a store are (mostly) independent of each other. Counter-example: People entering a movie theater - they often come in groups, violating independence. (2) Events occur at a constant average rate - The rate (λ) doesn't change over the time period you're observing. Example: Emails arriving at a constant average of 5 per hour throughout the day. Counter-example: Website traffic that spikes during lunch hours - rate changes, so not strictly Poisson. (3) Two events cannot occur at exactly the same instant - Events happen one at a time (technically, the probability of 2+ events in an infinitesimally small time interval approaches 0). Example: Phone calls arriving at a call center - each call starts at a distinct moment.

Key Properties

Expected Value (Mean): E[X] = λ. The mean equals the rate parameter. If λ = 4 events per hour, you expect 4 events on average.

Variance: Var(X) = λ. Unique property! The variance also equals λ. This is a special characteristic of Poisson.

Standard Deviation: σ = √λ

Mode: The most likely value is around λ (specifically, it's ⌊λ⌋ if λ is not an integer).

Mean = Variance: E[X] = Var(X) = λ. This is the signature property of Poisson distribution. If you observe data where mean ≈ variance, Poisson might be a good model!

Real-World Examples

Example 1 - Call Center: A call center receives an average of 8 calls per hour. Question: What's the probability they receive exactly 10 calls in the next hour? λ = 8 calls/hour, k = 10 calls. P(X = 10) = (8^10 × e^(-8)) / 10! = (1,073,741,824 × 0.000335) / 3,628,800 ≈ 0.099 = 9.9%. Question: What's the probability they receive 0 calls? P(X = 0) = (8^0 × e^(-8)) / 0! = (1 × 0.000335) / 1 = 0.000335 = 0.0335%. Very unlikely! With λ = 8, getting zero calls is rare.

Example 2 - Website Traffic: Your website gets an average of 200 visitors per hour. Question: What's the probability of getting more than 220 visitors in the next hour? λ = 200 visitors/hour. We want P(X > 220). This would require summing many terms or using software. Using Normal approximation (since λ is large): μ = 200, σ = √200 ≈ 14.14. Using z-score: z = (220 - 200) / 14.14 ≈ 1.41. P(X > 220) ≈ P(Z > 1.41) ≈ 0.079 = 7.9%.

Example 3 - Emergency Room: An ER sees an average of 2.5 trauma cases per night. Question: What's the probability they see exactly 0 trauma cases tonight? P(X = 0) = (2.5^0 × e^(-2.5)) / 0! = e^(-2.5) ≈ 0.082 = 8.2%. Question: What's the probability they see 5 or more cases? P(X ≥ 5) = 1 - P(X ≤ 4) = 1 - [P(X=0) + P(X=1) + P(X=2) + P(X=3) + P(X=4)]. You'd calculate each term and subtract from 1. Expected cases per night: E[X] = 2.5 cases.

Example 4 - Radioactive Decay: A radioactive sample emits an average of 50 particles per minute. Question: What's the probability of detecting exactly 55 particles in the next minute? λ = 50 particles/minute, k = 55. P(X = 55) = (50^55 × e^(-50)) / 55! ≈ 0.042 = 4.2%. This is a classic Poisson scenario - radioactive decay events are independent and random.

Example 5 - Typos in a Book: A book has an average of 1.2 typos per page. Question: What's the probability a random page has no typos? P(X = 0) = (1.2^0 × e^(-1.2)) / 0! = e^(-1.2) ≈ 0.301 = 30.1%. Question: What's the probability a page has 3 or more typos? P(X ≥ 3) = 1 - P(X ≤ 2) = 1 - [P(X=0) + P(X=1) + P(X=2)] = 1 - [0.301 + 0.361 + 0.217] = 1 - 0.879 = 0.121 = 12.1%.

The Shape of the Distribution

The Poisson distribution's shape depends on λ: When λ is small (λ < 1): Strongly right-skewed, P(X = 0) is the highest probability, Most values are 0 or 1. When λ is moderate (1 ≤ λ ≤ 10): Still right-skewed but less extreme, More spread out, Peak moves to the right. When λ is large (λ > 10): Approaches symmetry, Becomes more bell-shaped, Starts to look like a Normal distribution. Rule of thumb: When λ ≥ 10, you can approximate Poisson with Normal distribution: X ~ N(λ, λ).

Calculating Probabilities

Exactly k events: P(X = k) = (λ^k × e^(-λ)) / k!

At most k events: P(X ≤ k) = Σ P(X = i) for i = 0 to k

At least k events: P(X ≥ k) = 1 - P(X ≤ k-1)

Between a and b events: P(a ≤ X ≤ b) = Σ P(X = i) for i = a to b

Time Scaling Property

One of the coolest features of Poisson is how it scales with time! If λ events occur per hour, then: λ/2 events occur per 30 minutes, 2λ events occur per 2 hours, λt events occur per t hours. Example: If you receive 12 emails per hour on average: Per 30 minutes: λ = 6 emails, Per 15 minutes: λ = 3 emails, Per 2 hours: λ = 24 emails. Practical calculation: If 5 customers arrive per hour, what's P(exactly 2 arrive in 20 minutes)? 20 minutes = 1/3 hour, λ = 5 × (1/3) = 5/3 ≈ 1.67 customers per 20 min. P(X = 2) = (1.67^2 × e^(-1.67)) / 2! = (2.79 × 0.188) / 2 ≈ 0.262 = 26.2%.
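A sketch of the time-scaling calculation (the rate is simply rescaled before calling the PMF; SciPy assumed):

    from scipy.stats import poisson

    rate_per_hour = 5
    lam_20_min = rate_per_hour * (20 / 60)        # 5/3 ≈ 1.67 customers per 20 minutes

    print(poisson.pmf(2, lam_20_min))             # P(exactly 2 in 20 minutes) ≈ 0.26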

Poisson as Limit of Binomial

Here's a deep connection: Poisson is actually a special case of Binomial! When n → ∞ (very large number of trials), p → 0 (very small probability per trial), and np = λ (constant), then Binomial(n, p) → Poisson(λ). Intuitive explanation: Imagine dividing an hour into n tiny intervals. In each interval, there's a tiny probability p of an event. As n → ∞ and p → 0 (while keeping np = λ constant), you get Poisson! Example: 1 million lottery tickets sold (n = 1,000,000), Probability of any single ticket winning = 0.000005 (p = 0.000005), Expected winners = np = 5. The exact distribution of winners is Binomial(1,000,000, 0.000005), but this is practically identical to Poisson(5) and much easier to calculate!
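A numerical illustration of the limit for the lottery example (SciPy assumed): the exact Binomial PMF and the Poisson PMF are practically indistinguishable when n is huge and p is tiny.

    from scipy.stats import binom, poisson

    n, p = 1_000_000, 0.000005   # np = 5
    lam = n * p

    for k in range(10):
        print(k, binom.pmf(k, n, p), poisson.pmf(k, lam))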

Practical Calculation Example

Scenario: A busy intersection sees an average of 4 accidents per year. What's the probability of: a) No accidents this year? b) Exactly 3 accidents? c) More than 5 accidents? Given: λ = 4 accidents/year. a) P(X = 0): P(X = 0) = (4^0 × e^(-4)) / 0! = e^(-4) ≈ 0.0183 = 1.83%. b) P(X = 3): P(X = 3) = (4^3 × e^(-4)) / 3! = (64 × 0.0183) / 6 ≈ 0.195 = 19.5%. c) P(X > 5): P(X > 5) = 1 - P(X ≤ 5) = 1 - [P(0) + P(1) + P(2) + P(3) + P(4) + P(5)]. Calculating each: P(0) = 0.0183, P(1) = 0.0733, P(2) = 0.1465, P(3) = 0.1954, P(4) = 0.1954, P(5) = 0.1563. Sum = 0.7852. P(X > 5) = 1 - 0.7852 = 0.2148 = 21.48%.
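The same three answers, as a minimal sketch with scipy.stats.poisson (SciPy assumed):

    from scipy.stats import poisson

    lam = 4                              # average accidents per year
    print(poisson.pmf(0, lam))           # a) P(X = 0) ≈ 0.018
    print(poisson.pmf(3, lam))           # b) P(X = 3) ≈ 0.195
    print(1 - poisson.cdf(5, lam))       # c) P(X > 5) ≈ 0.215
    print(poisson.sf(5, lam))            # same as (c), via the survival function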

Common Applications

Telecommunications: Number of phone calls per hour, Network packet arrivals, Server requests. Healthcare: Number of patients arriving at ER, Disease outbreaks in a region, Birth defects per 1000 births. Traffic & Transportation: Cars passing a point per minute, Accidents per year at an intersection, Flight delays per day. Natural Sciences: Radioactive decay events, Meteor strikes, Earthquakes in a region. Business & Finance: Customer arrivals at a store, Defects in manufacturing, Insurance claims per month. Sports: Goals scored in soccer, No-hitters in baseball season, Injuries per team per season. Digital World: Website crashes per month, Spam emails per day, Bugs found in software testing.

Common Pitfalls & Misconceptions

❌ Mistake 1: Assuming independence when it doesn't exist - If events come in clusters (like customers in groups), Poisson isn't appropriate. ❌ Mistake 2: Using when rate changes - If λ varies over time (rush hour vs. night), you need a different model. ❌ Mistake 3: Confusing rate with count - λ = average rate (parameter), X = actual count (random variable). ❌ Mistake 4: Wrong time units - Make sure your λ and time interval match! Wrong: λ = 10/hour, asking about 15 minutes without adjusting. Right: λ = 10 × (15/60) = 2.5 for 15 minutes. ❌ Mistake 5: Forgetting mean = variance - If your data has variance >> mean, it's probably not Poisson (might be overdispersed).

Poisson vs. Binomial: When to Use Which?

Feature | Binomial | Poisson
Number of trials | Fixed (n) | Not fixed / potentially unlimited
You decide | How many tries | Just count what happens
Best for | "n attempts, count successes" | "Random events over time/space"
Example | 100 coin flips | Cars passing in an hour
Probability per trial | Constant (p) | Not applicable
When events are | Controlled by you | Random, spontaneous

Conversion: When n is large and p is small, Binomial ≈ Poisson with λ = np

Advanced: Sum of Poisson Random Variables

If X₁ ~ Poisson(λ₁) and X₂ ~ Poisson(λ₂) are independent, then X₁ + X₂ ~ Poisson(λ₁ + λ₂). Example: Store A: 5 customers/hour, Store B: 8 customers/hour, Total in both stores: 13 customers/hour. This additivity property makes Poisson very convenient for combining events from multiple sources!

The Bottom Line: The Poisson Distribution is your go-to tool when events occur randomly and independently over time or space at a known average rate. It's elegant, powerful, and appears everywhere in the real world. Key differences from what you've learned: Bernoulli: One trial, yes/no. Binomial: n trials, count successes. Poisson: Random events happening, count how many. Key insight: Poisson models rare events - things that don't happen very often in each tiny moment, but over time accumulate. That's why λ can be interpreted as np when n → ∞ and p → 0. Remember: When mean ≈ variance in your data, think Poisson! Master this distribution, and you'll have a powerful tool for modeling countless real-world phenomena from website traffic to radioactive decay to customer arrivals!


Quick Comparison Table

Distribution | Question | Example
Bernoulli | Will it happen? (once) | Did I make this one free throw?
Binomial | How many successes in N tries? | I shot 10 free throws - how many went in?
Poisson | How many random events in a time period? | How many shooting stars will I see in one hour?

The main difference: Bernoulli/Binomial = You control how many tries (you decide to flip 10 times). Poisson = Events happen randomly; you just count them (dogs walk by whenever they want).

7. Variance of Discrete Distributions

Variance measures how spread out discrete random variable values are around their mean. It is calculated as Var(X) = Σ(x-μ)²·P(X=x) where μ is the expected value, or equivalently Var(X) = E[X²] - (E[X])² using the computational formula. For common distributions: Bernoulli has Var(X) = p(1-p), Binomial has Var(X) = np(1-p), Poisson uniquely has Var(X) = λ (equal to its mean), and Geometric has Var(X) = (1-p)/p². Variance is always non-negative and measured in squared units. Standard deviation σ = √Var(X) provides interpretable spread in the original units. Higher variance indicates greater variability and uncertainty. Understanding variance is crucial for risk assessment, quality control, prediction intervals, and statistical inference in discrete probability models.
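A short sketch of both variance formulas for a fair six-sided die, plus a check of the closed forms quoted above (SciPy assumed for the last two lines):

    from scipy.stats import binom, poisson

    # Fair six-sided die: values 1..6, each with probability 1/6
    values = range(1, 7)
    probs = [1 / 6] * 6

    mu = sum(x * p for x, p in zip(values, probs))                    # E[X] = 3.5
    var_def = sum((x - mu) ** 2 * p for x, p in zip(values, probs))   # Σ(x-μ)²·P(x)
    ex2 = sum(x ** 2 * p for x, p in zip(values, probs))
    var_comp = ex2 - mu ** 2                                          # E[X²] - (E[X])²
    print(var_def, var_comp)                                          # both ≈ 2.9167

    # Closed forms: Binomial np(1-p) and Poisson λ
    print(binom.var(10, 0.3), 10 * 0.3 * 0.7)    # 2.1 and 2.1
    print(poisson.var(4.0), 4.0)                 # 4.0 and 4.0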
