
Bayes' Theorem Interview Questions

Interview questions about Bayes' theorem and related topics:

Questions

1. What is the correct formula for Bayes' theorem?

A) P(A|B) = P(A) × P(B)

B) P(A|B) = P(B|A) × P(A) / P(B)

C) P(A|B) = P(A) + P(B) - P(A∩B)

D) P(A|B) = P(A∩B) / P(A)

E) P(A|B) = P(B) / P(A|B)

2. A medical test for a disease is 95% accurate (both sensitivity and specificity). If 1% of the population has the disease and someone tests positive, what is the probability they actually have the disease?

A) 95%

B) 50%

C) 16.1%

D) 5%

E) 99%

3. Which of the following best describes the prior probability in Bayesian inference?

A) The probability calculated after observing evidence

B) The probability of the evidence occurring

C) The initial belief about an event before observing new evidence

D) The likelihood function

E) The marginal probability

4. What is the relationship between conditional probability and joint probability?

A) P(A∩B) = P(A) + P(B)

B) P(A∩B) = P(A|B) × P(B)

C) P(A∩B) = P(A) / P(B)

D) P(A∩B) = P(A) - P(B)

E) P(A∩B) = P(A|B) + P(B|A)

5. In Bayesian statistics, what role does the likelihood function serve?

A) It represents the prior distribution

B) It normalizes the posterior distribution

C) It quantifies how probable the observed data is given different parameter values

D) It represents the posterior distribution

E) It calculates the marginal probability

6. Two events A and B are independent. If P(A) = 0.3 and P(B) = 0.4, what is P(A|B)?

A) 0.12

B) 0.3

C) 0.4

D) 0.7

E) 0.75

7. What is the law of total probability?

A) P(A) = P(A|B) × P(B)

B) P(A) = Σ P(A|Bi) × P(Bi) for all partitions Bi

C) P(A) = P(B|A) × P(A) / P(B)

D) P(A) = 1 - P(A')

E) P(A) = P(A∩B) + P(A∩B')

8. In a Naive Bayes classifier, what key assumption is made about the features?

A) They follow a normal distribution

B) They are conditionally independent given the class

C) They are linearly separable

D) They have equal variance

E) They are uniformly distributed

9. A bag contains 3 red balls and 2 blue balls. You draw 2 balls without replacement. If the first ball is red, what's the probability the second is also red?

A) 3/5

B) 1/2

C) 2/5

D) 3/4

E) 1/4

10. What distinguishes Maximum A Posteriori (MAP) estimation from Maximum Likelihood Estimation (MLE)?

A) MAP uses more data than MLE

B) MAP incorporates prior information while MLE does not

C) MAP is always more accurate than MLE

D) MAP only works with normal distributions

E) MAP is computationally simpler than MLE

11. What is the "Bayesian Trap" in decision-making?

A) Always choosing the most likely outcome

B) Ignoring prior probabilities

C) Over-relying on initial beliefs and not updating sufficiently with new evidence

D) Using frequentist methods instead of Bayesian

E) Calculating probabilities incorrectly

12. In the Monty Hall problem, after a door with a goat is revealed, what is the probability of winning if you switch doors?

A) 1/3

B) 1/2

C) 2/3

D) 3/4

E) 1

13. What distinguishes the Bayesian interpretation from the frequentist interpretation of probability?

A) Bayesian probability represents degrees of belief, frequentist represents long-run frequencies

B) Bayesian uses more data

C) Frequentist is always more accurate

D) Bayesian only works with normal distributions

E) There is no difference

14. What is a conjugate prior in Bayesian statistics?

A) A prior that equals the posterior

B) A prior that, when combined with the likelihood, yields a posterior in the same family

C) A uniformly distributed prior

D) A prior with zero variance

E) A prior that contradicts the data

15. If P(A) = 0.4, P(B) = 0.3, and P(A∩B) = 0.12, are events A and B independent?

A) Yes, because P(A∩B) = P(A) × P(B)

B) No, because P(A∩B) ≠ P(A) + P(B)

C) Cannot determine without P(A|B)

D) No, they are mutually exclusive

E) Yes, because P(A∩B) > 0

16. What is the base rate fallacy?

A) Using the wrong formula for Bayes' theorem

B) Ignoring or underweighting the prior probability when making judgments

C) Assuming all events are equally likely

D) Using frequentist methods

E) Overestimating rare events

17. In Bayesian updating, what happens to the posterior as you collect more data?

A) It always increases

B) It becomes less influenced by the prior and more by the likelihood

C) It converges to zero

D) It becomes uniform

E) It remains constant

18. What is the marginal likelihood P(B) in Bayes' theorem also known as?

A) Prior probability

B) Posterior probability

C) Evidence or normalization constant

D) Likelihood function

E) Conditional probability

19. A spam filter uses Bayes' theorem. If P(spam) = 0.4, P(word|spam) = 0.8, P(word|not spam) = 0.1, what is P(spam|word)?

A) 0.32

B) 0.84

C) 0.5

D) 0.8

E) 0.4

20. What is a flat or uniform prior?

A) A prior that assigns equal probability to all possible values

B) A prior with zero variance

C) A prior that is normally distributed

D) A prior that equals zero everywhere

E) A prior that increases linearly

21. In the context of hypothesis testing, what does the Bayes factor represent?

A) The prior odds

B) The ratio of posterior probabilities

C) The ratio of the likelihood of data under two competing hypotheses

D) The p-value

E) The confidence interval

22. What is empirical Bayes?

A) Using experimental data only

B) Estimating prior parameters from the data itself

C) Rejecting Bayesian methods

D) Using uniform priors always

E) Calculating posteriors experimentally

23. If a rare disease affects 0.1% of the population and a test is 99% accurate (both sensitivity and specificity), what percentage of positive tests are false positives?

A) 1%

B) 9%

C) 50%

D) 91%

E) 99%

24. What is the prosecutor's fallacy?

A) Confusing P(evidence|innocent) with P(innocent|evidence)

B) Always convicting the defendant

C) Using Bayes' theorem in court

D) Ignoring evidence

E) Assuming guilt without evidence

25. In Bayesian networks, what does d-separation determine?

A) The prior distribution

B) Conditional independence between nodes

C) The likelihood function

D) The marginal probability

E) The posterior distribution

26. What is the Jeffreys prior?

A) A prior based on personal belief

B) A non-informative prior that is invariant under reparameterization

C) A uniform prior

D) A normal prior

E) A prior that maximizes entropy

27. How does Bayesian model comparison differ from frequentist hypothesis testing?

A) It uses p-values

B) It computes posterior probabilities for different models

C) It always rejects the null hypothesis

D) It doesn't use data

E) It only works with two models

28. What is the difference between P(A|B) and P(B|A)?

A) They are always equal

B) P(A|B) is the probability of A given B, while P(B|A) is probability of B given A

C) One is frequentist, the other Bayesian

D) P(A|B) is always larger

E) There is no difference

29. In a Bayesian framework, what is a credible interval?

A) The same as a confidence interval

B) An interval containing a specified probability mass of the posterior distribution

C) The range of the prior

D) The likelihood function's domain

E) Always 95% of the data

30. What is the main criticism of using improper priors?

A) They are too informative

B) They may lead to improper posteriors

C) They are always uniform

D) They require too much computation

E) They ignore the data

31. How does Occam's Razor relate to Bayesian inference?

A) It doesn't relate at all

B) Simpler models automatically get higher posterior probability through the marginal likelihood

C) Complex models are always preferred

D) It only applies to frequentist statistics

E) It requires uniform priors

32. What is Cromwell's Rule in Bayesian statistics?

A) Always use uniform priors

B) Avoid using prior probabilities of 0 or 1

C) Update beliefs daily

D) Ignore unlikely events

E) Use frequentist methods for small samples

33. In sequential Bayesian updating, the posterior from one update becomes:

A) Discarded

B) The prior for the next update

C) The likelihood

D) The evidence

E) Marginalized out

34. What is the relationship between Bayes' theorem and machine learning?

A) They are unrelated

B) Bayes' theorem underlies many ML algorithms like Naive Bayes and Bayesian networks

C) ML replaces Bayes' theorem

D) Bayes' theorem only works for small data

E) ML is always frequentist

35. A factory has 3 machines producing widgets. Machine A (40% of production) has 2% defect rate, Machine B (35% of production) has 3% defect rate, Machine C (25% of production) has 5% defect rate. If a widget is defective, what's the probability it came from Machine A?

A) 0.4

B) 0.26

C) 0.02

D) 0.35

E) 0.08

36. What is the Dutch Book argument in Bayesian probability?

A) A gambling strategy

B) A coherence argument showing that violating probability laws leads to sure losses

C) A method for calculating priors

D) A frequentist critique

E) A computational algorithm

37. What advantage does Bayesian A/B testing have over frequentist methods?

A) It's always faster

B) It can provide probability statements about which variant is better

C) It doesn't require data

D) It always chooses variant A

E) It ignores sample size

38. What is the Jeffreys-Lindley paradox?

A) Priors don't matter

B) A situation where Bayesian and frequentist methods can give contradictory results with large samples

C) Posteriors equal priors

D) Evidence is always ignored

E) All hypotheses are equally likely

39. How does the Bayesian interpretation handle the concept of "probability of a hypothesis"?

A) It's undefined

B) It assigns degrees of belief to hypotheses

C) It only uses 0 or 1

D) It requires infinite data

E) It's the same as p-values

40. What is the principle of insufficient reason (principle of indifference)?

A) Never update beliefs

B) Assign equal probabilities when no information distinguishes outcomes

C) Always reject the null hypothesis

D) Use zero probability for unknowns

E) Ignore prior information


Step-by-Step Answers

1. Answer: B Bayes' theorem states: P(A|B) = P(B|A) × P(A) / P(B). This formula allows us to reverse conditional probabilities: it relates P(A|B) to P(B|A), incorporating the prior probability P(A) and the marginal probability P(B).

2. Answer: C Let D = has disease, + = tests positive

  • P(D) = 0.01 (prior probability)
  • P(+|D) = 0.95 (sensitivity)
  • P(-|¬D) = 0.95 (specificity), so P(+|¬D) = 0.05
  • P(+) = P(+|D)×P(D) + P(+|¬D)×P(¬D) = 0.95×0.01 + 0.05×0.99 = 0.0095 + 0.0495 = 0.059
  • P(D|+) = P(+|D)×P(D)/P(+) = (0.95×0.01)/0.059 = 0.0095/0.059 ≈ 0.161 or 16.1%
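
A minimal Python sketch of the same calculation (the variable names are just illustrative):

  # Posterior probability of disease given a positive test (question 2)
  prior = 0.01          # P(D): 1% of the population has the disease
  sensitivity = 0.95    # P(+|D)
  specificity = 0.95    # P(-|no D)

  p_false_positive = 1 - specificity                        # P(+|no D) = 0.05
  p_positive = sensitivity * prior + p_false_positive * (1 - prior)
  posterior = sensitivity * prior / p_positive
  print(round(posterior, 3))                                 # 0.161

The same few lines, with the prior and accuracies swapped out, also reproduce the rare-disease calculation in question 23.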

3. Answer: C The prior probability represents our initial belief or knowledge about an event before we observe any new evidence. It's updated through Bayesian inference when new data becomes available, resulting in the posterior probability.

4. Answer: B The joint probability P(A∩B) equals the conditional probability P(A|B) times the marginal probability P(B). This can also be written as P(A∩B) = P(B|A) × P(A). This relationship is fundamental to understanding Bayes' theorem.

5. Answer: C The likelihood function L(θ|data) quantifies how probable the observed data is for different values of the parameter θ. It's not a probability distribution over θ, but rather indicates which parameter values make the observed data more or less likely.

6. Answer: B When events A and B are independent, P(A|B) = P(A). The occurrence of B doesn't affect the probability of A. Therefore, P(A|B) = 0.3. Independence means P(A∩B) = P(A) × P(B).

7. Answer: B The law of total probability states that if {B₁, B₂, ..., Bₙ} forms a partition of the sample space, then P(A) = Σ P(A|Bᵢ) × P(Bᵢ). Option E describes the special case where the partition is {B, B'}.

8. Answer: B Naive Bayes assumes conditional independence of features given the class. This means P(x₁, x₂, ..., xₙ|class) = P(x₁|class) × P(x₂|class) × ... × P(xₙ|class). This "naive" assumption greatly simplifies computation.
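
To make the assumption concrete, here is a small Python sketch with made-up likelihood tables (the words and probabilities are purely illustrative, not from any real classifier):

  # Naive Bayes scoring under the conditional-independence assumption
  p_class = {"spam": 0.4, "ham": 0.6}
  p_word_given_class = {
      "spam": {"free": 0.30, "meeting": 0.05},
      "ham":  {"free": 0.02, "meeting": 0.20},
  }

  def score(words, cls):
      # P(class) times the product of P(word | class): the "naive" factorization
      s = p_class[cls]
      for w in words:
          s *= p_word_given_class[cls][w]
      return s

  words = ["free", "meeting"]
  scores = {c: score(words, c) for c in p_class}
  total = sum(scores.values())
  print({c: round(v / total, 3) for c, v in scores.items()})   # spam ≈ 0.714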

9. Answer: B After drawing one red ball:

  • Remaining balls: 2 red, 2 blue (4 total)
  • P(second red | first red) = 2/4 = 1/2

This is a conditional probability problem where the sample space changes after the first draw.

10. Answer: B MAP estimation maximizes the posterior probability: argmax P(θ|data) = argmax P(data|θ) × P(θ). MLE maximizes only the likelihood: argmax P(data|θ). The key difference is that MAP incorporates prior information P(θ) about the parameters, while MLE uses only the likelihood from the observed data.
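
For a concrete coin-flip (Beta-Binomial) example, assuming a Beta(2, 2) prior purely for illustration:

  # MLE vs MAP for a coin's heads probability
  k, n = 7, 10     # observed: 7 heads in 10 flips
  a, b = 2, 2      # assumed Beta(2, 2) prior, mildly favouring a fair coin

  mle = k / n                                # maximizes the likelihood only
  map_est = (k + a - 1) / (n + a + b - 2)    # mode of the Beta(k+a, n-k+b) posterior
  print(mle, round(map_est, 3))              # 0.7 vs 0.667: the prior pulls MAP toward 0.5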

11. Answer: C The Bayesian Trap refers to the tendency to over-rely on initial beliefs (priors) and not adequately update them when presented with new evidence. This can lead to confirmation bias and stuck beliefs despite contradicting data.

12. Answer: C Initially, the car is behind each of the three doors with probability 1/3. When you pick a door, P(car behind your door) = 1/3. After the host reveals a goat behind one of the other two doors, P(car behind your door) remains 1/3, so P(car behind the remaining unopened door) = 2/3. Switching doubles your chances.
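
If the 2/3 result feels counterintuitive, a short Monte Carlo simulation (a sketch, not an optimized implementation) makes it easy to verify:

  # Monte Carlo check of the Monty Hall switch strategy
  import random

  def play(switch, trials=100_000):
      wins = 0
      for _ in range(trials):
          car = random.randrange(3)
          pick = random.randrange(3)
          # The host opens a door that hides a goat and is not the player's pick
          opened = next(d for d in range(3) if d != pick and d != car)
          if switch:
              pick = next(d for d in range(3) if d != pick and d != opened)
          wins += (pick == car)
      return wins / trials

  print(play(switch=True), play(switch=False))   # roughly 0.667 vs 0.333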

13. Answer: A Bayesian probability represents degrees of belief or uncertainty about propositions, while frequentist probability represents long-run frequencies of events in repeated trials. Bayesians can assign probabilities to hypotheses; frequentists cannot.

14. Answer: B A conjugate prior is one that, when multiplied by the likelihood function, produces a posterior distribution in the same family as the prior. For example, Beta prior with Binomial likelihood gives Beta posterior.

15. Answer: A For independence, we need P(A∩B) = P(A) × P(B). Check: 0.4 × 0.3 = 0.12 = P(A∩B) ✓. The events are independent.

16. Answer: B The base rate fallacy is the tendency to ignore or underweight base rates (prior probabilities) when making probability judgments, especially when presented with specific case information.

17. Answer: B As more data is collected, the likelihood term dominates and the influence of the prior diminishes. The posterior becomes increasingly determined by the data rather than initial beliefs.

18. Answer: C P(B) is called the evidence or normalization constant. It ensures the posterior is a proper probability distribution. It can be calculated using the law of total probability.

19. Answer: B Using Bayes: P(spam|word) = P(word|spam) × P(spam) / P(word)

  • P(word) = P(word|spam)×P(spam) + P(word|not spam)×P(not spam)
  • P(word) = 0.8×0.4 + 0.1×0.6 = 0.32 + 0.06 = 0.38
  • P(spam|word) = (0.8×0.4)/0.38 = 0.32/0.38 ≈ 0.84
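
The same arithmetic in Python (variable names are illustrative):

  # Spam-filter posterior for question 19
  p_spam = 0.4
  p_word_given_spam = 0.8
  p_word_given_ham = 0.1

  p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)   # 0.38
  print(round(p_word_given_spam * p_spam / p_word, 2))                    # 0.84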

20. Answer: A A flat or uniform prior assigns equal probability density to all possible parameter values in the range. It represents maximum uncertainty or no prior knowledge about the parameter.

21. Answer: C The Bayes factor is the ratio of the likelihood of the observed data under one hypothesis to the likelihood under another hypothesis: BF = P(data|H₁) / P(data|H₀).
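
As one concrete illustration (a sketch under a flat prior, not the only way to set this up), for k heads in n coin flips we can compare a point null H₀: p = 0.5 against H₁: p ~ Uniform(0, 1), whose marginal likelihood has the closed form 1/(n + 1):

  # Bayes factor BF10 for k heads in n flips: H1 (flat prior on p) vs H0 (p = 0.5)
  from scipy.stats import binom

  k, n = 60, 100
  likelihood_h0 = binom.pmf(k, n, 0.5)   # P(data | H0)
  marginal_h1 = 1 / (n + 1)              # integral of the Binomial likelihood over a flat prior
  print(round(marginal_h1 / likelihood_h0, 2))   # close to 1: the data barely favour either hypothesis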

22. Answer: B Empirical Bayes estimates prior parameters from the data itself, rather than specifying them based on external information. It's a compromise between full Bayesian and frequentist approaches.

23. Answer: D

  • P(disease) = 0.001, P(no disease) = 0.999
  • P(+|disease) = 0.99, P(+|no disease) = 0.01
  • P(+) = 0.99×0.001 + 0.01×0.999 = 0.00099 + 0.00999 = 0.01098
  • P(disease|+) = (0.99×0.001)/0.01098 ≈ 0.09
  • Therefore, the fraction of positive tests that are false positives ≈ 1 - 0.09 = 0.91, or 91%

24. Answer: A The prosecutor's fallacy confuses P(evidence|innocent) with P(innocent|evidence). For example, saying "the probability of this evidence if innocent is 1 in a million" doesn't mean "the probability of innocence is 1 in a million."

25. Answer: B D-separation (directional separation) is a criterion for determining conditional independence between nodes in a Bayesian network given a set of observed nodes.

26. Answer: B The Jeffreys prior is a non-informative prior that is invariant under reparameterization. It is proportional to the square root of the determinant of the Fisher information matrix.

27. Answer: B Bayesian model comparison computes posterior probabilities for different models using Bayes factors, while frequentist testing uses p-values and significance levels to reject or fail to reject hypotheses.

28. Answer: B P(A|B) is the probability of event A occurring given that B has occurred. P(B|A) is the probability of event B given A. They're related through Bayes' theorem but generally different values.

29. Answer: B A credible interval is the Bayesian analog to confidence intervals. It contains a specified probability mass (e.g., 95%) of the posterior distribution, representing uncertainty about the parameter.
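
A minimal sketch, assuming a Beta(9, 5) posterior (e.g. a Beta(2, 2) prior updated with 7 successes in 10 trials):

  # 95% equal-tailed credible interval from a Beta posterior
  from scipy.stats import beta

  a_post, b_post = 9, 5
  lower, upper = beta.ppf([0.025, 0.975], a_post, b_post)
  print(round(lower, 3), round(upper, 3))   # the central 95% of the posterior mass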

30. Answer: B Improper priors (those that don't integrate to 1) may lead to improper posteriors that don't integrate to 1, making them invalid probability distributions and inference impossible.

31. Answer: B In Bayesian inference, simpler models naturally get higher posterior probability through the marginal likelihood (Bayesian Occam's Razor). Complex models that fit data only slightly better are penalized.

32. Answer: B Cromwell's Rule advises against using prior probabilities of exactly 0 or 1, since no amount of evidence could then change those beliefs. Always leave room for revision in the light of new evidence.

33. Answer: B In sequential updating, today's posterior becomes tomorrow's prior. This is the essence of Bayesian learning: P(θ|data₁, data₂) can be computed as updating P(θ|data₁) with data₂.
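
A quick Beta-Binomial sketch of this (the batch sizes are arbitrary):

  # Sequential updating: yesterday's posterior is today's prior
  a, b = 1, 1                    # flat Beta(1, 1) prior
  batches = [(3, 5), (6, 10)]    # (successes, trials) observed in two batches

  for k, n in batches:
      a, b = a + k, b + (n - k)  # the posterior after this batch becomes the next prior
  print(a, b)   # Beta(10, 7): identical to updating once on all 9 successes in 15 trials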

34. Answer: B Bayes' theorem underlies many machine learning algorithms including Naive Bayes classifiers, Bayesian networks, Gaussian processes, and Bayesian neural networks. It provides a principled approach to learning from data.

35. Answer: B P(A|defect) = P(defect|A) × P(A) / P(defect)

  • P(defect) = 0.02×0.4 + 0.03×0.35 + 0.05×0.25 = 0.008 + 0.0105 + 0.0125 = 0.031
  • P(A|defect) = (0.02×0.4)/0.031 = 0.008/0.031 ≈ 0.258 ≈ 0.26
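
The same calculation as a small Python dictionary exercise, which returns the posteriors for all three machines at once:

  # Which machine produced the defective widget? (question 35)
  production = {"A": 0.40, "B": 0.35, "C": 0.25}    # P(machine)
  defect_rate = {"A": 0.02, "B": 0.03, "C": 0.05}   # P(defect | machine)

  p_defect = sum(production[m] * defect_rate[m] for m in production)   # law of total probability
  posterior = {m: production[m] * defect_rate[m] / p_defect for m in production}
  print({m: round(p, 2) for m, p in posterior.items()})   # {'A': 0.26, 'B': 0.34, 'C': 0.4}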

36. Answer: B The Dutch Book argument shows that if your beliefs violate probability axioms, a clever gambler could construct a series of bets (Dutch Book) that guarantees you lose money regardless of outcomes.

37. Answer: B Bayesian A/B testing can directly compute P(A > B), giving probability statements about which variant is better, rather than just rejecting null hypotheses. It also handles optional stopping better.
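
A minimal Monte Carlo sketch, with made-up conversion counts, of how P(B beats A) can be read straight off the posteriors:

  # P(variant B beats variant A) from Beta posteriors (hypothetical counts)
  import numpy as np

  conv_a, n_a = 120, 1000        # conversions / visitors for variant A
  conv_b, n_b = 140, 1000        # conversions / visitors for variant B

  rng = np.random.default_rng(0)
  samples_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
  samples_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)
  print((samples_b > samples_a).mean())   # roughly 0.9 for these counts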

38. Answer: B The Jeffreys-Lindley paradox shows that with large samples and a sharp (point) null hypothesis, Bayesian tests (with certain priors) and frequentist tests can give contradictory results: one rejecting H₀ while the other accepts it.

39. Answer: B In Bayesian interpretation, hypotheses can have probabilities representing degrees of belief. This contrasts with frequentist interpretation where hypotheses are either true or false, not probabilistic.

40. Answer: B The principle of insufficient reason states that when there's no information to distinguish between outcomes, assign them equal probabilities. It's a way to choose priors in complete ignorance.
