Probability Distributions: The Math Behind AI Predictions
Binomial, Poisson, Normal, Student's t — the probability distributions that power statistical models and machine learning, with real examples.
In the previous post, I covered the building blocks of statistics — populations, samples, variables, and descriptive measures. Now comes the part that connects directly to how AI makes predictions: probability distributions.
Every time a model outputs a confidence score, every time you see a p-value, every time an algorithm estimates the likelihood of an event — there's a probability distribution underneath. Understanding the main ones gives you intuition for what your models are actually doing.
What Is a Probability Distribution?
A probability distribution maps every possible outcome of a random variable to its probability of occurring. Think of it as an answer to: "If I repeat this experiment many times, what pattern will the results follow?"
There are two families:
- Discrete distributions — for countable outcomes (number of defective items, number of calls per hour)
- Continuous distributions — for measurable outcomes (temperature, income, height)
Statisticians have already tabulated the probabilities for the most common distributions. You don't need to derive them from scratch — you just need to know which one fits your problem and plug in the parameters.
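To make that concrete, here's a quick simulation sketch in Python (the fair die is hypothetical, chosen just for illustration): repeat the experiment enough times and the empirical frequencies settle onto the theoretical distribution.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the run is reproducible

# Roll a fair six-sided die 100,000 times and tally how often each face appears
rolls = rng.integers(1, 7, size=100_000)
values, counts = np.unique(rolls, return_counts=True)

for value, count in zip(values, counts):
    print(f"face {value}: empirical {count / len(rolls):.4f} vs theoretical {1/6:.4f}")
```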
Discrete Distributions
Binomial: Success or Failure, Repeated
The binomial distribution models a series of independent trials where each trial has exactly two outcomes — success or failure — and the probability of success stays constant.
Formula: P(X = x) = C(n, x) × p^x × (1-p)^(n-x)
Where:
- n = number of trials
- x = number of successes you're looking for
- p = probability of success on each trial
Example: 70% of stocks on the NYSE had their prices increase last month. You recommend 10 stocks to a client. What's the probability that none of them go up?
- n = 10, x = 0, p = 0.70
- P(X = 0) = C(10, 0) × 0.70⁰ × 0.30¹⁰ = 0.30¹⁰ ≈ 0.0000059 (about 0.0006%)
Almost zero — which makes sense. If 70% of stocks go up, having all 10 of your picks go down would be spectacularly unlucky.
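If you want to sanity-check that number yourself, here's a minimal sketch using scipy.stats (assuming scipy is installed):

```python
from scipy.stats import binom

n, p = 10, 0.70  # 10 picks, each rising with probability 0.70

# Probability that exactly 0 of the 10 stocks go up
print(binom.pmf(0, n, p))   # ≈ 0.0000059, i.e. 0.30**10

# Cumulative version: probability that at most 5 go up
print(binom.cdf(5, n, p))   # ≈ 0.1503
```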
Poisson: Counting Events Over Time
Poisson models the number of events occurring in a fixed interval of time or space — when events happen independently and at a known average rate.
Formula: P(X = x) = (λ^x × e^(-λ)) / x!
Where λ (lambda) is the average number of occurrences.
Example: an IT help desk receives an average of 5 calls per hour. What's the probability of receiving exactly 3 calls in the next hour?
- λ = 5, x = 3
- P(X = 3) = (5³ × e⁻⁵) / 3! = (125 × 0.00674) / 6 ≈ 0.1404 (14.04%)
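Here's the same calculation with scipy, plus a tail probability, the kind of number you'd reason about when setting alerting thresholds (a minimal sketch, assuming scipy is installed):

```python
from scipy.stats import poisson

lam = 5  # average of 5 calls per hour

# Probability of exactly 3 calls in the next hour
print(poisson.pmf(3, lam))   # ≈ 0.1404

# Probability of 8 or more calls in an hour, a tail event worth alerting on
print(poisson.sf(7, lam))    # ≈ 0.1334
```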
Poisson shows up everywhere: website visits per minute, server errors per day, customer arrivals per hour. If you've ever set up alerting thresholds, you were implicitly reasoning about Poisson distributions.
Geometric: How Many Failures Before Success?
The geometric distribution models the number of failures before the first success.
Formula: P(K = k) = (1-p)^k × p
Where p is the probability of success and k is the number of failures before it happens.
Example: a chemistry test has a 40% pass rate. What's the probability of failing twice before passing on the third attempt?
- p = 0.40, k = 2 (failures)
- P(K = 2) = 0.60² × 0.40 = 0.36 × 0.40 = 0.144 (14.4%)
This pattern shows up in retry logic too: if an API call has a 90% success rate, the probability that the first success only arrives on the third attempt is (0.10)² × 0.90 = 0.9%. Long failure streaks become geometrically unlikely, which is part of why retry-with-backoff strategies resolve most transient failures within an attempt or two.
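Both numbers check out in code. One caveat: scipy's geom counts trials until the first success, not failures before it, but the negative binomial with a single success matches the parameterization used here (a sketch, assuming scipy):

```python
from scipy.stats import nbinom

# nbinom.pmf(k, n, p): probability of k failures before the n-th success.
# With n = 1 this is exactly the geometric distribution as defined above.

# Chemistry test: two failures before passing (p = 0.40)
print(nbinom.pmf(2, 1, 0.40))   # ≈ 0.144

# API retries: first success only on the third attempt (p = 0.90)
print(nbinom.pmf(2, 1, 0.90))   # ≈ 0.009
```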
Hypergeometric: Sampling Without Replacement
Unlike the binomial (which assumes each trial is independent), the hypergeometric distribution handles sampling without replacement — where each draw changes the probabilities for the next one.
Example: a company has 6 employees, 3 of whom have been there 5+ years. You randomly select 4. What's the probability that exactly 2 of the 4 have 5+ years?
- Population N = 6, successes in population = 3, sample n = 4, target x = 2
- P(X = 2) = [C(3, 2) × C(3, 2)] / C(6, 4) = (3 × 3) / 15 = 0.60 (60%)
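In scipy this is hypergeom; just watch the argument order, which trips people up (a minimal sketch):

```python
from scipy.stats import hypergeom

# hypergeom.pmf(k, M, n, N): k successes in a sample of size N,
# drawn from a population of M items containing n successes
print(hypergeom.pmf(2, 6, 3, 4))   # = C(3,2) * C(3,2) / C(6,4) = 0.60
```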
This comes up in quality control, auditing, and any scenario where you're drawing from a finite pool without putting items back.
Continuous Distributions
Normal Distribution (Z): The Bell Curve
The most important distribution in statistics. It's symmetric, bell-shaped, and described entirely by two parameters: mean (μ) and standard deviation (σ).
Why is it so important? Three reasons:
- Many real-world processes naturally follow it — heights, test scores, measurement errors.
- It approximates other distributions — with large enough samples, both Binomial and Poisson converge toward the normal curve.
- The Central Limit Theorem — sample means are approximately normally distributed regardless of the shape of the underlying population, as long as the sample is large enough.
Standardization converts any normal distribution to one with mean = 0 and SD = 1:
Z = (x - μ) / σ
This lets you use a single reference table for any normal distribution. The critical value everyone memorizes: Z = 1.96 for 95% confidence (2.5% in each tail).
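A short sketch of both ideas, standardizing a value and recovering the 1.96 critical value (the height numbers are hypothetical, chosen only for illustration):

```python
from scipy.stats import norm

# How unusual is a 185 cm height if heights follow N(175, 7)?
mu, sigma, x = 175, 7, 185
z = (x - mu) / sigma
print(z)                 # ≈ 1.43
print(norm.cdf(z))       # ≈ 0.92, about 92% of values fall below x

# The famous critical value: the z that leaves 2.5% in the upper tail
print(norm.ppf(0.975))   # ≈ 1.96
```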
T-Distribution (Student's t): When Samples Are Small
The T-distribution looks like the normal curve but with heavier tails — meaning extreme values are more likely. It's used when:
- Your sample size is small (roughly n ≤ 30, by the usual rule of thumb)
- You don't know the population standard deviation
It depends on degrees of freedom (n - k, where k is the number of parameters being estimated). As degrees of freedom increase, the T-distribution converges to the normal — above 100 observations, they're practically identical.
Example: you A/B test a new checkout flow with 20 users. The average session time is 4.2 minutes with a standard deviation of 1.1. Is that significantly different from the old flow's 5.0 minutes? You'd use the T-distribution (not Z) because n = 20 is small and you're estimating SD from the sample itself.
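Here's what that test looks like from summary statistics (a sketch using the numbers above; scipy's ttest_1samp expects raw data, so the t-statistic is computed by hand):

```python
import math
from scipy.stats import t

n, xbar, s, mu0 = 20, 4.2, 1.1, 5.0   # values from the example above

t_stat = (xbar - mu0) / (s / math.sqrt(n))
p_value = 2 * t.sf(abs(t_stat), n - 1)   # two-sided test, 19 degrees of freedom

print(t_stat)    # ≈ -3.25
print(p_value)   # ≈ 0.004, significant at the usual 5% level
```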
The T-distribution is the workhorse of regression analysis. Every time you see a "t-statistic" next to a coefficient in a regression output, that's this distribution being used to test whether the coefficient is significantly different from zero.
Chi-Square (χ²): Testing Fit and Independence
An asymmetric distribution used for:
- Goodness-of-fit tests — does observed data match an expected distribution?
- Tests of independence — are two categorical variables related?
- Variance testing — is the variance of a population equal to a hypothesized value?
It also uses degrees of freedom, which vary depending on the specific test.
Example: you survey 200 users about their preferred payment method (credit card, PayPal, crypto) and want to know if preference varies by age group. A Chi-Square test of independence tells you whether the relationship between age and payment preference is statistically significant, or just random noise in the sample.
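Here's what that looks like in code. The survey counts below are made up, invented just to show the mechanics (a sketch, assuming scipy):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = age groups, columns = payment methods
#                 credit card  PayPal  crypto
observed = np.array([[45,        30,     5],    # under 30
                     [55,        50,    15]])   # 30 and over

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)
# A small p-value would suggest preference really does vary by age group
```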
F-Distribution (Fisher): Testing Models
The F-distribution tests whether a regression model as a whole is statistically significant. It uses two types of degrees of freedom:
- Numerator: k - 1, the number of restrictions being tested (the slope coefficients)
- Denominator: n - k, observations minus estimated parameters
When you see "F-statistic" in a regression summary, it's answering: "Is this model better than just guessing the mean?" A high F-statistic with a low p-value means yes — at least one of your predictors has explanatory power.
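Here's the full chain on synthetic data: fit a small regression by least squares and compute the overall F-statistic by hand (a sketch; the data is randomly generated, not from any real study):

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(0)

# Synthetic data: y depends on x1 but not on x2
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2.0 + 1.5 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])   # design matrix with intercept
k = X.shape[1]                              # 3 estimated parameters

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

rss = np.sum(residuals**2)              # residual sum of squares
tss = np.sum((y - y.mean())**2)         # total sum of squares

# Overall F-test: numerator df = k - 1, denominator df = n - k
F = ((tss - rss) / (k - 1)) / (rss / (n - k))
p_value = f.sf(F, k - 1, n - k)
print(F, p_value)   # large F, tiny p: the model beats guessing the mean
```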
Logistic Distribution: Binary Outcomes
The logistic distribution powers logistic regression — one of the most common models in classification. The outcome is binary: 0 or 1, yes or no, spam or not spam.
It maps any input to a probability between 0 and 1 using the sigmoid function — the S-shaped curve that compresses any value into the (0, 1) range. That sigmoid is, in fact, the CDF of the logistic distribution, which is where the model gets its name. If you've ever built a spam filter, a churn predictor, or a fraud detection model, you've used this distribution — even if the library abstracted it away.
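The sigmoid itself is a few lines. The weights below are made-up numbers, just to show how a linear score becomes a probability:

```python
import numpy as np

def sigmoid(z):
    """Logistic CDF: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A linear score from a hypothetical churn model, mapped to a probability
score = 0.8 * 1.2 + 0.5 * (-2.0) + 0.1   # w1*x1 + w2*x2 + bias (invented values)
print(sigmoid(score))                    # ≈ 0.51, close to a coin flip
```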
Choosing the Right Distribution
The most common mistake isn't getting the math wrong — it's picking the wrong distribution for your data. Here's a quick decision guide:
| Your data looks like... | Use |
|---|---|
| Fixed number of yes/no trials | Binomial |
| Count of events in a time/space interval | Poisson |
| Trials until first success | Geometric |
| Drawing from a finite pool without replacement | Hypergeometric |
| Continuous, symmetric, bell-shaped | Normal (Z) |
| Small sample, unknown population SD | Student's t |
| Testing categorical relationships | Chi-Square |
| Testing overall model significance | F |
| Binary classification outcome | Logistic |
Why Developers Should Care
If you're training a model that predicts whether a user will churn, you're using logistic regression (logistic distribution). If you're evaluating whether a new feature improved conversion rates, you're running a hypothesis test (T or Z distribution). If you're building an anomaly detection system, you're defining thresholds based on standard deviations from the normal distribution.
You don't need to memorize formulas — libraries handle the computation. But knowing which distribution applies and why gives you the intuition to question results, debug models, and make better architectural decisions.
This post is based on my notes from IAA004 — Applied Statistics I at UFPR (Universidade Federal do Paraná), part of the Artificial Intelligence postgraduate specialization.