Statistics for AI: The Fundamentals Every Developer Should Know
Populations, samples, variables, mean, variance — the statistical building blocks that power every AI model, explained for developers.
I used to think statistics was something other people worried about. Researchers. Analysts. The kind of people who actually enjoy spreadsheets.
Then I started studying AI at UFPR and realized something uncomfortable: every machine learning model I'd ever used was built on statistics. Not vaguely — foundationally. The math behind predictions, the logic behind training data, the reason we split datasets into train/test — it's all statistics.
So I went back to basics. Here's what I learned, and why it matters more than most developers think.
What Statistics Actually Is
Statistics is the science of collecting, organizing, analyzing, and interpreting data. That's the textbook definition. The practical one is simpler:
Statistics turns raw data into answers.
It breaks down into three big areas:
- Descriptive Statistics — your first contact with data. Tables, charts, and summary measures like mean, median, and standard deviation. The goal is to see what the data looks like.
- Probability — modeling uncertainty. What's the chance of something happening? Distributions like the normal curve, Poisson, and binomial live here.
- Statistical Inference — the big one. Drawing conclusions about an entire population from a sample. This is where regression models, hypothesis testing, and predictions come in.
AI uses statistical techniques extensively — especially for making predictions. Every model you train is, at its core, a statistical inference engine.
Populations and Samples
These two concepts are deceptively simple, but getting them wrong breaks everything downstream.
- Population: the complete universe you want to study. Every single element. Example: all households in Brazil.
- Sample: a subset of that population. Example: 10,000 randomly selected households.
Why not just study the whole population? Because it's usually impossible. A national census takes years and costs billions. A well-designed sample can give you reliable results in weeks, for a fraction of the cost.
But here's the critical part:
There is no statistical technique that can fix a badly collected sample.
If your sample is biased, your conclusions are biased. Period. That's why sampling design matters so much — defining your target population, choosing a sampling method, and sizing your sample correctly.
How Big Should a Sample Be?
For large populations (where the sample is less than 5% of the total), the formula is:
n = (Z × σ / D)²
Where:
- Z = z-score for the chosen confidence level (1.96 for 95% confidence)
- σ = standard deviation
- D = acceptable margin of error
Say you want 95% confidence (Z = 1.96), your data has a standard deviation of 10, and you'll accept a margin of error of ±2:
n = (1.96 × 10 / 2)² ≈ 96 observations
Don't know the standard deviation upfront? Start with 50 observations, compute it, then calculate how many more you need. It's iterative — and that's fine.
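Here's a minimal sketch of that calculation in Python; the function and parameter names (`sample_size`, `z`, `sigma`, `margin`) are mine, invented for illustration:

```python
def sample_size(z: float, sigma: float, margin: float) -> float:
    """n = (Z * sigma / D)^2 -- round up to a whole number of observations in practice."""
    return (z * sigma / margin) ** 2

# 95% confidence (Z = 1.96), standard deviation of 10, margin of error of ±2
print(sample_size(z=1.96, sigma=10, margin=2))  # 96.04 -> about 96 observations
```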
Sampling Methods
There are two broad families. Probabilistic methods use randomness — every element has a known chance of being selected. Non-probabilistic methods rely on the researcher's judgment.
In practice, most rigorous studies use probabilistic sampling. The four main types:
- Simple Random — pure lottery. Every element has the same probability. This is what train_test_split does with your dataset (see the sketch after this list).
- Systematic — pick every k-th element. Think quality control on an assembly line: test every 1,000th screw off the line.
- Stratified — divide the population into groups (strata), then sample proportionally from each. If a university is 60% students, 30% staff, 10% faculty, your sample should mirror those ratios.
- Cluster — divide the area into clusters (e.g., neighborhoods), randomly select a few clusters, then survey everyone inside them.
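As a sketch of how two of these look in code, the snippet below uses scikit-learn's train_test_split for a plain random split and a stratified one. The toy DataFrame and its column names are invented just for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 1,000 rows with a 'group' column mixed 60% / 30% / 10%
df = pd.DataFrame({
    "value": range(1000),
    "group": ["student"] * 600 + ["staff"] * 300 + ["faculty"] * 100,
})

# Simple random sampling: every row has the same chance of landing in the test set
train, test = train_test_split(df, test_size=0.2, random_state=42)

# Stratified sampling: each split preserves the 60/30/10 group proportions
train_s, test_s = train_test_split(
    df, test_size=0.2, stratify=df["group"], random_state=42
)
print(test_s["group"].value_counts(normalize=True))  # ~0.60 / 0.30 / 0.10
```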
Non-probabilistic methods — like convenience sampling (interviewing whoever is available), judgment sampling (handpicking experts), or quota sampling (matching population proportions without randomness) — are faster and cheaper, but carry bias risk. They're common in market research and early-stage exploration, less so in rigorous statistical work.
Variables: The Building Blocks of Data
Every dataset is a collection of variables — characteristics that vary between observations. Understanding their types determines which statistical tools you can use.
Qualitative variables describe categories:
- Nominal — no natural order. Gender, region, color.
- Ordinal — ordered categories. Education level (high school → bachelor's → master's), socioeconomic class (A, B, C, D, E).
Quantitative variables describe numbers:
- Discrete — whole numbers. Number of residents in a household: 0, 1, 2, 3.
- Continuous — fractional numbers. Income: R$ 1,200. Height: 1.76m. Temperature: 22.5°C.
This distinction matters because different variable types require different statistical treatments. You can calculate the mean of income, but the "mean" of gender doesn't make sense — you'd use the mode instead.
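A quick way to feel that distinction is with pandas, where numeric and categorical columns invite different summaries. The tiny DataFrame below is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "income": [1200.0, 3500.0, 2100.0, 3500.0],   # quantitative, continuous
    "residents": [1, 3, 2, 4],                    # quantitative, discrete
    "gender": ["F", "M", "F", "F"],               # qualitative, nominal
})

print(df["income"].mean())     # a mean makes sense for numbers
print(df["gender"].mode()[0])  # for categories, ask for the mode instead
```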
Measures of Position: Summarizing Data in One Number
When you have thousands of data points, you need ways to summarize them. The three most common:
Mean (Average)
x̄ = Σxᵢ / n
Add all values, divide by the count. Simple, familiar, and useful — but sensitive to outliers. One billionaire in a room of teachers will dramatically skew the "average income."
Median
The middle value when data is sorted. If you have 5 values, it's the 3rd. If you have 4, it's the average of the 2nd and 3rd.
The median is robust to outliers. That billionaire doesn't move the median much. This is why income reports often use median instead of mean — it better represents the "typical" person.
Mode
The most frequently occurring value. If 32 appears three times in a set and no other value appears as often, the mode is 32.
Mode is most useful for qualitative data. "What's the most common blood type?" is a mode question.
A distribution can be bimodal (two values tied for most frequent) or multimodal (three or more).
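Python's standard-library statistics module covers all three measures; the sample values here are arbitrary, chosen to show the outlier effect:

```python
import statistics

incomes = [2100, 2300, 2500, 2700, 1_000_000]  # one billionaire-style outlier

print(statistics.mean(incomes))    # 201920 -- dragged way up by the outlier
print(statistics.median(incomes))  # 2500   -- barely notices it
print(statistics.mode([30, 32, 32, 32, 41]))  # 32 -- the most frequent value
```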
Measures of Dispersion: How Spread Out Is Your Data?
Two datasets can have the same mean but look completely different. That's where dispersion measures come in — they tell you how spread out the data is around the center.
Variance
The average of the squared distances from the mean.
- Population variance (σ²): divide by N
- Sample variance (S²): divide by n-1 (Bessel's correction, which compensates for estimating the mean from the same sample)
Why square the distances? Because unsquared differences from the mean always sum to zero — positive and negative deviations cancel each other out. Squaring forces all values positive, so the spread is captured instead of lost.
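NumPy exposes both versions through the `ddof` argument; a small check with made-up numbers:

```python
import numpy as np

data = np.array([4, 8, 6, 5, 3, 7])

population_var = data.var(ddof=0)  # divide by N   (sigma squared)
sample_var = data.var(ddof=1)      # divide by n-1 (S squared)

print(population_var, sample_var)  # the sample variance comes out slightly larger
```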
Standard Deviation
The square root of the variance. It brings the measure back to the original unit.
Here's the practical power of standard deviation in a normal distribution:
- ±1 SD from the mean → 68% of all data
- ±2 SD from the mean → 95% of all data
- ±3 SD from the mean → 99.7% of all data
Example: if electricity bills in a neighborhood have a mean of R$ 42 and a standard deviation of R$ 12:
- 68% of bills fall between R$ 30 and R$ 54
- 95% of bills fall between R$ 18 and R$ 66
That's incredibly useful for detecting anomalies. A bill of R$ 80? That's more than 3 standard deviations out — something's probably wrong.
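A rough anomaly check built on that rule, using the numbers from the example above; the 3-SD threshold is a common convention, not a law, and the helper function is my own sketch:

```python
def is_anomalous(value: float, mean: float, std: float, threshold: float = 3.0) -> bool:
    """Flag values more than `threshold` standard deviations away from the mean."""
    z_score = abs(value - mean) / std
    return z_score > threshold

print(is_anomalous(80, mean=42, std=12))  # True  -- |80 - 42| / 12 ≈ 3.17
print(is_anomalous(54, mean=42, std=12))  # False -- exactly 1 SD above the mean
```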
Coefficient of Variation (CV)
CV = standard deviation / mean (usually expressed as a percentage)
This makes dispersion comparable across different scales. Example:
- Stock A: mean R$ 150, SD R$ 5 → CV = 3.3%
- Stock B: mean R$ 50, SD R$ 3 → CV = 6.0%
Stock B has a lower standard deviation but higher relative variation — meaning it's actually riskier.
Rule of thumb: CV < 15% = low variation. 15-30% = moderate. > 30% = high.
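The same comparison takes a couple of lines; the numbers are the ones from the stock example above:

```python
def cv(std: float, mean: float) -> float:
    """Coefficient of variation, as a percentage."""
    return std / mean * 100

print(cv(std=5, mean=150))  # ~3.3% -> Stock A, low relative variation
print(cv(std=3, mean=50))   # 6.0%  -> Stock B, higher relative risk
```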
Quartiles and Percentiles
Quartiles split sorted data into four equal parts:
- Q1 (25th percentile): 25% of values fall below this
- Q2 (50th percentile): the median
- Q3 (75th percentile): 75% of values fall below this
The interquartile range (Q3 - Q1) measures the spread of the middle 50% of your data — ignoring outliers entirely. This is what a boxplot visualizes.
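NumPy's `percentile` gives you the quartiles directly; the data array below is arbitrary, with one deliberate outlier:

```python
import numpy as np

data = np.array([12, 15, 17, 19, 22, 24, 25, 28, 31, 90])  # note the 90

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # spread of the middle 50%, unaffected by the outlier

print(q1, q2, q3, iqr)
```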
Why This Matters for AI
If you've ever used train_test_split in scikit-learn, you were doing sampling. If you've ever normalized features, you were using mean and standard deviation. If you've ever looked at a confusion matrix, you were doing descriptive statistics.
These aren't abstract academic concepts — they're the foundation that every ML pipeline is built on. Understanding them doesn't just make you better at AI. It makes you better at questioning the data, the model, and the results.
This post is based on my notes from IAA004 — Applied Statistics I at UFPR (Universidade Federal do Paraná), part of the Artificial Intelligence postgraduate specialization.