[Basics] Part 2. Statistics first: how to read credit data
Before you reach for machine learning, statistics comes first. In credit, you ask 'is this difference real or just noise?' far more often than 'does the model fit well?' Here's the shape of financial data, the trap of multiple testing, how to handle small samples, and the bias that's baked in by default.
We looked at the domain in Part 1, so now we go down to the data itself. But there’s a reason I start with statistics instead of jumping straight to machine learning. The question we ask all day long in credit is less “does the model fit well?” and more “is this difference real, or is it just chance?” Statistics is the tool that answers that question.
It isn’t the flashy part. But if this layer gives way, whatever model you stack on top of it is built on sand. When a model produces conclusions that don’t match reality, this layer is often where the trouble is.
The distributions are different to begin with
A typical intro to statistics starts with the normal distribution. But walk into financial data assuming everything is normal and you’ll usually be off. Credit data is a different shape from the start.
- The default flag is either 0 (good) or 1 (default), and default is usually a rare event, with a rate of roughly 1 to 5%. The mass sits overwhelmingly on the good side.
- Loss given default (LGD) is often bimodal, piling up near 0 (almost fully recovered) and near 1 (almost nothing recovered). Take the average and there’s barely any data near that average value.
- A variable like transaction amount has a long right tail: most values are small, a few are enormous. This is a fat tail.
So in credit you look at the distribution first. You use a log transform to tame the tail, you look at quantiles rather than the mean, and you pick a distribution that matches the shape of the data, like a beta distribution (LGD) or a negative binomial (delinquency counts). Assume the wrong distribution and every calculation on top of it starts to go crooked.
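To make the "quantiles rather than the mean" point concrete, here is a minimal sketch. The transaction amounts are simulated from a lognormal distribution, which is my assumption for illustration, not real data, and the parameters are arbitrary.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical transaction amounts: lognormal, so most are small and a few are huge.
amounts = [random.lognormvariate(3.0, 1.5) for _ in range(10_000)]

mean_amt = statistics.fmean(amounts)
median_amt = statistics.median(amounts)
p99_amt = statistics.quantiles(amounts, n=100)[98]  # 99th percentile cut point

# After a log transform the tail is tamed: mean and median nearly coincide.
log_amounts = [math.log(a) for a in amounts]
log_mean = statistics.fmean(log_amounts)
log_median = statistics.median(log_amounts)

print(f"raw: mean={mean_amt:.1f} median={median_amt:.1f} p99={p99_amt:.1f}")
print(f"log: mean={log_mean:.2f} median={log_median:.2f}")
```

On the raw scale the mean is dragged well above the median by a handful of large values. On the log scale the two sit close together, which is why the transformed variable is easier to work with.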
Not a point, but an interval
Saying “this segment’s default rate is 3%” only tells you half the story. That 3% carries completely different weight depending on whether the sample was 100 people or 100,000. That’s why interval estimation matters as much as the point estimate. And it’s worth being precise about what a confidence interval means. It isn’t “the probability that the true value lies in this interval.” It is the proportion of such intervals that would contain the true value if you repeated the estimation the same way many times. Reading it the first way is a common misunderstanding.
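Here is a rough sketch of how different the same 3% looks at the two sample sizes. I use the Wilson score interval for a proportion, which is my choice for the illustration; the function name is hypothetical and z = 1.96 assumes a 95% level.

```python
import math

def wilson_interval(defaults, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    p_hat = defaults / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Same 3% point estimate, very different sample sizes.
small = wilson_interval(3, 100)
large = wilson_interval(3_000, 100_000)

print(f"n=100:     {small[0]:.2%} to {small[1]:.2%}")
print(f"n=100,000: {large[0]:.2%} to {large[1]:.2%}")
```

With 100 customers the interval spans several percentage points. With 100,000 it is only a couple of tenths of a point wide. The point estimate is identical in both cases.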
In credit, small-sample segments come up all the time. Early customers on a new product, a particular limit band, a particular channel. Here the bootstrap is practical. You resample your data with replacement many times over, watch how much the statistic shifts each time, and use that to gauge the uncertainty.
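The resampling loop described above can be sketched in a few lines. This is a plain percentile bootstrap under assumed inputs: the 200-customer segment with 6 defaults is made up, and `bootstrap_ci` is a hypothetical helper name.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the default rate of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    rates = []
    for _ in range(n_resamples):
        resample = rng.choices(outcomes, k=n)  # draw n items with replacement
        rates.append(sum(resample) / n)
    rates.sort()
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical small segment: 200 customers, 6 defaults.
outcomes = [1] * 6 + [0] * 194
point = sum(outcomes) / len(outcomes)
lo, hi = bootstrap_ci(outcomes)

print(f"point estimate {point:.2%}, bootstrap interval {lo:.2%} to {hi:.2%}")
```

One caveat: if a segment has zero observed defaults, every resample also has zero, and the bootstrap interval collapses to a single point. The bootstrap can only reshuffle what the sample already contains.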
The most common trap in statistics: multiple testing
I want to call this one out separately. When you build a credit model, candidate variables come by the hundreds. You compare each one against the default rate and ask whether it’s “significant.” But test hundreds of them at once and, by pure chance alone, variables that “look significant” come pouring out. Test 100 completely unrelated variables at a 5% threshold and, on average, 5 will come back significant.
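You can watch this happen in a small simulation. Everything here is synthetic and assumed: 2,000 customers with a 5% default rate, 100 pure-noise variables, and a two-sample z-test on the difference in means as the stand-in "significance" check.

```python
import random
from statistics import NormalDist, fmean, variance

rng = random.Random(42)
n_customers, n_vars, alpha = 2000, 100, 0.05

# Synthetic default flags with a roughly 5% default rate.
defaults = [1 if rng.random() < 0.05 else 0 for _ in range(n_customers)]

def p_value_for_noise_variable():
    """Two-sided z-test comparing a pure-noise variable between bad and good."""
    x = [rng.gauss(0, 1) for _ in range(n_customers)]
    bad = [v for v, d in zip(x, defaults) if d == 1]
    good = [v for v, d in zip(x, defaults) if d == 0]
    se = (variance(bad) / len(bad) + variance(good) / len(good)) ** 0.5
    z = (fmean(bad) - fmean(good)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

p_values = [p_value_for_noise_variable() for _ in range(n_vars)]
false_hits = sum(p < alpha for p in p_values)

expected = n_vars * alpha                       # 5 false positives on average
prob_at_least_one = 1 - (1 - alpha) ** n_vars   # assumes independent tests

print(f"'significant' noise variables: {false_hits} (expected about {expected:.0f})")
print(f"chance of at least one false hit: {prob_at_least_one:.1%}")
```

None of these variables has any relationship to default, yet a handful typically clear the 5% bar. With 100 independent tests, coming away with at least one false hit is close to certain.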
So that “curious pattern” you spotted while scanning the variables is usually an artifact of multiple testing. Don’t trust it without correction. The simplest correction is Bonferroni: you divide the threshold by the number of tests, so the more tests you run, the stricter the bar to pass. If you’re looking at 100 variables you use 0.05% as your threshold instead of 5%. It’s safe but conservative, so it tends to drop real variables along with the false ones. That’s why Benjamini-Hochberg (FDR) sees more use in practice. Instead of guarding against even a single false positive, it controls the false discovery rate: the expected fraction of the variables you flagged as significant that are actually false. By tolerating a few false ones it catches more of the real variables. In Part 0 I mentioned models where “the validation metrics look great but it falls apart in production,” and multiple testing is one of the causes.
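For a side-by-side feel of the two corrections, here is a minimal sketch of both. The ten p-values are invented for illustration. In real work you would more likely use a library routine than hand-rolled functions like these.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up FDR procedure: find the largest rank k with p_(k) <= k/m * alpha,
    then reject the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected

# Made-up p-values for ten candidate variables.
p_values = [0.0001, 0.004, 0.012, 0.019, 0.03, 0.2, 0.5, 0.7, 0.85, 0.95]

bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
print("Bonferroni keeps:", sum(bonf))
print("Benjamini-Hochberg keeps:", sum(bh))
```

On this toy input Bonferroni keeps 2 variables and Benjamini-Hochberg keeps 4. The gap is the trade-off in miniature: a looser guarantee in exchange for fewer real variables thrown away.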
A Bayesian eye for small segments
Where the frequentist only lets the data speak, the Bayesian multiplies “what you knew beforehand” (the prior) by “what the data says” (the likelihood) to update a belief. Where this pays off especially well in credit is estimating the default rate of a small segment. If a 30-customer segment comes back with a 0% default rate, the true rate is almost certainly not 0%; with so few customers, seeing no defaults is quite plausible even when the true rate is a few percent. The less data you have, the more you pull the estimate toward the overall average; the more data accumulates, the more you trust the segment’s own observed rate. This is shrinkage. In effect, smaller segments get corrected closer to the overall average, while larger segments stay closer to their own value. A hierarchical model handles this pull automatically, inside the model structure rather than by hand. It estimates each segment separately while borrowing information from the whole, so that data-poor segments naturally drift toward the overall average.
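The simplest way I know to see shrinkage in numbers is a beta-binomial update. This is a hand-tuned stand-in, not a full hierarchical model: the prior is centered on the overall default rate, and `prior_strength` (how many "pseudo-customers" the prior is worth) is an assumption I picked for illustration. A hierarchical model would estimate that strength from the data.

```python
def shrunk_default_rate(defaults, n, overall_rate, prior_strength=100):
    """Posterior mean of a default rate under a Beta prior centered on overall_rate.

    prior_strength is a hypothetical pseudo-count: the prior weighs as much
    as that many observed customers.
    """
    a = overall_rate * prior_strength          # prior pseudo-defaults
    b = (1 - overall_rate) * prior_strength    # prior pseudo-goods
    return (a + defaults) / (a + b + n)

overall = 0.03

# Small segment: 30 customers, 0 defaults. The raw rate is 0%.
small = shrunk_default_rate(0, 30, overall)

# Large segment: 10,000 customers, 100 defaults. The raw rate is 1%.
large = shrunk_default_rate(100, 10_000, overall)

print(f"small segment: raw 0.00% -> shrunk {small:.2%}")
print(f"large segment: raw 1.00% -> shrunk {large:.2%}")
```

The 30-customer segment gets pulled most of the way to the overall 3%, while the 10,000-customer segment barely moves from its own 1%. Same formula, different amounts of data.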
Bias is the default
The last one is bias. And in credit, bias isn’t the exception; it’s the default.
- Selection bias: you only see the outcome of customers you approved. That’s the Part 0 story.
- Survivorship bias: when customers who have already left drop out of the data, you look only at the ones who stayed and draw the wrong conclusion.
- Leakage: future information you couldn’t have known at decision time seeps into a variable. Carelessly using a field that got updated after a default, for instance. It’s the prime suspect behind validation metrics that look unrealistically good.
- Sample-population mismatch: the population you trained on differs from the one you’ll actually apply the model to. This is the problem new underwriting always carries.
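Of these four, leakage is the one you can most directly guard against in code. Below is a minimal point-in-time check. The record layout, with an `as_of` date on every feature, is hypothetical and assumed for the example; real systems store this differently or not at all.

```python
from datetime import date

# Hypothetical record layout: each feature carries the date it was last updated.
application = {
    "decision_date": date(2023, 3, 1),
    "features": {
        "income": {"value": 52_000, "as_of": date(2023, 2, 20)},
        "bureau_score": {"value": 640, "as_of": date(2023, 2, 28)},
        # Updated months after the decision, e.g. following a default.
        "collections_status": {"value": "in_collections", "as_of": date(2023, 9, 15)},
    },
}

def point_in_time_features(app):
    """Keep only features whose value was known on or before the decision date."""
    cutoff = app["decision_date"]
    usable, leaked = {}, []
    for name, field in app["features"].items():
        if field["as_of"] <= cutoff:
            usable[name] = field["value"]
        else:
            leaked.append(name)
    return usable, leaked

usable, leaked = point_in_time_features(application)
print("usable:", sorted(usable))
print("leaked:", leaked)
```

The check is trivial once the timestamp exists. The hard part in practice is that many fields are overwritten in place and carry no record of when they changed.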
When you handle samples, stratified sampling is close to mandatory on imbalanced data. Above all, you have to keep asking whether this data represents the population behind the question you’re trying to answer.
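Stratified sampling itself is simple enough to sketch by hand. The function below is a hypothetical helper, and the 1,000-customer portfolio with 30 defaults is made up. Most ML libraries offer an equivalent built in.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so each class keeps the same share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical portfolio: 1,000 customers, 30 defaults (3%).
labels = [1] * 30 + [0] * 970
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"train default rate {train_rate:.2%}, test default rate {test_rate:.2%}")
```

A purely random split of the same data could easily leave the test set with two defaults or ten. Splitting within each class keeps the default rate the same on both sides, so the rare class is never left to chance.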
Wrapping up
Statistics isn’t flashy, but in the end it’s a tool for honesty. Look at the distribution first, speak in intervals rather than points, distrust multiple testing, handle small samples with humility, and assume bias is the default. The eye for reading credit data starts here.
In the next installment, Part 3, we stack machine learning on top of this foundation. We’ll look at why logistic regression survived as the scoring standard even in the boosting era, and why, in finance, even cross-validation has to be done differently.