[Basics] Part 5. Ranking isn't enough: three axes for evaluating a credit model

In Part 4 we built a model. So now, how do you tell whether it’s any good? For an ordinary machine-learning classifier you usually look at one thing: how well it gets the answers right, which really means how well it ranks. In credit, that alone isn’t enough.

Evaluating a credit model splits into three questions. Does it rank well — who is riskier than whom (discrimination)? Do those risks line up with the actual default probability (calibration)? And does the assessment stay consistent as time passes and the customers change (stability)? Ordinary ML covers the first one too, but the last two matter more precisely because this is credit.

Discrimination: how well does it rank?

Start with discrimination — how cleanly the model separates the risky customers from the safe ones.

The discrimination metric practitioners most often report as the headline is AUC. AUC is the probability that, if you draw one defaulter and one non-defaulter at random, the model gives the defaulter the higher risk score. At 0.5 it’s a coin flip; at 1.0 it’s perfect separation; it’s also the area under the ROC curve. In credit we often restate it as Gini. Since Gini is 2 × AUC − 1, an AUC of 0.5 is a Gini of 0 and an AUC of 1.0 is a Gini of 1. It’s the same information rewritten around zero, so when the industry says “a Gini 0.5 model,” it means an AUC of 0.75.
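To make the pairwise definition concrete, here is a minimal sketch in plain Python. The function names and the toy data are illustrative assumptions, not part of this series; it assumes a higher score means higher risk and counts ties as half a win. In practice you would use a library routine rather than this quadratic loop.

```python
from itertools import product


def pairwise_auc(risk_scores, defaulted):
    """AUC as the share of (defaulter, non-defaulter) pairs ranked correctly.

    Assumes a higher score means higher risk. Ties count as half a win.
    Cost is O(n_bad * n_good), so this is for illustration only.
    """
    bad = [s for s, d in zip(risk_scores, defaulted) if d == 1]
    good = [s for s, d in zip(risk_scores, defaulted) if d == 0]
    if not bad or not good:
        raise ValueError("need at least one defaulter and one non-defaulter")
    wins = sum(
        1.0 if b > g else 0.5 if b == g else 0.0
        for b, g in product(bad, good)
    )
    return wins / (len(bad) * len(good))


def gini_from_auc(auc):
    """Gini is AUC re-centred around zero: Gini = 2 * AUC - 1."""
    return 2.0 * auc - 1.0


# Toy data (made up): 2 defaulters among 8 customers.
scores = [0.9, 0.8, 0.4, 0.35, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0]

auc = pairwise_auc(scores, labels)   # 11 of 12 pairs ranked correctly
gini = gini_from_auc(auc)
print(f"AUC={auc:.3f}  Gini={gini:.3f}")
```

Only one of the twelve defaulter/non-defaulter pairs is out of order here (the defaulter at 0.4 against the non-defaulter at 0.8), which is exactly what the AUC counts.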

But as we saw in Part 3, default is usually a rare event, on the order of 1 to 5%. When one side is this scarce, ROC-based AUC can look more generous than it really is: the false positive rate is measured against the huge pool of non-defaulters, so even a large number of false alarms barely moves it. So for imbalanced problems like default, you look at PR-AUC (the area under the precision-recall curve, i.e. average precision) or the top-k% capture rate alongside it, because both are more sensitive to how many of the rare defaults you actually catch. The AMEX metric we saw in the SSL piece belongs to this family too: a blend of top-4% capture and Gini.
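As a rough sketch of the two imbalance-friendly metrics, the snippet below computes average precision and a top-k% capture rate by hand. The function names and data are hypothetical; it assumes higher score means riskier and that there are no tied scores (library implementations handle ties more carefully).

```python
def _rank_by_risk(risk_scores, defaulted):
    """Sort (score, label) pairs from riskiest to safest."""
    return sorted(zip(risk_scores, defaulted), key=lambda p: p[0], reverse=True)


def average_precision(risk_scores, defaulted):
    """Mean of the precision measured at the rank of each defaulter."""
    ranked = _rank_by_risk(risk_scores, defaulted)
    n_bad = sum(d for _, d in ranked)
    if n_bad == 0:
        raise ValueError("no defaulters in the sample")
    hits, total = 0, 0.0
    for rank, (_, d) in enumerate(ranked, start=1):
        if d == 1:
            hits += 1
            total += hits / rank   # precision among the top `rank` customers
    return total / n_bad


def top_k_capture(risk_scores, defaulted, k_frac):
    """Share of all defaulters found in the riskiest k_frac of customers."""
    ranked = _rank_by_risk(risk_scores, defaulted)
    n_bad = sum(d for _, d in ranked)
    if n_bad == 0:
        raise ValueError("no defaulters in the sample")
    n_top = max(1, int(len(ranked) * k_frac))
    return sum(d for _, d in ranked[:n_top]) / n_bad


# Toy data (made up): 2 defaulters among 8 customers.
scores = [0.9, 0.8, 0.4, 0.35, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0]

ap = average_precision(scores, labels)
capture = top_k_capture(scores, labels, k_frac=0.25)
print(f"average precision={ap:.3f}  top-25% capture={capture:.2f}")
```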

The two do have different characters, though. AUC and Gini don’t move when the default rate changes (they’re prevalence-invariant), which makes them good for comparing models across portfolios or time periods. PR-AUC, by contrast, depends on the default rate itself, so you can’t compare its numbers directly across datasets with different default rates. They aren’t rivals so much as different tools for different jobs. When you need to compare, reach for AUC; when you want to know whether the model catches the rare defaults, reach for PR-AUC.

Beyond those, one more discrimination metric has rooted itself especially deep in credit: KS (Kolmogorov-Smirnov). The idea is simple. Take a credit score where low means risky. You sweep the scores from low to high and, at each point, compare what percent of all defaulters have accumulated so far against what percent of all non-defaulters have. In a good model the defaulters cluster at the low scores and the non-defaulters at the high scores, so the two cumulative curves pull far apart. KS is the size of the gap at the point where they’re widest.

There’s a reason KS has stuck around in credit for so long. It boils down to one number; it’s usually quoted on a 0 to 100 scale, so it’s intuitive; like AUC, it doesn’t move with the default rate; and above all it fits the underwriting mindset of splitting approve from decline with a single cutoff line. The point where KS is largest is itself a candidate cutoff, the score where non-defaulters and defaulters separate best. As a rough rule of thumb, above 20 is usable and above 40 is a strong model. Too high and you start to suspect leakage instead, the same story as in Part 3.
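The KS sweep is short enough to write out. This is a minimal sketch under the assumption that the score is a credit score (low means risky) and labels are 0/1 integers; `ks_statistic` and the toy scores are illustrative names and numbers, not taken from a real scorecard.

```python
def ks_statistic(credit_scores, defaulted):
    """Return (KS on a 0-100 scale, score where the gap is widest).

    Sweeps scores from low to high and tracks the cumulative share of
    defaulters and of non-defaulters seen so far.
    """
    n_bad = sum(defaulted)
    n_good = len(defaulted) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("need both defaulters and non-defaulters")

    # Count defaulters / non-defaulters at each distinct score.
    by_score = {}
    for s, d in zip(credit_scores, defaulted):
        bad, good = by_score.get(s, (0, 0))
        by_score[s] = (bad + d, good + (1 - d))

    cum_bad = cum_good = 0
    best_gap, best_score = 0.0, None
    for s in sorted(by_score):
        bad, good = by_score[s]
        cum_bad += bad
        cum_good += good
        gap = abs(cum_bad / n_bad - cum_good / n_good)
        if gap > best_gap:
            best_gap, best_score = gap, s
    return 100.0 * best_gap, best_score


# Toy data (made up): 3 defaulters, mostly at the low end of the score range.
scores = [520, 560, 600, 640, 680, 720, 760, 800]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

ks, widest_at = ks_statistic(scores, labels)
print(f"KS={ks:.0f} at score {widest_at}")
```

In this toy sample the curves are furthest apart once the sweep has passed score 640: all three defaulters have accumulated but only one of the five non-defaulters has.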

These days practitioners more often lead with AUC or PR-AUC as the headline, but KS still shows up like a standard in regulatory documents and scorecard validation. Even if you don’t use it yourself, you’ll run into it often enough that you need to be able to read it. Either way, no matter how well you rank, ranking alone doesn’t mean you’ve finished evaluating a credit model.

Calibration: the probabilities have to match reality

In Part 4 I said discrimination and calibration are different problems. Getting the order right — who is riskier than whom — is separate from getting the exact percentage right — what this person’s default probability actually is. And in credit, that calibration is non-negotiable.

The reason is EL = PD × LGD × EAD from Part 1. To set aside provisions, to price rates and credit limits by risk, to compute expected loss, the default probability itself has to be right. If the ranking is correct but the probabilities are inflated, you’ve lined everyone up properly and still priced them wrong.

The fastest way to check calibration is to look. Bin customers by their predicted default probability, then, in each bin, plot the average probability the model predicted next to the fraction that actually defaulted. In a well-calibrated model the two are equal, so the points land on the diagonal. However far they drift off the diagonal is how far the probabilities are off.
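The binning step behind that plot can be sketched as follows. `calibration_table` is a hypothetical helper, the bins are equal-sized groups ordered by predicted probability (one common choice among several), and the numbers are made up. Plotting the first column against the second gives the diagonal check described above.

```python
def calibration_table(pred_probs, defaulted, n_bins=10):
    """Return one (mean predicted PD, observed default rate, count) row per bin.

    Customers are sorted by predicted probability and split into
    n_bins groups of (nearly) equal size.
    """
    ranked = sorted(zip(pred_probs, defaulted), key=lambda p: p[0])
    n = len(ranked)
    rows = []
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:          # can happen when n < n_bins
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(d for _, d in chunk) / len(chunk)
        rows.append((mean_pred, observed, len(chunk)))
    return rows


# Toy data (made up), two bins for readability.
preds = [0.01, 0.02, 0.03, 0.04, 0.2, 0.3, 0.4, 0.5]
labels = [0, 0, 0, 0, 0, 1, 0, 1]

for mean_pred, observed, count in calibration_table(preds, labels, n_bins=2):
    print(f"predicted={mean_pred:.3f}  observed={observed:.3f}  n={count}")
```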

The key point here is that discrimination can be good while calibration is off. A model with high AUC can have wildly inflated probabilities. In particular, as we saw in Part 3, if you resample the data 50/50 to deal with imbalance, or use a tree model’s scores straight as if they were probabilities, calibration breaks. So you always look at discrimination metrics and calibration together. You can summarize it numerically with the Brier score or expected calibration error (ECE), but it’s a good habit to check with your eyes first, the way the plot above does.
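For the numeric summaries, here is a small sketch of the Brier score and one common ECE variant (equal-width bins on [0, 1], weighted by bin size). The function names and the toy portfolio are assumptions for illustration. The two prediction vectors order customers identically, so any ranking metric would score them the same, yet the inflated one is visibly worse on both calibration numbers.

```python
def brier_score(pred_probs, defaulted):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - d) ** 2 for p, d in zip(pred_probs, defaulted)) / len(pred_probs)


def expected_calibration_error(pred_probs, defaulted, n_bins=10):
    """Size-weighted average of |mean predicted - observed rate| over bins."""
    bins = [[] for _ in range(n_bins)]
    for p, d in zip(pred_probs, defaulted):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in the last bin
        bins[idx].append((p, d))
    n = len(pred_probs)
    ece = 0.0
    for chunk in bins:
        if not chunk:
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(d for _, d in chunk) / len(chunk)
        ece += len(chunk) / n * abs(mean_pred - observed)
    return ece


# Toy portfolio (made up): 1 default in 10, i.e. a 10% default rate.
labels = [1] + [0] * 9
honest = [0.1] * 10      # matches the real default rate
inflated = [0.3] * 10    # same ordering, probabilities three times too high

for name, preds in [("honest", honest), ("inflated", inflated)]:
    print(name,
          f"Brier={brier_score(preds, labels):.3f}",
          f"ECE={expected_calibration_error(preds, labels):.3f}")
```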

A cause you meet especially often is the weighting in the training step. If you weight the minority class (defaults) or oversample it to push discrimination up (Part 3), the ranking can improve, but the output probabilities come out inflated relative to the real default rate. So you train discrimination as discrimination, and fit the probabilities separately, after training finishes. Three methods are standard. Platt scaling re-maps the model output through a logistic function; isotonic regression uses a monotonic step function for a more flexible, non-parametric fit. And when you know the sampling or weighting ratio, you shift the log-odds by an offset equal to the log of that ratio (the real odds of default divided by the odds in the training data), pulling the inflated probabilities back into place. It’s the same principle as the scorecard offset from Part 4.
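The third method, the log-odds offset, fits in a few lines. This is a sketch assuming the resampling or weighting changed only the class mix (defaults and non-defaults were each sampled at random), and that the training-time default rate and the real default rate are both known. `correct_for_sampling` is a hypothetical name.

```python
import math


def correct_for_sampling(p_train, train_rate, true_rate):
    """Map a probability from a resampled model back to the real population.

    Shifts the log-odds by log(true odds) - log(training odds).
    Assumes 0 < p_train < 1 and that only the class mix was altered.
    """
    logit = math.log(p_train / (1.0 - p_train))
    offset = (math.log(true_rate / (1.0 - true_rate))
              - math.log(train_rate / (1.0 - train_rate)))
    return 1.0 / (1.0 + math.exp(-(logit + offset)))


# A model trained on a 50/50 resample, applied to a book with a 2% default rate.
for p in (0.2, 0.5, 0.8):
    fixed = correct_for_sampling(p, train_rate=0.5, true_rate=0.02)
    print(f"model output {p:.2f} -> corrected PD {fixed:.4f}")
```

Because the shift is a constant on the log-odds scale, it is monotonic: the ranking, and therefore AUC and KS, are untouched, and only the probability level moves.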

Stability: does it hold up over time?

The third question is stability, which matters especially in credit. You build a model on past data and apply it to the future. But as time passes, the economy shifts, the customer base shifts, the products shift. A model that was good when you built it can fall apart six months later.

The standard metric for watching this is PSI (Population Stability Index). It measures, in a single number, how far the score distribution at development time has drifted from the distribution of customers coming in now. Roughly, below 0.1 is stable, 0.1 to 0.25 warrants attention, and above 0.25 is a signal that the distribution has moved a lot and you should revisit the model. The same approach measures the distribution shift of each individual variable (CSI, Characteristic Stability Index), pinpointing which variable moved.
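A minimal PSI sketch, assuming the usual definition: the sum over bins of (actual share − expected share) × ln(actual share / expected share), with bins fixed at the quantiles of the development sample. The function name, the floor `eps` for empty bins, and the synthetic scores are all assumptions made for illustration. Applying the same function to a single input variable instead of the score gives CSI.

```python
import bisect
import math


def psi(expected_scores, actual_scores, n_bins=10, eps=1e-6):
    """Population Stability Index between a development and a recent sample.

    Bin edges are quantiles of the development sample. Empty bins are
    floored at eps so the logarithm stays defined.
    """
    ref = sorted(expected_scores)
    # Interior cut points at the development-sample quantiles.
    cuts = [ref[len(ref) * i // n_bins] for i in range(1, n_bins)]

    def shares(scores):
        counts = [0] * n_bins
        for s in scores:
            counts[bisect.bisect_right(cuts, s)] += 1
        return [max(c / len(scores), eps) for c in counts]

    expected = shares(expected_scores)
    actual = shares(actual_scores)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Synthetic scores (made up): the recent population has shifted upward by 300.
development = list(range(1000))
recent = [s + 300 for s in development]

print(f"PSI vs itself:  {psi(development, development):.3f}")
print(f"PSI vs shifted: {psi(development, recent):.3f}")
```

The shift here is deliberately extreme, so the result lands far above the 0.25 line; real drift is usually subtler, and heavily tied scores can produce duplicate cut points and empty bins, which the floor only papers over.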

If PSI is the early warning on the incoming population, then once outcomes accumulate you re-measure the performance itself. As real defaults settle, you periodically recompute AUC and KS, plus predicted default rate against actual default rate, to track the decay. But default only settles well after the fact, as we saw in Part 1, so while you wait on outcomes you watch leading signals first: the score distribution, the approval rate, the distribution of decline reasons. The whole book can look fine while a particular segment falls apart first, so you also break the metrics out by meaningful groups like age band or product. And when these metrics cross a pre-set threshold, you decide whether re-calibrating the probabilities alone is enough or whether to retrain the model from scratch.

Stability is one and the same with the out-of-time validation Part 3 stressed. This distribution shift is exactly why a model that looked fine under randomly shuffled validation collapses under time-respecting validation and in real operation. That’s why a credit model isn’t build-it-well-once-and-you’re-done; it’s something you have to keep monitoring.

Translating to the business: cutoff and trade-offs

Metrics only mean something once you turn them into a decision. In credit, that decision is usually the cutoff — from what score do you approve?

Raise the cutoff and declines go up, so defaults fall, but the approval rate falls with them. Lower it and it’s the reverse. So you look at a curve, not a single score. As you vary the approval rate, how does the default rate among approved customers change? If you decline the riskiest few percent, what percent of all defaults do you filter out? You track both with gains and lift curves.

Given a single cutoff, you weigh two things together. Of all the defaulters, what percent does that threshold catch (recall)? And of the people it catches, what percent are actually defaulters (precision)? Raise the cutoff to decline more, and you catch more defaulters (recall up), but good customers get swept into the declined group and precision drops. The reverse holds too. The PR curve from the discrimination section is exactly this relationship between the two, and the gains curve plots the recall part of it against the approval rate. So you don’t look at a single point; you set the cutoff with recall and precision side by side.
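To see the trade-off in numbers, here is a small sketch that evaluates one cutoff at a time. It assumes a credit score where customers scoring below the cutoff are declined, and treats "declined" as "caught"; `cutoff_tradeoff` and the toy book are hypothetical.

```python
def cutoff_tradeoff(credit_scores, defaulted, cutoff):
    """Approval rate, recall and precision when scores below `cutoff` are declined."""
    declined = [d for s, d in zip(credit_scores, defaulted) if s < cutoff]
    n_bad = sum(defaulted)
    caught = sum(declined)   # defaulters among the declined
    return {
        "approval_rate": 1.0 - len(declined) / len(defaulted),
        "recall": caught / n_bad if n_bad else 0.0,
        "precision": caught / len(declined) if declined else 0.0,
    }


# Toy book (made up): 3 defaulters among 8 customers.
scores = [520, 560, 600, 640, 680, 720, 760, 800]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

for cutoff in (600, 680):
    result = cutoff_tradeoff(scores, labels, cutoff)
    print(cutoff, {k: round(v, 2) for k, v in result.items()})
```

Moving the cutoff from 600 to 680 catches the last defaulter but also declines a good customer, so recall rises while precision and the approval rate fall. Sweeping the cutoff over every score and collecting the recall and approval-rate pairs traces out the gains curve.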

When you replace an existing model with a new one, you look at the swap set. At the same approval rate, you compare who gets newly approved (swap-in) and who gets newly declined (swap-out), to confirm you’re really taking in the better people and filtering out the worse ones. A one-line improvement in an average metric means nothing if the customers who actually change turn out to be the wrong ones. Recall the asymmetric costs from Part 0: depending on which mistake is more expensive, the weight you put on the cutoff and the trade-offs shifts.
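Identifying the swap set is a matter of set differences. The sketch below assumes both models output a credit score (higher means safer), that each model approves the same top share of applicants, and that scores are keyed by a customer ID. The names and scores are made up.

```python
def swap_sets(old_scores, new_scores, approval_rate):
    """Return (swap_in, swap_out) customer IDs at a fixed approval rate.

    old_scores / new_scores: dict mapping customer ID -> credit score,
    covering the same customers. Higher score means safer.
    """
    n_approve = int(len(old_scores) * approval_rate)

    def approved(scores):
        ranked = sorted(scores, key=scores.get, reverse=True)
        return set(ranked[:n_approve])

    old_ok, new_ok = approved(old_scores), approved(new_scores)
    swap_in = new_ok - old_ok    # newly approved by the new model
    swap_out = old_ok - new_ok   # newly declined by the new model
    return swap_in, swap_out


# Toy applicants (made up).
old = {"a": 700, "b": 650, "c": 600, "d": 550}
new = {"a": 710, "b": 580, "c": 640, "d": 560}

swap_in, swap_out = swap_sets(old, new, approval_rate=0.5)
print("swap-in:", sorted(swap_in), " swap-out:", sorted(swap_out))
```

At a fixed approval rate the two sets are always the same size, so the comparison that matters is the quality of the customers in each, not the count.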

Wrap-up

Evaluating a credit model doesn’t end with one number. You ask three questions together.

Discrimination: does it rank well? AUC and Gini are the headline, and for imbalance like default you look at PR-AUC. KS is the traditional regular in regulation and scorecards.
Calibration: do the probabilities match the actual default rate? Plot predicted against actual by bin to check with your eyes, and use Brier or ECE for the numbers. It can be off even when discrimination is good.
Stability: does it hold up over time? Watch distribution shift with PSI. It’s one and the same with out-of-time validation.

Where ordinary ML stops at discrimination alone, credit looks further, at calibration and stability. Because this is work where you price with probabilities and apply the model to the future.

In Part 6 we change direction and move from correlation to causation. A question like “does raising a credit limit increase defaults?” isn’t a prediction problem but a causal one, and the answer comes from causal inference and experiments.
