[Basics] Part 4. Building a credit model: scorecards and trees

In Part 3 we looked at which model to choose. In credit, trees come out ahead on raw performance, but where explanation is required by law, logistic regression is preferred. This piece is about actually building those two. If Part 3 was “what do you pick,” this one is “how do you build it.”

Let me lead with the punchline: the two paths overlap more than you’d think. Defining the target, accounting for rejected applicants, calibrating the probabilities — those steps are the same whatever the model is. What diverges is one part in the middle: how you engineer the variables and how you explain the result. We’ll walk through the shared ground and the fork in turn.

The same starting point

Whichever model you head toward, the beginning is identical.

First you decide what counts as a default and how long you’ll watch before you call it. This is the performance window from Part 1. Whether you treat 60 days past due as a default or 90, whether you observe for 12 months or 24, changes the ground-truth label itself, and once the label changes, the model changes with it. Then you check for the leakage from Part 2. If future information that wasn’t available at decision time seeps into a variable, your validation scores look great and the model collapses in production.
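To make the point about labels concrete, here is a minimal sketch. The loan history, the function name label_default, and the thresholds are all made up for illustration; the only thing it shows is that the same loan flips between good and bad depending on the definition you pick.

```python
from typing import Iterable, Tuple


def label_default(history: Iterable[Tuple[int, int]],
                  dpd_threshold: int = 90,
                  window_months: int = 12) -> int:
    """Return 1 if days past due reached dpd_threshold inside the window."""
    return int(any(
        month <= window_months and dpd >= dpd_threshold
        for month, dpd in history
    ))


# Hypothetical loan: (month on book, worst days past due in that month).
loan = [(3, 0), (9, 65), (10, 30), (15, 95)]

# Same loan, four target definitions.
labels = {
    (dpd, window): label_default(loan, dpd, window)
    for dpd in (60, 90)
    for window in (12, 24)
}
print(labels)  # {(60, 12): 1, (60, 24): 1, (90, 12): 0, (90, 24): 1}
```

Under "90 days within 12 months" this customer is good; under the other three definitions the same customer is a default.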

Up to here it’s the same whether you’re using logistic regression or a tree. The model splits after this.

With logistic regression: the scorecard

In core underwriting, where explanation is required by law, you build a scorecard with logistic regression. A scorecard is, in a word, a points table. You assign points to each of a customer’s characteristics, add them all up, and out comes a credit score — a model you can explain on a single sheet of paper.

The scorecard starts from a decision not to use variables as they are. You take a variable like income or credit limit, split it into a handful of bins, then convert each bin into a value called WOE (Weight of Evidence). WOE is a log-ratio: the share of all good customers that falls in a bin, divided by the share of all defaulters that falls in the same bin, on a log scale. If good customers pile up in a bin, its WOE goes one way; if defaulters pile up, it goes the other. Once you do this, and merge neighboring bins where the pattern zigzags, a jagged raw variable takes on a monotone relationship with default risk.
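The WOE calculation can be sketched in a few lines. This is a minimal version under assumptions of my own: bins are already assigned, the sign convention is good over bad (so positive WOE means safer than average; some teams flip it), and a small smoothing constant keeps empty cells from blowing up the logarithm. The counts are invented.

```python
import math
from collections import Counter


def woe_by_bin(bins, defaulted, smoothing=0.5):
    """WOE per bin = ln(share of all goods in bin / share of all bads in bin)."""
    goods, bads = Counter(), Counter()
    for b, y in zip(bins, defaulted):
        (bads if y else goods)[b] += 1
    labels = sorted(set(goods) | set(bads))
    # Smoothing is added to every cell, so the totals grow by the same amount.
    total_good = sum(goods.values()) + smoothing * len(labels)
    total_bad = sum(bads.values()) + smoothing * len(labels)
    return {
        b: math.log(((goods[b] + smoothing) / total_good)
                    / ((bads[b] + smoothing) / total_bad))
        for b in labels
    }


# Hypothetical income bins: bin -> (number of goods, number of defaulters).
counts = {"low": (50, 50), "mid": (80, 20), "high": (95, 5), "missing": (30, 20)}
bins, defaulted = [], []
for name, (n_good, n_bad) in counts.items():
    bins += [name] * (n_good + n_bad)
    defaulted += [0] * n_good + [1] * n_bad

woe = woe_by_bin(bins, defaulted)
for name in ("low", "mid", "high", "missing"):
    print(f"{name:8s} {woe[name]:+.3f}")
```

Note that "missing" gets its own bin and its own WOE, which is exactly how the transform absorbs missing values without imputation.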

WOE is beloved in credit for several reasons. It absorbs missing values and outliers naturally as their own bins, it handles categorical variables the same way, and above all it pairs well with logistic regression. Since every variable is already translated into the language of log-odds, the regression coefficients all sit on the same scale, and each one reads directly as how much weight that variable carries. How useful a variable is gets summarized by its IV (Information Value). Roughly, below 0.1 is weak, around 0.3 is usable, and much above 0.5 you start suspecting leakage instead. A variable that fits too well is one you should first suspect of having future information leak in.
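IV is the WOE of each bin weighted by how differently goods and defaulters are spread across that bin, summed up. A minimal sketch, assuming every bin has at least one good and one defaulter (no smoothing here) and using made-up counts:

```python
import math


def information_value(counts):
    """counts maps bin -> (n_good, n_bad); every cell is assumed non-zero."""
    total_good = sum(g for g, _ in counts.values())
    total_bad = sum(b for _, b in counts.values())
    iv = 0.0
    for n_good, n_bad in counts.values():
        share_good = n_good / total_good
        share_bad = n_bad / total_bad
        # (difference in shares) x (WOE of the bin)
        iv += (share_good - share_bad) * math.log(share_good / share_bad)
    return iv


# Three hypothetical variables, two bins each.
useless = information_value({"a": (500, 50), "b": (500, 50)})
medium = information_value({"a": (600, 40), "b": (400, 60)})
suspicious = information_value({"a": (900, 20), "b": (100, 80)})

print(round(useless, 3), round(medium, 3), round(suspicious, 3))
```

The first variable separates nothing and scores zero, the second lands in the usable range, and the third is so far above 0.5 that the first question should be where its values come from.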

Fit the regression and the output is log-odds. Awkward for a human to read. So you shift it onto a business score — that 600-or-700 credit score you’ve heard of.

score = offset + factor × log-odds

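The key here is the PDO (Points to Double the Odds). It sets how many points the score moves when the odds of repaying get twice as good, which fixes the factor at PDO / ln 2; set PDO to 20 and the score rises 20 points every time the odds double. The offset then pins a reference score of your choosing to reference odds of your choosing. (If the regression was fit on default, flip the sign of the log-odds first, so that higher means safer.) The result is an intuitive scale where a higher score means lower risk. And because this score decomposes into the sum of per-variable points, you can explain, item by item, exactly why a given customer’s score came out low. This is why the scorecard is so strong for adverse-action notices.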
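Here is the scaling worked through as a sketch. The anchor (600 points at good-to-bad odds of 50:1), the intercept, and the per-variable terms are hypothetical, and the log-odds are assumed to be the odds of repaying, so higher is better.

```python
import math


def scaling_params(pdo=20.0, base_score=600.0, base_odds=50.0):
    """factor = PDO / ln 2; offset pins base_score to base_odds (good:bad)."""
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    return offset, factor


def to_score(log_odds_good, offset, factor):
    return offset + factor * log_odds_good


def points_table(intercept, terms, offset, factor):
    """terms maps variable -> coefficient x WOE for one applicant.

    The offset and intercept are spread evenly across the variables so that
    the per-variable points add up to the total score.
    """
    base = (offset + factor * intercept) / len(terms)
    return {name: base + factor * value for name, value in terms.items()}


offset, factor = scaling_params()
score_at_base = to_score(math.log(50), offset, factor)     # about 600
score_at_double = to_score(math.log(100), offset, factor)  # about 620

# Hypothetical applicant: intercept plus coefficient x WOE per variable.
intercept = 3.0
terms = {"utilization": -0.8, "income": 0.4, "delinquency": -0.3}
points = points_table(intercept, terms, offset, factor)
total = to_score(intercept + sum(terms.values()), offset, factor)

print(round(score_at_base), round(score_at_double))
print({k: round(v, 1) for k, v in points.items()}, round(total, 1))
```

The variables with the fewest points are the ones you would read off as the reasons for a low score.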

With trees: GBM

Where after-the-fact explanation is enough, you use a tree as the final model. That’s the choice from Part 3. Going with a tree changes two things in the middle.

First, the feature engineering gets lighter. A tree handles nonlinearity, interactions, and missing values on its own, so you don’t have to run WOE binning. It leans closer to the “feed the domain features in as they are” side from Part 3. You build a variable like utilization (usage over limit) or a recent delinquency trend and put it straight in.

Second, you add the explanation after the fact. With a scorecard the points table itself is the explanation; with a tree it isn’t. Instead you use SHAP to compute, after the fact, how much each variable pushed this particular prediction up, with feature importance as the portfolio-level view, and carry that over into the adverse-action reasons. Then there is monotonicity, like “the higher the income, the lower the risk.” Where the scorecard forced it through WOE, a tree enforces something similar with monotone constraints, so that even when the data briefly misbehaves, the model doesn’t move in a nonsensical direction.
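The step from per-variable contributions to adverse-action reasons can be sketched without any library. The numbers below are invented stand-ins for what a SHAP-style attribution would give you for one applicant (positive means the variable pushed predicted default risk up); the function name and the cap of two reasons are my own assumptions, not a regulatory rule.

```python
def top_adverse_reasons(contributions, max_reasons=2):
    """Return the variables that raised predicted risk the most, largest first."""
    risk_raising = [
        (name, value) for name, value in contributions.items() if value > 0
    ]
    risk_raising.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in risk_raising[:max_reasons]]


# Hypothetical per-applicant contributions to the predicted log-odds of default.
applicant = {
    "utilization": 0.42,
    "recent_delinquency_trend": 0.15,
    "income": -0.30,       # lowered the risk, so never a reason for denial
    "account_age": 0.03,
}

reasons = top_adverse_reasons(applicant)
print(reasons)  # ['utilization', 'recent_delinquency_trend']
```

Whether those attributions are faithful to the model is a separate question, and that check is part of the governance burden a tree brings with it.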

So what actually differs

Set the two paths side by side and the difference becomes clear.

With the scorecard, the explanation is baked into the model. Show the points table and the reasons fall out; a regulator can peer at every coefficient. In exchange, its expressive power is limited, so you give up a little performance.

The tree is the reverse. You gain performance but attach the explanation from outside, and you carry the burden of separately checking whether that after-the-fact explanation is real. The variables take less handling, but the model governance gets heavier.

So the choice, in the end, is settled by purpose. Core underwriting, where reason disclosure is mandated by law and a regulator looks at the coefficients, leans toward the scorecard; a use case where after-the-fact explanation is accepted and performance matters leans toward the tree. That’s the fork from Part 3.

What you go through either way

That was the part that splits by model. But whichever one you picked, two things remain that you absolutely have to go through at the end: reject inference and calibration. Both are needed the same whether the model is a tree or a logistic regression.

Reject inference. The homework that’s followed us since Part 0. It’s selection bias. The model learns from the outcomes of customers approved in the past. Yet what it actually has to judge is the full set of new applicants coming in. For the customers you rejected, whether they’d have repaid or not, the outcome simply doesn’t exist. Point a model built only on approved customers at the whole applicant population as-is, and it’s biased.

The technique that tries to fill that gap is reject inference. You estimate the rejected customers’ outcomes from their scores and slot them in, or weight by the inverse of the approval probability (the same logic as the bias correction from Part 2), or pull in what became of rejected customers at other lenders from credit-bureau data. Honestly, reject inference is no silver bullet. It’s the work of filling in the unseen with an assumption, so when the assumption is wrong, the result is wrong along with it.
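Of the three approaches, inverse-probability weighting is the easiest to show in a few lines. This is a minimal sketch, not rejectkit’s API: it assumes you already have an approval probability for every booked loan, and the clipping floor is my own addition to keep a handful of unlikely approvals from dominating. The segments and counts are invented.

```python
def ipw_default_rate(approved, min_propensity=0.05):
    """approved: list of (approval_probability, defaulted) for booked loans."""
    weighted_bads = total_weight = 0.0
    for p_approve, defaulted in approved:
        # Each booked loan stands in for 1 / p applicants like it.
        weight = 1.0 / max(p_approve, min_propensity)
        weighted_bads += weight * defaulted
        total_weight += weight
    return weighted_bads / total_weight


# Hypothetical book: a segment approved 90% of the time (90 loans, 5 defaults)
# and a segment approved 20% of the time (20 loans, 6 defaults).
approved = (
    [(0.9, 0)] * 85 + [(0.9, 1)] * 5
    + [(0.2, 0)] * 14 + [(0.2, 1)] * 6
)

naive_rate = sum(d for _, d in approved) / len(approved)
weighted_rate = ipw_default_rate(approved)
print(round(naive_rate, 3), round(weighted_rate, 3))  # 0.1 0.178
```

The booked loans look like a 10% default population; reweighted to the applicants they represent, it is closer to 18%. The catch is the assumption underneath: applicants whose approval probability was effectively zero have nobody standing in for them, and no amount of weighting fixes that.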

So before you trust it, the right move is to first measure whether this correction actually helps on your data. I’ve built and released a Python library called rejectkit for this: it bundles several reject inference techniques behind a single API and, above all, ships a benchmark that measures their effect before you commit to one. I’ve written up the detailed usage and real-data results separately in Reject inference and rejectkit.

The most trustworthy solution is a different one entirely: a controlled experiment where you deliberately approve a small random slice to secure the real outcomes. Because it hands you fact, not an estimate. This thread continues into causal inference and experiments in Part 6.

Calibration. One thing you have to keep separate. Getting the ranking of who’s riskier right (discrimination) and getting exactly what percent a person’s default probability is (calibration) are different problems. In credit, calibration is mandatory. Recall EL = PD × LGD × EAD from Part 1. To set aside provisions, to price by risk, to compute expected loss, the probability itself has to be right. Ranking alone can’t set a price. It matters even more with a tree, because the score a tree puts out isn’t a default probability as-is.
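A small arithmetic sketch of why ranking alone can’t set a price. The three loans, the LGD, and the EAD are invented; the two PD columns order the customers identically but sit at different levels.

```python
def expected_loss(pd_, lgd, ead):
    """EL = PD x LGD x EAD for one exposure."""
    return pd_ * lgd * ead


LGD, EAD = 0.4, 10_000  # hypothetical: 40% loss given default, 10,000 each

calibrated_pd = [0.01, 0.03, 0.10]
raw_score = [0.05, 0.15, 0.50]  # same ordering, inflated levels


def ranking(values):
    """Indices of the loans from safest to riskiest."""
    return sorted(range(len(values)), key=lambda i: values[i])


same_ranking = ranking(calibrated_pd) == ranking(raw_score)
el_calibrated = sum(expected_loss(p, LGD, EAD) for p in calibrated_pd)
el_raw = sum(expected_loss(p, LGD, EAD) for p in raw_score)

print(same_ranking, round(el_calibrated), round(el_raw))  # True 560 2800
```

Discrimination is identical, yet the provision you would set aside differs by a factor of five.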

So you calibrate the probability the model puts out to the actual default rate. You use something like Platt scaling or isotonic regression, and if you sampled the classes unevenly, you undo that ratio here too. One more thing: PD itself comes in two flavors, a PIT (point-in-time) figure that reflects the risk at the current moment as-is, and a TTC (through-the-cycle) figure that smooths over the business cycle. Accounting standards (IFRS 9) look mainly at PIT, capital regulation (Basel) at TTC. Even the same customer’s default probability becomes a different number depending on what you use it for.
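Undoing the sampling ratio is the one calibration step that has a closed form, so it makes a compact sketch. The assumption here is a common setup: all defaulters were kept for training and only a fraction keep_rate of good customers. The function name and the numbers are hypothetical.

```python
def undo_downsampling(p_sampled, keep_rate):
    """Map a probability learned on downsampled goods back to the population.

    Keeping only keep_rate of the goods inflates the odds of default by
    1 / keep_rate, so the correction multiplies the odds back by keep_rate.
    """
    return keep_rate * p_sampled / (keep_rate * p_sampled + (1.0 - p_sampled))


# Hypothetical: the model was trained with 10% of good customers kept.
keep_rate = 0.1
model_outputs = [0.10, 0.50, 0.90]
corrected = [undo_downsampling(p, keep_rate) for p in model_outputs]

print([round(p, 4) for p in corrected])  # [0.011, 0.0909, 0.4737]
```

The ordering of customers is untouched; only the level moves. Platt scaling or isotonic regression would then handle whatever miscalibration remains after this correction.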

Wrapping up

Building a credit model, it turns out, is less about the type of model than about what comes before and after it.

The start is the same. You define the target and the performance window and check for leakage.
Go with logistic regression and you transform variables with WOE and build a points table with PDO. The explanation is contained inside the model.
Go with a tree and you feed the domain features in as they are and build the explanation after the fact with SHAP and the like. You gain performance in exchange for adding the explanation afterward.
The end is the same too. You reduce bias with reject inference and calibrate the probability. These two you always go through, regardless of the model.

We’ve now been through choosing a model (Part 3) and building one (this piece). In Part 5 we’ll look at how you evaluate the model you built — from KS, the metric most beloved in credit, to Gini and AUC, and how to check calibration.