[Basics] Part 3. Where deep learning doesn't win: machine learning for scoring

Part 2 laid the statistical groundwork, so now it’s time for machine learning. In credit too, you ultimately pick a model by looking at performance. It’s just that credit has its own factors that determine performance, and one more axis you have to weigh beyond performance.

The single biggest thing that decides performance is the shape of the data. Credit data is tabular: one customer per row, each cell holding income, or a limit, or a delinquency count — structured data. And on tabular data, the deep learning that conquered images and speech turns out to be surprisingly weak. Tree-based boosting wins instead.

At the end of Part 2 I promised two things: why logistic regression is still in use in practice, and why cross-validation in finance has to be done differently. Let me add two more here: why the thing that actually gets used in practice is a tree model, and what credit-specific constraints keep following you even after you’ve picked a model.

On tabular data, trees win

Deep learning spent the last decade conquering images, then speech, then natural language, one after another. That left the impression that “if you stack a big enough neural net on good data, you win.” But on tabular structured data, that impression doesn’t hold up well.

The reason lies in the nature of the data. Images have strong structure between neighboring pixels, sentences between adjacent words. Deep learning is good at digging into that structure. Tabular data, by contrast, has cells whose units and meanings are all over the place, with no such spatial structure. On top of that, credit data doesn’t pile up by the millions the way ImageNet does. In this kind of setting, tree-based boosting (XGBoost, LightGBM) fits better. Its approach of picking a threshold per feature and slicing along it is a natural way to handle cells that are each their own thing.
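To make “picking a threshold per feature and slicing along it” concrete, here is a minimal sketch of the split search for a single feature. It is not how XGBoost or LightGBM are implemented (they use gradient-based gains and histograms); it only shows the basic idea using Gini impurity. The function names and the toy data are assumptions made up for illustration.

```python
from typing import List, Tuple

def gini(labels: List[int]) -> float:
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_threshold(values: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("nan"), gini(labels))  # baseline: no split at all
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # cannot split between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((lo + hi) / 2.0, score)
    return best

# Hypothetical toy data: utilization ratio and default flag.
utilization = [0.10, 0.25, 0.40, 0.55, 0.80, 0.90, 0.95, 0.30]
defaulted = [0, 0, 0, 0, 1, 1, 1, 0]

threshold, impurity = best_threshold(utilization, defaulted)

# Rescaling the feature (say, ratio -> per-mille) moves the threshold
# but leaves the resulting partition, and so the impurity, unchanged.
scaled_threshold, scaled_impurity = best_threshold(
    [u * 1000 for u in utilization], defaulted
)
```

The last two lines are the point for tabular data: a split only cares about the ordering within one column, so columns in different units never have to be reconciled with each other.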

This isn’t just anecdote. Benchmark studies that put deep learning and trees head to head on tabular data keep landing on the same conclusion. “Tabular Data: Deep Learning is Not All You Need” (Shwartz-Ziv & Armon, 2021) reported that XGBoost outperformed the deep learning models proposed at the time across several datasets, and “Why do tree-based models still outperform deep learning on typical tabular data?” (Grinsztajn et al., 2022) showed across 45 datasets that tree-based methods are the current state of the art on medium-sized tabular data. So when you pick a model in credit by looking at performance, very often you end up with tree boosting as the final model.

Lacking a performance edge isn’t the only reason. In practice, a black-box model is a risk in itself. It’s hard to explain why it made a given call, hard to gauge how it will move when the inputs shift a little, and heavy on validation and governance. This is a field where one big mistake is expensive, so credit is instinctively wary of that kind of opacity. Trees aren’t perfectly transparent either. But a tree pays for its opacity with performance, and SHAP or feature importances let you buy back part of the explanation. On tabular data, deep learning brings in an even darker box without paying for it in performance, so there’s little to gain for the risk.

None of this means deep learning is useless in credit. Its place just isn’t the core scoring model; it’s the unstructured, supplementary data: for example, pulling features out of support-chat text or transaction sequences and handing them to the main model. As for putting a neural net in front of the core variables already laid out in a table, there’s no clear reason to, at least for now.

Why logistic regression survived

On performance alone the scale tips toward trees, but logistic regression hasn’t disappeared. Its reason for staying isn’t performance — it’s explanation.

In some domains, the model has to be interpretable in itself: underwriting where you’re legally required to disclose the reason for a rejection, or regulatory-capital models where the supervisor scrutinizes each coefficient one by one. In those settings, logistic regression, where each coefficient is itself an explanation, and the scorecard that translates it into points are still strongly preferred. You can enforce monotonicity, the output is stable, and reason disclosure comes naturally.
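Here is a small sketch of what “a coefficient is an explanation” can look like in code. The coefficients, the baseline customer, and the variable names are all hypothetical values chosen for illustration, not a fitted model, and ranking contributions against a baseline is just one simple way to derive reasons.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical coefficients, for illustration only (not a fitted model).
INTERCEPT = -3.0
COEFS: Dict[str, float] = {
    "utilization": 2.0,       # higher usage vs. limit -> higher risk
    "delinq_count_6m": 0.8,   # more recent delinquencies -> higher risk
    "months_on_book": -0.02,  # longer relationship -> lower risk
}
# Hypothetical reference customer (e.g. portfolio averages).
BASELINE: Dict[str, float] = {
    "utilization": 0.30,
    "delinq_count_6m": 0.2,
    "months_on_book": 36.0,
}

def log_odds(x: Dict[str, float]) -> float:
    """Linear score of the logistic model."""
    return INTERCEPT + sum(COEFS[k] * x[k] for k in COEFS)

def predict_pd(x: Dict[str, float]) -> float:
    """Probability of default from the log-odds."""
    return 1.0 / (1.0 + math.exp(-log_odds(x)))

def reasons(x: Dict[str, float], top_n: int = 2) -> List[Tuple[str, float]]:
    """Variables that push log-odds furthest above the baseline customer."""
    contrib = {k: COEFS[k] * (x[k] - BASELINE[k]) for k in COEFS}
    ranked = sorted(contrib.items(), key=lambda kv: kv[1], reverse=True)
    return [(k, v) for k, v in ranked[:top_n] if v > 0]

applicant = {"utilization": 0.95, "delinq_count_6m": 2.0, "months_on_book": 6.0}
top_reasons = reasons(applicant)
```

Because the model is linear in log-odds, the per-variable contributions add up exactly to the gap between the applicant and the baseline. That exactness is what a post-hoc method has to approximate for a tree.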

When you do use a tree as the final model, you build that explanation after the fact. You look at feature importances to see which variables matter, use SHAP to measure how much each variable contributed to an individual prediction, and then translate that into a reason for rejection. It’s an accepted approach in practice, and it’s how you meet the explanation requirement while still using a tree. But a post-hoc explanation is an interpretation of the model seen from the outside, not the model itself, so a burden remains: you have to separately check the stability and fidelity of those reasons.

To sum up, model choice is a trade between two axes. On performance, trees lead; in settings that must be interpretable in themselves, logistic regression leads. Most real-world work picks somewhere in between, weighing performance against the demand for explanation.

Validation has to run into the future

This is where the second promise from Part 2 comes in. In credit, you have to do cross-validation differently.

Ordinary k-fold cross-validation shuffles the data at random into several pieces and takes turns using one piece for validation. But when you shuffle at random like this in credit, the future leaks into the past. Train on 2024 customers mixed with 2025 customers, and the model is being evaluated having peeked a little at the future. Yet deployment always happens in the future. You train on past data and apply to new customers who haven’t arrived yet.

So validation in credit has to respect time.

Out-of-time validation: train up to some point in the past, and validate on the period after it. You’re watching how well the model holds up on a future it has never seen.
Keep the performance window: as we saw in Part 1, default is only confirmed a good while later. Treating recent customers whose outcomes haven’t matured yet as if they were ground truth is leakage.
The same customer on one side only: if one customer’s record lands in both training and validation, the model has effectively memorized the answer.
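The three rules above can be sketched as a single split function. This is a minimal illustration under stated assumptions: the `Row` layout, the 12-month performance window, and the choice to drop overlapping customers from the validation side are all hypothetical, and a real pipeline would have its own definitions.

```python
from datetime import date
from typing import List, NamedTuple, Tuple

class Row(NamedTuple):
    customer_id: str
    obs_date: date   # date the features were observed
    defaulted: int   # outcome flag, only trustworthy once the window has passed

def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to 28 for simplicity."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, min(d.day, 28))

def out_of_time_split(
    rows: List[Row], cutoff: date, as_of: date, window_months: int = 12
) -> Tuple[List[Row], List[Row]]:
    # 1) Keep the performance window: drop rows whose outcome has not matured.
    matured = [r for r in rows if add_months(r.obs_date, window_months) <= as_of]
    # 2) Out-of-time: train strictly before the cutoff, validate on or after it.
    train = [r for r in matured if r.obs_date < cutoff]
    # 3) Same customer on one side only: drop training customers from validation.
    train_ids = {r.customer_id for r in train}
    valid = [
        r for r in matured
        if r.obs_date >= cutoff and r.customer_id not in train_ids
    ]
    return train, valid

# Hypothetical toy records.
rows = [
    Row("A", date(2023, 1, 15), 0),
    Row("B", date(2023, 3, 10), 1),
    Row("A", date(2023, 9, 15), 0),   # customer already in training
    Row("C", date(2023, 10, 5), 0),
    Row("D", date(2024, 11, 20), 0),  # outcome not matured as of mid-2025
]
train, valid = out_of_time_split(rows, cutoff=date(2023, 7, 1), as_of=date(2025, 6, 30))
```

In this toy run only customer C survives into validation: A’s later record is dropped because A is already in training, and D is dropped because the outcome has not had 12 months to mature.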

The overfitting story converges on the same place. What makes overfitting scary in credit is that it barely shows in a random holdout. The real exam comes after time has passed and the population has moved. The most common cause of what I called in Part 0 “validation metrics look great but the model falls apart in production” is exactly this: validation that ignores time. Whether you use a tree or logistic regression, three habits matter more than any single performance number: pressing the model toward simplicity with regularization, not getting greedy with variables, and pushing validation out into the future.

Feature engineering is 80% of the job

Once you’ve settled on the type of model, what actually decides performance isn’t the algorithm — it’s the variables. The saying that 80% of the value in practice is in feature engineering fits credit especially well. Tree or logistic regression, it’s the same.

Good variables come from the domain. The structure of the card business we saw in Part 1 becomes variables directly: the ratio of usage to limit, the trend of delinquency over the last few months, the regularity of payments, how often the balance touches the limit. A variable engineered from ratios, trends, and recency connects to default better than a single row of raw data.
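As a rough sketch, the variables mentioned above might be derived from a monthly card history like this. The `Month` layout, the feature names, the 3-month comparison, and the 0.95 “near the limit” cutoff are all assumptions for illustration.

```python
from typing import Dict, List, NamedTuple, Optional

class Month(NamedTuple):
    balance: float
    limit: float
    days_past_due: int

def card_features(
    history: List[Month], near_limit: float = 0.95
) -> Dict[str, Optional[float]]:
    """Features from a history ordered oldest -> newest.

    Assumes at least 6 months of history and a positive limit in every month.
    """
    util = [m.balance / m.limit for m in history]
    recent, prior = util[-3:], util[-6:-3]
    delinquent_idx = [i for i, m in enumerate(history) if m.days_past_due > 0]
    return {
        # Ratio: usage relative to limit in the latest month.
        "util_latest": util[-1],
        # Trend: average of the last 3 months minus the 3 months before.
        "util_trend_3m": sum(recent) / 3 - sum(prior) / 3,
        # How often the balance touches the limit.
        "share_months_near_limit": sum(u >= near_limit for u in util) / len(util),
        # Delinquency count over the last 6 months.
        "delinq_months_6m": float(sum(m.days_past_due > 0 for m in history[-6:])),
        # Recency: months since the last delinquency, None if never delinquent.
        "months_since_last_delinq": (
            float(len(history) - 1 - delinquent_idx[-1]) if delinquent_idx else None
        ),
    }

# Hypothetical customer: usage climbing toward the limit, one recent delinquency.
history = [
    Month(200, 1000, 0),
    Month(250, 1000, 0),
    Month(300, 1000, 0),
    Month(600, 1000, 0),
    Month(900, 1000, 30),
    Month(980, 1000, 0),
]
features = card_features(history)
```

The raw rows say “balance 980, limit 1000.” The engineered features say “usage jumped by more than half the limit in three months, and there was a delinquency last month,” which is much closer to what default looks like.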

And here too the axis of explanation follows. In a setting where you have to disclose reasons, you prefer variables that are monotonic and clear in meaning where you can. A variable that connects to the direction of default in a common-sense way is what keeps the model stable and makes the reason easy to explain later. In credit, a few dozen variables with clear meaning often outlast thousands of auto-generated ones.
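One simple way to check whether a candidate variable is monotonic in the sense above is to bin it and look at the default rate per bin. The sketch below assumes hand-picked bin edges and uses toy data with deliberately exaggerated default rates; the function names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence

def default_rate_by_bin(
    values: Sequence[float], labels: Sequence[int], edges: Sequence[float]
) -> List[float]:
    """Default rate per bin, where bin i covers edges[i] <= value < edges[i + 1].

    Assumes every value falls inside the edges and every bin gets at least one row.
    """
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    bads = [0] * n_bins
    for v, y in zip(values, labels):
        i = bisect_right(edges, v) - 1  # index of the bin containing v
        counts[i] += 1
        bads[i] += y
    return [b / c for b, c in zip(bads, counts)]

def is_monotonic_increasing(rates: Sequence[float]) -> bool:
    """True if the default rate never drops from one bin to the next."""
    return all(a <= b for a, b in zip(rates, rates[1:]))

# Hypothetical toy data: utilization ratio and default flag.
utilization = [0.05, 0.10, 0.20, 0.25, 0.35, 0.40, 0.50, 0.55, 0.70, 0.80, 0.90, 1.00]
defaulted = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1]
edges = [0.0, 0.3, 0.6, 1.01]

rates = default_rate_by_bin(utilization, defaulted, edges)
monotonic = is_monotonic_increasing(rates)
```

A variable whose binned default rate zigzags is harder to defend as a rejection reason, even if a tree can squeeze some lift out of it.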

Imbalance — but don’t touch it carelessly

The last one is imbalance. In Part 2 I said default is usually a rare event of 1 to 5%. That’s why the standard machine learning advice tends toward “grow the minority class” — things like oversampling or SMOTE. But in credit, you have to pause a beat here.

The problem is calibration. Tree or logistic regression, if you artificially balance the data to fifty-fifty, the ranking ability may be similar but the probabilities it outputs drift away from the real default rate. We saw EL = PD × LGD × EAD in Part 1, and here PD has to be accurate as a probability itself. Shake the ratio with sampling and that probability inflates, and the moment you compute EL and set a price, things start to go wrong.

So in credit, when handling imbalance, rather than carelessly resampling the data, people often choose to handle it with class weights, or to train while keeping the original ratio and calibrate the probabilities at the end. The very fact that it’s a rare event is information — you shouldn’t erase it.
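To see how far resampling can push PD, and what correcting it looks like, here is a minimal sketch of the standard prior correction: rescale the odds by the ratio of the original class odds to the resampled class odds. It assumes resampling changed only the class ratio (plain random over- or undersampling), so it is not exact for synthetic methods like SMOTE, and it does not replace checking calibration on data at the original ratio. The 2% base rate, LGD, and EAD are hypothetical numbers.

```python
def correct_for_resampling(
    p_sampled: float, base_rate: float, sampled_rate: float
) -> float:
    """Map a probability learned on resampled data back to the original class ratio.

    Rescales the odds by (original prior odds) / (resampled prior odds).
    """
    odds = p_sampled / (1.0 - p_sampled)
    prior_ratio = (base_rate / (1.0 - base_rate)) / (
        sampled_rate / (1.0 - sampled_rate)
    )
    corrected_odds = odds * prior_ratio
    return corrected_odds / (1.0 + corrected_odds)

def expected_loss(pd_: float, lgd: float, ead: float) -> float:
    """EL = PD x LGD x EAD."""
    return pd_ * lgd * ead

# Hypothetical setup: true default rate 2%, training data balanced to 50/50.
raw_pd = 0.5  # what the balanced model outputs for an average customer
fixed_pd = correct_for_resampling(raw_pd, base_rate=0.02, sampled_rate=0.5)

# Hypothetical exposure: LGD 60%, EAD 1000.
el_raw = expected_loss(raw_pd, lgd=0.6, ead=1000.0)
el_fixed = expected_loss(fixed_pd, lgd=0.6, ead=1000.0)
```

In this toy case the uncorrected PD overstates expected loss by a factor of 25 (300 versus 12), even though the ranking of customers would be untouched.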

What remains even after AutoML arrives

Lately AutoML has been entering finance in earnest too — tools that automatically run through model candidates, tune hyperparameters, and automate even part of the feature engineering. There’s no question they cut down the mechanical part of modeling a lot. So does everything I’ve been talking about become unnecessary soon? Quite the opposite. What automation carries off is the routine work, and what remains are exactly the judgments this whole piece has stressed.

First, effective variables are still made by people. AutoML finds the best model for the variables it’s given, but it can’t decide which variables to give it. A variable like “how has usage relative to the limit changed recently” is a hypothesis that only occurs to you if you know the card business. Forming hypotheses from the domain and shaping them into variables, the feature engineering that is 80% of the job, sits outside automation. If you don’t feed in good variables, even the best automatic search just spins in place on top of what it has.

Next, someone who knows the constraints has to pick the model. AutoML usually moves in the direction of maximizing a single metric like accuracy. But in credit, the model that scored highest is often one you can’t use as is — it can’t be explained, or its probabilities are off, or it runs afoul of regulation. Whether to enforce monotonicity, how far to allow which model, how to calibrate the probabilities — someone who understands the constraints decides that.

Finally, interpreting the results and validating them properly remains. Looking into an AutoML-produced model with SHAP and translating it into reasons, and doubting whether that score is real, are both a person’s job. Validation especially. Many automation tools use random cross-validation by default. But as we saw earlier, shuffling at random in credit lets the future leak into the past, producing scores that look better than production. If you don’t know the validation that fits the domain, you can be taken in by the rosy numbers AutoML hands you.

So automation raises the floor; it doesn’t stand in for judgment. Forming hypotheses to make variables, picking a model within the constraints, interpreting the results, and validating honestly: that’s exactly the role this piece has been about.

Wrap-up

Machine learning in credit also picks a model by looking at performance. It’s just that the conclusion differs from the usual intuition.

Credit data is tabular, and on tables the winner is tree-based boosting, not deep learning. Pick on performance and a tree often ends up the final model.
Deep learning has no performance edge on tabular data, and being a black box, it carries big risk too. Its place is unstructured supplementary data, not core scoring.
Logistic regression stays for explanation, not performance. It’s the default in settings that must be interpretable in themselves. When you use a tree, you meet that demand with post-hoc explanation like SHAP.
Validate into the future. Don’t shuffle at random, respect time, and don’t put the same customer on both sides.
What decides performance is the variables more than the algorithm, and good variables come from the domain.
Don’t touch imbalance carelessly. You have to protect the probabilities.
Even after AutoML arrives, the work of making variables from hypotheses, picking a model within the constraints, and interpreting and validating the results remains a person’s job.

More than the type of model, it’s the validation and the constraints around that model that decide how machine learning in credit turns out.

In Part 4 we get into the process of actually building a model. I’ll center it on the standard pipeline for translating logistic regression into a scorecard — engineering variables with WOE and IV, scaling log-odds into points, and calibrating the probabilities. If this piece was about “what to pick,” the next is about “how to build.”