[Paper] SSL falls short of GBM on credit data. But combined, it helps

This is the first “Paper” installment of the series. I take one experiment I actually ran and lay it out at a practitioner’s eye level.

Back in Part 3, I argued that on tabular credit data, deep learning loses to tree-based boosting (GBM), and that where deep learning does earn its keep is not the core model but the unstructured, supplementary data around it. This post is a record of testing exactly that boundary myself, on public data.

The question is simple. Can self-supervised learning (SSL) beat GBM at credit default prediction? SSL learns representations from raw data without labels — the same approach behind BERT and GPT on text, and SimCLR and MAE on images. What happens when you apply it to a customer’s transaction sequence?

Here’s the conclusion up front: the question itself is slightly off. SSL isn’t a candidate to replace GBM. It helps consistently only when you add it to GBM as an auxiliary feature. I’ve broken down how I got there into four parts.

How I ran the experiment

I used public data only. AMEX Default Prediction (Kaggle 2022): 459,000 customers, each with up to 13 months of anonymized transaction history. I split it 80/10/10 at the customer level and froze that split so every experiment used the same one.
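A customer-level split that stays frozen across experiments can be sketched as below. The post does not say how the split was implemented, so this hash-based version, the function name `assign_split`, and the salt string are assumptions for illustration only.

```python
import hashlib

def assign_split(customer_id: str, salt: str = "split-v1") -> str:
    """Map a customer ID to 'train', 'val', or 'test' deterministically.

    The result depends only on the ID and the salt (both hypothetical here),
    so every row of a customer lands in the same split on every run.
    """
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # roughly uniform over 0..99
    if bucket < 80:
        return "train"
    if bucket < 90:
        return "val"
    return "test"

# Example IDs are made up; real AMEX customer IDs are long hashed strings.
example_ids = [f"cust-{i}" for i in range(10000)]
example_splits = [assign_split(c) for c in example_ids]
```

Because nothing is stored, the split cannot drift between experiments as long as the salt stays the same. A saved list of IDs per split would work just as well.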

For the metric, I use the competition’s official score. It averages two things: a normalized Gini coefficient, and the share of defaults (the minority class) captured in the top 4% of predictions. All you need to know is that it’s a tight metric where, around 0.79, a difference of 0.001 actually means something.
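For readers who want to see the metric rather than take it on faith, here is a NumPy sketch of it as I understand the competition’s definition: the mean of a normalized weighted Gini and the default capture rate in the top 4%, with non-defaulters weighted 20 to 1. Treat it as an approximation and check it against the official implementation before relying on it. The function names are my own.

```python
import numpy as np

NEG_WEIGHT = 20.0  # weight on non-defaulters (assumed from the competition's description)

def _sorted_by_score(y_true, y_pred):
    """Return labels and weights ordered from highest to lowest predicted risk."""
    order = np.argsort(-y_pred, kind="stable")
    y = y_true[order]
    w = np.where(y == 0, NEG_WEIGHT, 1.0)
    return y, w

def top4_capture(y_true, y_pred):
    """Share of all defaults found in the top 4% of weighted predictions."""
    y, w = _sorted_by_score(y_true, y_pred)
    cutoff = int(0.04 * w.sum())
    return y[np.cumsum(w) <= cutoff].sum() / y.sum()

def weighted_gini(y_true, y_pred):
    """Unnormalized weighted Gini: area between the Lorenz curve and the diagonal."""
    y, w = _sorted_by_score(y_true, y_pred)
    random = np.cumsum(w / w.sum())
    lorentz = np.cumsum(y * w) / (y * w).sum()
    return ((lorentz - random) * w).sum()

def amex_metric(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Normalize by the Gini of a perfect ranking (the labels themselves).
    g = weighted_gini(y_true, y_pred) / weighted_gini(y_true, y_true)
    d = top4_capture(y_true, y_pred)
    return 0.5 * (g + d)

# Toy check: 10 defaults, 90 non-defaults.
y_toy = np.array([1] * 10 + [0] * 90)
perfect = amex_metric(y_toy, y_toy)
reversed_score = amex_metric(y_toy, 1 - y_toy)
```

On the toy data, a perfect ranking scores 1.0 and a fully reversed ranking scores below zero, which is a quick sanity check that the sort direction is right.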

There are two contenders.

GBM baseline: 1,291 hand-crafted features + LightGBM. Straight from a published top-tier solution’s setup, so it lands at leaderboard top-10 level (test 0.79558).
SSL encoder: a small transformer (about 870K parameters), pretrained four different ways (masking, next-step prediction, contrastive, and a mix).

I ran all of it on a single laptop GPU (8GB) in a little over 20 hours, for zero cloud spend.

Finding 1: on its own, SSL falls short of GBM

Even after fine-tuning all four SSL variants, the best came in at test 0.79267. That falls short of the GBM baseline of 0.79558. The gap is small, about 0.003, but across all eight evaluation combinations, not one beat GBM. The direction is consistent.

The interesting part is what happened when I cut the labels. SSL’s classic selling point is that “it’s strong when labels are scarce.” Here, it was the opposite. I compared the two at 1%, 5%, 25%, and 100% of the labels: GBM won at every level, and the gap actually widened as labels got scarcer. At 1%, the gap stretched to 0.05.

The reason isn’t hard to guess. Training an 870K-parameter transformer on roughly 4,000 labeled customers (the 1% setting) is overkill. LightGBM, by contrast, stays stable on little data thanks to strong regularization, and above all it grows only as much as it needs to, on top of 1,291 features that condense decades of domain knowledge. “On tabular data, trees are strong,” from Part 3, holds up here too.
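The label-budget comparison above can be sketched like this. The nesting (each smaller label set is a subset of the larger ones) and the function name are my assumptions; the post only gives the four fractions. The training size of 367,200 is 80% of roughly 459,000 customers.

```python
import numpy as np

def nested_label_subsets(n_train, fractions=(0.01, 0.05, 0.25, 1.0), seed=0):
    """Return {fraction: row indices}, with each smaller set contained in the larger ones.

    Taking prefixes of one fixed permutation keeps the subsets nested, so a
    score change between budgets is not caused by drawing different customers.
    Note this is a plain random draw, not stratified by label.
    """
    order = np.random.default_rng(seed).permutation(n_train)
    return {f: order[: int(round(f * n_train))] for f in fractions}

subsets = nested_label_subsets(367_200)
sizes = {f: len(idx) for f, idx in subsets.items()}
# The 1% budget comes to 3,672 labeled customers, the "roughly 4,000" in the text.
```

Both contenders would then be trained on the same index set at each budget and scored on the same frozen test split.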

Finding 2: your result changes with how you evaluate

This one was a bit of a shock. I took the exact same pretrained encoder and evaluated it two ways. One freezes the encoder and attaches only a logistic regression on top (a linear probe); the other retrains everything, encoder included (full fine-tune).
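The two protocols differ only in which parameters receive gradient updates. The toy sketch below makes that concrete with a one-layer `tanh` encoder in NumPy. It is not the transformer used in the experiment, and every name and hyperparameter in it is illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train(X, y, W, freeze_encoder, steps=200, lr=0.1):
    """Train a logistic head on top of a toy encoder z = tanh(X @ W).

    freeze_encoder=True  -> linear probe: only the head (v, b) is updated.
    freeze_encoder=False -> full fine-tune: the encoder weights W move too.
    Returns the (possibly updated) encoder weights and the head.
    """
    W = W.copy()
    v = np.zeros(W.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(steps):
        z = np.tanh(X @ W)          # embeddings, shape (n, d)
        p = sigmoid(z @ v + b)      # predicted default probability
        g = (p - y) / n             # gradient of mean log loss w.r.t. the logit
        grad_v = z.T @ g
        grad_b = g.sum()
        if not freeze_encoder:
            # Backpropagate through tanh into the encoder weights.
            grad_W = X.T @ (np.outer(g, v) * (1.0 - z ** 2))
            W -= lr * grad_W
        v -= lr * grad_v
        b -= lr * grad_b
    return W, v, b

# Synthetic stand-in data; W0 plays the role of a pretrained encoder.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] > 0).astype(float)
W0 = rng.normal(scale=0.5, size=(5, 3))

W_probe, _, _ = train(X_demo, y_demo, W0, freeze_encoder=True)
W_tuned, _, _ = train(X_demo, y_demo, W0, freeze_encoder=False)
```

After the probe, `W_probe` is identical to `W0`; after fine-tuning, `W_tuned` has moved. A linear probe therefore measures the pretrained representation as it is, while fine-tuning measures how good a starting point it is.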

And the ranking of the four variants flipped completely depending on the evaluation. The “mix” variant was dead last under the linear probe but first under full fine-tune. Same model, and the score difference between the two evaluations was as much as 0.058.

The lesson here is clear. One paper could look only at the linear probe and conclude “this variant is the weakest,” while another looks only at fine-tuning and concludes the exact opposite, “it’s the strongest.” Both are right within their own experiments. So when you read a tabular-SSL result, the first thing to check is which evaluation protocol they used.

Finding 3: but combined with GBM, it gets better

Here the result turns. Instead of throwing away the 128-dimensional embeddings the SSL encoder produces, I concatenated them onto the existing 1,291 features and fed the lot into GBM.
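The concatenation step is as plain as it sounds. A minimal sketch follows, with placeholder arrays standing in for the real feature matrix and embeddings; the function name is hypothetical, and the GBM fit itself is left out.

```python
import numpy as np

N_HAND, N_EMB = 1291, 128  # dimensions quoted in the post

def augment_features(X_hand, Z_ssl):
    """Append frozen SSL embeddings to the hand-crafted features, row by row."""
    if X_hand.shape[0] != Z_ssl.shape[0]:
        raise ValueError("rows must line up one-to-one by customer")
    return np.hstack([X_hand, Z_ssl])

# Placeholder arrays: 4 customers, zeros for features, ones for embeddings.
X_hand = np.zeros((4, N_HAND), dtype=np.float32)
Z_ssl = np.ones((4, N_EMB), dtype=np.float32)
X_aug = augment_features(X_hand, Z_ssl)  # 1291 + 128 = 1419 columns
```

The one thing to get right is row alignment: both matrices must be ordered by the same customer key before stacking, and the encoder must not have been pretrained or fine-tuned using validation or test labels.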

At first I repeated it across three different pretraining seeds, and all three beat the baseline (mean +0.00142, t=4.1). But three seeds make it easy to be optimistic about the variance, so I bumped it to six and looked again. All six still beat the baseline, but the lift dropped to a mean of +0.00117 (test 0.79675), with a standard deviation of 0.00098 and a t-value of about 2.9 (5 degrees of freedom). Smaller and more honest than the first pass. Still, with all six seeds pointing the same way, the lift is unlikely to be a fluke.
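The six-seed t-value can be reproduced from the summary numbers alone. This sketch assumes the quoted 0.00098 is the sample standard deviation of the six per-seed lifts and that the test is a one-sample t-test against zero lift.

```python
import math

def t_from_summary(mean, sd, n):
    """One-sample t statistic for H0: true mean lift = 0, with n - 1 degrees of freedom."""
    return mean / (sd / math.sqrt(n))

t_six = t_from_summary(mean=0.00117, sd=0.00098, n=6)  # about 2.92
```

For reference, the two-sided 5% critical value at 5 degrees of freedom is about 2.571, so the six-seed result clears it, though with far less margin than t=4.1 suggested.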

The numbers themselves are small. But once you pull apart where this small lift comes from, it isn’t so small after all.

What SSL fills in: rediscovering half of the domain knowledge

The 1,291 features going into GBM aren’t mere preprocessing. Things like “last value,” “trend slope,” “missingness pattern” — they’re closer to decades of domain knowledge pre-etched into the data pipeline. Pretraining done by hand, if you like.
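To make “pretraining done by hand” concrete, here is a sketch of the three feature types named above, computed for one customer’s monthly series. These are illustrative definitions of my own, not the actual recipes behind the 1,291 features.

```python
import numpy as np

def sequence_features(values):
    """Summarize one customer's monthly series (NaN = missing) into three features."""
    v = np.asarray(values, dtype=float)
    t = np.arange(len(v))
    seen = ~np.isnan(v)
    # Last observed value (assumption: skip trailing gaps rather than return NaN).
    last = v[seen][-1] if seen.any() else np.nan
    # Least-squares slope over the observed months; needs at least two points.
    slope = np.polyfit(t[seen], v[seen], 1)[0] if seen.sum() >= 2 else np.nan
    return {"last": last, "slope": slope, "missing_frac": 1.0 - seen.mean()}

feats = sequence_features([1.0, 2.0, np.nan, 4.0])
```

For the example series, the last value is 4.0, the slope is 1.0 per month, and a quarter of the months are missing. Multiply a handful of such aggregations by a couple of hundred raw columns and a feature count in the low thousands follows quickly.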

So how much does SSL’s neural pretraining overlap with that? Drop the top 100 features GBM ranked as most important and retrain, and the score falls by 0.00592. Add the SSL embeddings back in, and 0.00324 of that is recovered. A recovery rate of about 55%.
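The recovery rate is just the ratio of the two deltas. Spelled out, with a function name of my own choosing:

```python
def recovery_rate(drop, recovered):
    """Share of an ablation-induced score drop that is won back.

    drop:      baseline score minus score with the top features removed
    recovered: score with SSL embeddings added back minus the ablated score
    """
    return recovered / drop

rate = recovery_rate(drop=0.00592, recovered=0.00324)  # about 0.547
```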

In other words, SSL rediscovers on its own, from raw data and without any labels, roughly half of what an expert’s top hand-crafted features encode. The other half it can’t reach — because that’s information not present in the raw signal, like the meaning of a business calendar or a qualitative reading of missingness.

Where the lift hides: the customers who look safe

There’s no reason a mean of roughly +0.001 would be spread evenly across all customers. I split customers into deciles by GBM’s predicted score and looked at the lift by bin — and it clustered on one side.

In the safe bins where GBM was confident “this person will almost never default” (deciles 0 to 3, the lowest predicted scores), the lift clustered at +0.02 to +0.03. In the rest, it was near zero.
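A per-decile breakdown of this kind can be sketched as follows. The post does not spell out which per-bin quantity the +0.02 to +0.03 refers to, so the choice of mean log-loss reduction here is my assumption, as are the function names.

```python
import numpy as np

def log_loss_per_row(y, p, eps=1e-12):
    """Binary log loss for each row, with probabilities clipped away from 0 and 1."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def lift_by_decile(y, p_base, p_aug, n_bins=10):
    """Mean log-loss reduction (baseline minus augmented) per baseline-score decile.

    Bin 0 holds the customers the baseline GBM considers safest.
    Positive values mean the augmented model did better in that bin.
    """
    y = np.asarray(y, dtype=float)
    p_base = np.asarray(p_base, dtype=float)
    p_aug = np.asarray(p_aug, dtype=float)
    # Rank by baseline score, then cut the ranks into equal-sized bins.
    ranks = np.argsort(np.argsort(p_base, kind="stable"), kind="stable")
    bins = ranks * n_bins // len(y)
    gain = log_loss_per_row(y, p_base) - log_loss_per_row(y, p_aug)
    return np.array([gain[bins == b].mean() for b in range(n_bins)])

# Synthetic example: one hidden default in the safest decile that only
# the augmented model scores as risky.
y_ex = np.zeros(100)
y_ex[5] = 1.0
p_base_ex = np.linspace(0.01, 0.99, 100)
p_aug_ex = p_base_ex.copy()
p_aug_ex[5] = 0.5
lift_ex = lift_by_decile(y_ex, p_base_ex, p_aug_ex)
```

In the synthetic example the whole gain sits in bin 0 and every other bin is exactly zero. That is the shape of result described above, in miniature.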

Why this matters comes back to the asymmetric costs from Part 1. In credit, the most expensive mistake is misclassifying a defaulting customer as “safe.” The loss is realized in full, 100%. What the SSL embeddings catch is exactly that — the defaults hiding inside customers who look safe, the silent defaults. The real value is larger than the average number lets on.

So, in practice

To sum up.

Don’t try to replace GBM with a single SSL model. On its own it falls short of GBM, and more so the fewer labels you have.
Use it as an auxiliary feature instead. Concatenate one encoder’s 128-dimensional embeddings onto your existing features and feed them to GBM, and you get a steady lift. And the cost is barely an hour of pretraining — lighter than one more sweep of GBM hyperparameters.
Judge by loss per bin, not the average. +0.001 may look small, but the effect of cutting hidden defaults in the safe bins is larger than that.
One encoder is enough. When I got greedy and combined the embeddings from all four encoders, the training score went up but the test score actually fell. Textbook overfitting. The NLP intuition that “the more pretraining objectives you add, the better” didn’t carry over here.

The honest limits

This result holds only within the following limits.

It’s one dataset. Being a single dataset, AMEX, it can’t guarantee generalization to other areas of credit.
The encoder is small. At 870K parameters, there’s room for the 55% recovery rate to rise if you scale it up.
It’s six seeds. The first three gave t=4.1, but bumping to six brought it to t≈2.9 — a more honest, more conservative signal. There’s still room to go further.
It’s a random split, and this data makes a proper out-of-time validation hard. In AMEX’s anonymization, every customer’s last transaction is packed into a single month, March 2018, so splitting by time buys you only a few days’ separation. Testing a real shift in time — a business cycle, seasonality — needs different data with a wider time range. Doing the out-of-time validation properly, the thing Part 3 stressed, is the most important follow-up.

Within these limits, I take three things to be trustworthy signals: all six seeds pointing positive, the 55% recovery rate, and the lift concentrated in the safe bins.

Closing

Whether SSL “works” or “doesn’t work” in credit was the wrong question from the start. The real question was “where and how does it help?” SSL isn’t GBM’s competitor but an auxiliary feature. Its value is concentrated in the hidden defaults among customers GBM thought were safe, and one encoder is enough.

Part 3 said “where deep learning earns its keep is unstructured, supplementary data” — this post is a record of confirming that one sentence in numbers.

Appendix: code and data

Data: AMEX Default Prediction (Kaggle 2022), public data
Code and paper: github.com/HangilKim11/blog-research/tree/main/ssl-credit-risk
Reproduction: a single 8GB laptop GPU, a little over 20 hours