[Review] Can Google's new tabular foundation model TabFM beat GBM in credit? I tested it on public data

This is the first piece in the “Review” strand of the series. It’s where I take a newly released tool and, through a practitioner’s eyes, measure whether it’s actually useful for credit work.

Google recently released TabFM: feed it a table as-is and it predicts with no training and no tuning, and it claims that “zero-shot, it beats even a well-tuned GBM.” If that really holds in credit risk, where GBMs have dominated for years, it’s a big deal. So I pitted it against a carefully built GBM on public credit default data. To give the conclusion first: under the conditions I ran, a well-tuned GBM came out slightly ahead. But this was not a head-to-head under the conditions in which the paper claimed its win, so I can’t declare a winner. More to the point, the key finding is that TabFM came within 0.4pp of ROC-AUC of the tuned GBM with no feature engineering and no tuning.

What TabFM is

TabFM is a foundation model for tabular data that Google released at the end of June 2026. The core idea is zero-shot. It doesn’t retrain per dataset, tune hyperparameters, or engineer features. Instead it reads the training data in as a single context, and prediction is done in one forward pass. This is called in-context learning. Much as GPT takes a handful of examples in the prompt and answers, TabFM takes the whole table as a kind of prompt and answers on the spot.
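To make "prediction with no training step" concrete, here is a toy analogy in plain Python. It is not TabFM's architecture: it is a single softmax-weighted vote over the labeled context rows, roughly what one attention head would do if the query were the new row and the values were the labels. The function name, the distance-based score, and the `temperature` parameter are all illustrative assumptions.

```python
import math


def in_context_predict(context_rows, context_labels, query, temperature=1.0):
    """Toy in-context prediction: softmax-weighted vote over labeled context rows.

    There is no fit step and no learned weight here. The prediction is
    computed from the context table and the query row in a single pass.
    Analogy only, not TabFM's architecture.
    """
    # Negative squared distance as a similarity score (closer rows score higher).
    scores = [
        -sum((q - x) ** 2 for q, x in zip(query, row)) / temperature
        for row in context_rows
    ]
    # Numerically stable softmax over the context rows.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    # Weighted average of the 0/1 labels, read as the probability of class 1.
    return sum(w * y for w, y in zip(weights, context_labels)) / total


# Made-up context table: two rows near the origin (label 0), two near (1, 1) (label 1).
context = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
labels = [0, 0, 1, 1]
p_near_ones = in_context_predict(context, labels, [0.95, 1.0])
p_near_zeros = in_context_predict(context, labels, [0.05, 0.1])
```

Swapping the context table changes the prediction immediately, with nothing to retrain. That is the property the zero-shot claim rests on; the real model replaces the hand-written similarity with pretrained attention layers.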

What makes this possible is an enormous amount of pretraining done in advance. But unlike images or text, there isn’t nearly as much public tabular data, so Google synthesized hundreds of millions of tables with structural causal models (SCMs) and pretrained on those. The learned architecture has three parts. It attends alternately over rows and columns to capture relationships between features, compresses each row into a dense vector, and then predicts on top of that via in-context learning. Think of it as an evolution of the TabPFN line, an earlier attempt at a tabular foundation model.
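As a rough picture of what "synthesizing a table with a structural causal model" means, the sketch below samples one small classification table from a random SCM. Everything about it is an assumption made for illustration (the random DAG, the tanh mechanism, the Gaussian noise, the median threshold for the label); Google's actual generator is described in the source material, not here.

```python
import math
import random


def sample_scm_table(n_rows, n_features, seed=0):
    """Sample one synthetic binary-classification table from a random SCM.

    Toy assumptions (not Google's generator): features are nodes in a random
    DAG, each node is tanh(weighted sum of its parents) plus Gaussian noise,
    and the label thresholds a random linear score of all features.
    """
    rng = random.Random(seed)
    # Random DAG: node j may depend on any earlier node i < j.
    parents = [
        [i for i in range(j) if rng.random() < 0.5]
        for j in range(n_features)
    ]
    weights = [[rng.gauss(0, 1) for _ in ps] for ps in parents]
    label_weights = [rng.gauss(0, 1) for _ in range(n_features)]

    rows, scores = [], []
    for _ in range(n_rows):
        values = []
        for j in range(n_features):
            # Each feature is caused by its parents, then perturbed by noise.
            signal = sum(w * values[i] for w, i in zip(weights[j], parents[j]))
            values.append(math.tanh(signal) + rng.gauss(0, 0.5))
        rows.append(values)
        scores.append(sum(w * v for w, v in zip(label_weights, values)))

    # Threshold at the median score so the two classes are roughly balanced.
    cutoff = sorted(scores)[n_rows // 2]
    labels = [1 if s > cutoff else 0 for s in scores]
    return rows, labels


rows, labels = sample_scm_table(n_rows=200, n_features=5, seed=1)
```

Calling this with many different seeds yields many tables with different causal structure, which is the basic idea behind pretraining on synthetic data when real tables are scarce.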

The selling point is clear: no feature engineering, no hyperparameter tuning. In other words, the effort you pour into a GBM in practice drops to zero. And Google reported that on the TabArena benchmark this zero-shot model outranks painstakingly tuned supervised models, GBM-family models in particular, on Elo rating. That benchmark gathers 38 classification and 13 regression datasets, ranging from 700 to 150,000 samples. The model is out now on Hugging Face and GitHub, and Google says it will land in BigQuery within a few weeks, callable from a single line of SQL (AI.PREDICT).
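For readers who haven't met Elo outside chess: it turns pairwise "model A beat model B on this dataset" outcomes into a single rating per model. The sketch below uses the conventional chess-style constants (scale 400, K = 32, start 1000). Those constants and the sequential update are assumptions; this post doesn't describe TabArena's exact rating procedure, which may differ.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_from_results(results, k=32.0, start=1000.0):
    """Run sequential Elo updates over (winner, loser) pairs.

    `results` is a list of (winner_name, loser_name) tuples, for example one
    tuple per dataset on which the two models were compared.
    """
    ratings = {}
    for winner, loser in results:
        rating_w = ratings.setdefault(winner, start)
        rating_l = ratings.setdefault(loser, start)
        surprise = 1.0 - expected_score(rating_w, rating_l)
        # The winner gains exactly what the loser gives up.
        ratings[winner] = rating_w + k * surprise
        ratings[loser] = rating_l - k * surprise
    return ratings


# Hypothetical names; a single comparison between two unrated models.
demo_ratings = elo_from_results([("model_a", "model_b")])
```

A 400-point rating gap corresponds to an expected win rate of 10/11, about 91%, so an Elo lead summarizes how consistently one model wins across datasets, not how large the margin is on any one of them.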

Why the experiment: does it hold up on credit defaults?

I had one question. This model supposedly beats GBM on general benchmarks. Does it do the same when predicting credit defaults?

In Part 3 I concluded that on tabular credit data it’s tree-based boosting, not deep learning, that wins. TabFM claims to overturn exactly that conclusion, so it was worth re-checking. And credit isn’t like a general benchmark. Default is a rare event; you need not just the ranking but the probability to be accurate (Part 5); and you have to be able to explain the reasons behind a decision (Part 4). Winning on general data doesn’t guarantee winning on credit. So I checked directly, on public data, whether zero-shot TabFM beats a well-tuned GBM in credit too.

Data and models

The data is UCI’s Taiwan credit-card default dataset: 30,000 people, 23 features, a 22% default rate. The main features are the last six months of delinquency status plus billed and paid amounts, and it’s known in the literature as a signal-limited dataset where standard models tend to stall somewhere around AUC 0.77 to 0.78.

The heart of a fair comparison is separating the model from the features. Compare a “feature-engineered GBM” against “raw TabFM” directly and, even if you see a gap, you can’t tell whether it’s the model or the features. So I lined up several baselines side by side.

Name	What it is
GBM raw	LightGBM, no engineered features, no tuning (the effort floor)
GBM tuned	LightGBM + engineered features + Optuna tuning
CatBoost / XGBoost	Strong baselines with features + tuning
TabFM zero-shot	Out of the box, single forward pass (the main event)
TabFM ensemble	Paper preset (derived features + calibration)

The GBM-family models got engineered features such as delinquency dynamics and were tuned with Optuna. TabFM was left out of the box, with zero tuning. I trained every model at the natural class rate, with no resampling or reweighting, so the probabilities would match the actual default rate. For metrics I looked, as credit demands, at both discrimination (ROC-AUC, PR-AUC, KS) and calibration (Brier, ECE), the two axes from Part 5. Since the default rate is 22%, the imbalance isn’t extreme, but I also used PR-AUC to check how well the minority default class is caught. I validated with stratified 5-fold cross-validation.
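To give a feel for what "delinquency dynamics" features can look like on this dataset, here is an illustrative sketch. The column names (PAY_0, PAY_2 to PAY_6, BILL_AMT1 to BILL_AMT6, PAY_AMT1 to PAY_AMT6, LIMIT_BAL) are the ones in the public UCI file; the five derived features and their names are hypothetical examples, not the feature set used in the experiment, which lives in the linked notebooks.

```python
# Repayment status columns in the UCI file, most recent month first.
PAY_COLS = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
BILL_COLS = [f"BILL_AMT{i}" for i in range(1, 7)]
PAID_COLS = [f"PAY_AMT{i}" for i in range(1, 7)]


def delinquency_features(row):
    """Derive a few illustrative delinquency-dynamics features from one row.

    `row` is a dict keyed by the UCI column names. The feature definitions
    are hypothetical examples, not the author's engineered features.
    """
    # Status codes of 0 or below mean no payment delay; clip them to 0.
    late = [max(row[c], 0) for c in PAY_COLS]
    bills = [row[c] for c in BILL_COLS]
    paid = [row[c] for c in PAID_COLS]
    total_bill = sum(max(b, 0) for b in bills)
    limit = row["LIMIT_BAL"]
    return {
        "max_delay": max(late),                       # worst delay in six months
        "months_late": sum(1 for d in late if d > 0),
        "delay_trend": late[0] - late[-1],            # > 0 means getting worse
        "utilization": max(bills[0], 0) / limit if limit else 0.0,
        "paid_ratio": sum(paid) / total_bill if total_bill else 1.0,
    }


# Made-up example row, for illustration only.
example_row = {
    "LIMIT_BAL": 10000,
    "PAY_0": 2, "PAY_2": 2, "PAY_3": 1, "PAY_4": 0, "PAY_5": 0, "PAY_6": -1,
    "BILL_AMT1": 5000, "BILL_AMT2": 4000, "BILL_AMT3": 3000,
    "BILL_AMT4": 2000, "BILL_AMT5": 1000, "BILL_AMT6": 0,
    "PAY_AMT1": 0, "PAY_AMT2": 500, "PAY_AMT3": 500,
    "PAY_AMT4": 500, "PAY_AMT5": 500, "PAY_AMT6": 1000,
}
example_features = delinquency_features(example_row)
```

Features of this kind are what the tuned GBM arms get and the raw arms don't, which is why the comparison keeps the two apart.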

Results

Arm	ROC-AUC	PR-AUC	KS	ECE ↓	Time
GBM tuned (LightGBM)	0.789	0.566	0.443	0.010	548s
XGBoost	0.789	0.565	0.439	0.009	102s
CatBoost	0.788	0.566	0.444	0.011	1179s
TabFM zero-shot	0.785	0.558	0.441	0.022	503s
GBM raw	0.779	0.554	0.429	0.013	0.5s
TabFM ensemble	0.774	0.540	0.418	0.018	268s
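For readers less familiar with the KS and ECE columns, here is a dependency-free sketch of how such numbers can be computed, together with the Brier score. The binning choice for ECE (10 equal-width bins) is an assumption; the post doesn't state which scheme the experiment used, and different schemes give somewhat different values.

```python
def ks_statistic(y_true, y_prob):
    """Largest gap between the cumulative score distributions of the two classes."""
    pairs = sorted(zip(y_prob, y_true))
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    cum_pos = cum_neg = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        # Consume tied scores together so ties don't create a spurious gap.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_pos += 1
            else:
                cum_neg += 1
            j += 1
        best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
        i = j
    return best


def brier_score(y_true, y_prob):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Size-weighted mean of |mean predicted prob - observed rate| per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for _, p in members) / len(members)
        observed = sum(y for y, _ in members) / len(members)
        ece += len(members) / n * abs(mean_prob - observed)
    return ece


# Tiny made-up example: scores separate the classes perfectly but are not exact.
toy_y = [0, 0, 1, 1]
toy_p = [0.1, 0.2, 0.8, 0.9]
toy_ks = ks_statistic(toy_y, toy_p)
toy_brier = brier_score(toy_y, toy_p)
toy_ece = expected_calibration_error(toy_y, toy_p)
```

KS only cares about ordering, so it sits with ROC-AUC on the discrimination axis, while Brier and ECE penalize probabilities that are off even when the ranking is right.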

There are three things to read off this.

First, a well-tuned GBM (0.789) edges out TabFM zero-shot (0.785). All three boosted-tree models sit above TabFM, and the ordering is the same on PR-AUC too (TabFM 0.558, tuned GBM 0.566). The gap is 0.4pp of ROC-AUC (0.004), smaller than the fold-to-fold standard deviation (0.006), so statistically it’s close to a tie, but the direction is consistently GBM on top. “Zero-shot beats a tuned GBM” did not hold on this data.
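Comparing a 0.4pp gap with the fold standard deviation is a quick sanity check. Because both models are scored on the same folds, a slightly sharper look is the paired per-fold difference, sketched below. The per-fold scores in the example are invented placeholders, not the experiment's values (the post reports only the means and the 0.006 fold standard deviation), and the function name is hypothetical.

```python
import math
import statistics


def paired_fold_gap(scores_a, scores_b):
    """Mean per-fold gap, its standard deviation, and a paired t-statistic.

    Both lists must hold one score per fold, in the same fold order.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_gap = statistics.mean(diffs)
    sd_gap = statistics.stdev(diffs)
    if sd_gap == 0:
        return mean_gap, sd_gap, float("inf")
    t_stat = mean_gap / (sd_gap / math.sqrt(len(diffs)))
    return mean_gap, sd_gap, t_stat


# Placeholder numbers for illustration only; NOT results from this experiment.
placeholder_a = [0.80, 0.78, 0.79]
placeholder_b = [0.79, 0.78, 0.78]
gap, gap_sd, t_value = paired_fold_gap(placeholder_a, placeholder_b)
```

With only five folds and one seed, a statistic like this is indicative at best, which fits the "close to a tie, direction favors GBM" reading.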

Second, compare no-effort against no-effort and the story is different. TabFM zero-shot (0.785) clears raw GBM (0.779), with no features and no tuning. As a baseline that takes no setup work it’s genuinely attractive, even though its 503s run time is far from raw GBM’s 0.5s.

Third, calibration is in the same range, though the trees are ahead. Trained at the natural rate, the trees also produce well-matched probabilities (ECE 0.010), and TabFM comes in behind at 0.022, roughly double but still small. Neither is badly off.

And the three boosters converge between 0.7885 and 0.7891. That suggests the ceiling on this data is around 0.79: changing the model, adding features, and tuning don’t lift you past it. Neither side got above it.

To sum up: at least in this experiment, TabFM didn’t beat a well-built GBM, but it came close with no effort. And you have to keep in mind that the conditions weren’t the same.

Limits of this experiment

These results hold only under a few conditions.

It’s one dataset. A single Taiwan credit-card set can’t guarantee generalization. On other loss data the ranking could change.
The signal is limited. With 23 features and a 0.79 ceiling, there wasn’t much room for a model to pull ahead in the first place. On data with many features and rich signal it could come out differently.
One seed, and I couldn’t do out-of-time validation. With no trustworthy time column I split by random stratification, but as I stressed in Part 3, a real credit model is more rigorously validated by splitting on time.
TabFM ran on an 8GB GPU. Because of that I couldn’t properly run the ensemble setup noted in the model limits below, so the TabFM numbers in the table above should be read as a lower bound.

Limits of the model itself

Leaving the experiment aside, here’s what gets in the way when you try to bring TabFM into practice.

It’s a black box. Being a foundation model, there are no coefficients and no clear rules. For credit underwriting, where you have to disclose the reason for a decision and explain it to regulators (Part 4), it’s hard to use as-is. You can bolt on post-hoc explanations like SHAP, but the model itself doesn’t become the explanation the way a scorecard does.
It needs a high-end GPU. The weights alone are 6.5GB. Without a GPU, inference is more than ten times slower, and the paper preset (ensemble, large context) needs a 16GB-plus GPU to run properly. That’s a sharp contrast with a GBM, which runs fine on a single CPU.
There’s a ceiling on data and feature size. In-context learning reads the whole training set in like a prompt, but the cost of attention grows with the square of the context length. So credit data with millions of rows, or with very many features, is hard to load as-is. That’s exactly why this family of models was built for small- to mid-sized tables in the first place.
Inference is heavy. A saved GBM scores in one line, but TabFM re-reads the training data every time it predicts. In high-volume real-time scoring, that cost becomes a burden.
It’s still new. As a just-released model, it has no track record in practice and no cases of regulatory acceptance.
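The quadratic-attention point in the list above can be made concrete with back-of-the-envelope arithmetic. The sketch counts pairwise attention scores if every context row attends to every other row, and converts that to memory at 2 bytes per score (half precision, an assumption). It is a naive upper-bound picture for one head in one layer, not a model of TabFM's actual memory use: memory-efficient attention kernels avoid storing the full score matrix, although the compute still scales with the square of the context.

```python
def naive_attention_entries(n_context):
    """Number of pairwise scores when every context row attends to every row."""
    return n_context * n_context


def naive_attention_gib(n_context, bytes_per_entry=2):
    """Memory in GiB to hold that score matrix once (one head, one layer)."""
    return naive_attention_entries(n_context) * bytes_per_entry / 2**30


# 30,000 rows is the size of the Taiwan dataset; 1,000,000 is a typical
# order of magnitude for a production credit portfolio.
small = naive_attention_gib(30_000)
large = naive_attention_gib(1_000_000)
growth = naive_attention_entries(1_000_000) / naive_attention_entries(30_000)
```

Going from 30,000 to 1,000,000 rows multiplies the row count by about 33 but the pairwise work by about 1,100, which is why context size, not model size, becomes the binding limit on large portfolios.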

What’s next, and try it yourself

TabFM is available on Hugging Face now, and within a few weeks it should be callable from a single line of SQL in BigQuery. The barrier to entry drops that much, so if you have a big GPU or the BigQuery integration opens up, it’d be worth running the ensemble preset properly and re-measuring on larger data. I’m especially curious how it does on small-sample data, feature-rich data, and time-based validation. This piece’s conclusion is only the story of one Taiwan credit-card round, after all.

Here’s my conclusion. In this one experiment GBM came out slightly ahead, but I didn’t compete under the conditions in which the paper claimed its win (large, diverse data; an ensemble preset; ample GPU), so I can’t declare that it “can’t win.” What is certain is this: it’s a remarkably low-effort baseline that lands within 0.4pp of the best tuned model with no features and no tuning. For prototyping or a first baseline it’s attractive right now, and for a production credit model that needs peak performance, accurate probabilities, and explanations, a well-tuned GBM is still the safe choice for now. It’s too early to say Part 3’s conclusion has fallen, but there’s now a candidate that comes close with no effort.

Appendix: code and data

Data: UCI Default of Credit Card Clients (Taiwan), public data
TabFM source: Google Research blog · Hugging Face · GitHub (the architecture diagram and benchmarks are in the source)
Code: github.com/HangilKim11/blog-research/tree/main/tabfm-credit (Korean and Japanese notebooks)
Reproducibility: stratified 5-fold
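For anyone reproducing the split without the notebooks at hand, stratified k-fold simply means each fold keeps roughly the overall default rate. Below is a dependency-free sketch of the idea; in practice scikit-learn's `StratifiedKFold` is the usual tool. The function name and the seed handling are illustrative, not taken from the linked code.

```python
import random


def stratified_kfold(labels, n_splits=5, seed=0):
    """Yield (train_indices, test_indices) with class ratios kept per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        # Shuffle the indices of one class, then deal them out round-robin.
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


# Made-up labels with a 22% positive rate, mirroring the dataset's default rate.
demo_labels = [1] * 22 + [0] * 78
demo_splits = list(stratified_kfold(demo_labels, n_splits=5, seed=0))
```

Each test fold here holds 4 or 5 of the 22 positives, so every fold sees close to the same 22% rate. A time-based split would be the stricter check, but as noted in the limits it wasn't possible on this data.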