[Basics] Part 0. 7 ways finance data science differs from ordinary ML

I’m not a grizzled veteran of this field. I worked as a manufacturing engineer, crossed over into finance, and these days I work as a data scientist on the credit side. So please read this less as “here is the right answer” and more as a set of notes: the things that tripped me up when I got here, the moments of “wait, I did it by the book, so why does it keep coming out wrong?”

The funny part is that it wasn’t just me. People who are perfectly good at the whole arc of ordinary ML, from building a model to evaluating it, make the same kinds of slips when they land in credit underwriting. The validation metrics look great but the model underperforms in production; accuracy is 99% and nobody is happy; you squeeze out another 0.01 of performance and the risk team blocks the release.

That isn’t really a skill problem. Finance, and credit underwriting especially, isn’t “applying ML to financial data.” It’s a field that runs on somewhat different rules. And almost everything this series will cover — reject inference, causal inference, calibration, validation, fairness — ultimately rests on those rules.

1. Selection bias is the default

The training data we hold has one big hole in it: we only ever see the repayment outcome of the customers we approved. Whether the ones we turned down would have paid or defaulted, we can never know. They were never issued a card in the first place.

Ordinary ML usually assumes the data represents the population. In credit underwriting, that assumption is broken from the start. The training data consists of customers who were approved in the past, yet the population the model actually has to judge is everyone who applies, including the kind of applicant the old policy would have turned down. They are two different populations.

This one fact causes more trouble than you would expect. Because rejected customers have no repayment outcomes at all, the model can’t learn anything about the region it rejected, and it inherits the bias of the past approval policy wholesale. That’s why, in this field, reject inference and causal inference aren’t exotic techniques; they’re table stakes. (I’ll give each of them its own dedicated piece later.)
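Here is a toy simulation of that gap, strictly as a sketch. Everything in it is made up for illustration: the applicant pool, the 10% approval cutoff, and the assumption that the old policy could see each applicant’s true default probability (a real policy only ever sees a noisy score).

```python
import random

random.seed(0)

# Hypothetical applicant pool: each applicant has a true default probability.
applicants = [random.uniform(0.0, 0.30) for _ in range(100_000)]

# Hypothetical past policy: approve only applicants whose true PD is under 10%.
approved = [p for p in applicants if p < 0.10]

def simulate_bad_rate(pds):
    """Draw one default outcome per customer and return the observed bad rate."""
    defaults = sum(1 for p in pds if random.random() < p)
    return defaults / len(pds)

# What the training data shows: outcomes for approved customers only.
approved_bad_rate = simulate_bad_rate(approved)

# What the model will actually face: all applicants. In real life this number
# is never observable, because rejected applicants have no outcomes.
population_bad_rate = simulate_bad_rate(applicants)

print(f"bad rate among approved customers: {approved_bad_rate:.3f}")
print(f"bad rate among all applicants:     {population_bad_rate:.3f}")
```

The approved sample looks far safer than the full applicant pool, and a model trained only on it never sees a single example from the risky region.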

2. Time flows one way, and models age

If you have ever shuffled the data at random and run K-fold, you were quietly peeking at the future: every training fold contained applications dated later than some of the ones it was validated on.

Credit data flows along a timeline. A model trained on 2024 applicants scores 2026 customers. In between, the economy shifts, rates rise, customer behavior and products change. The distribution drifts. Random K-fold blends past and future, smuggling into validation the kind of information you could never actually have at decision time.

So the baseline validation in finance is out-of-time (OOT): evaluate on a period later than the one you trained on. After deployment, you keep monitoring how far the distribution has moved and how customers change as time passes. A model starts aging the moment it ships.
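A minimal sketch of both ideas, under stated assumptions. The records, the cutoff date, and the bin edges are invented, and the drift measure I use here, the population stability index (PSI), is one common choice that I am swapping in for illustration, not the only way to monitor drift.

```python
import math
from datetime import date

# Hypothetical records: (application_date, model_score). Values are made up.
records = [
    (date(2023, 1, 15), 0.02),
    (date(2023, 6, 10), 0.03),
    (date(2023, 11, 5), 0.05),
    (date(2024, 2, 20), 0.08),
    (date(2024, 7, 1), 0.12),
    (date(2024, 10, 30), 0.15),
]

def out_of_time_split(rows, cutoff):
    """Train on applications before `cutoff`; hold out everything on or after it."""
    train = [r for r in rows if r[0] < cutoff]
    oot = [r for r in rows if r[0] >= cutoff]
    return train, oot

def psi(expected, actual, edges, eps=1e-6):
    """Population stability index of `actual` against `expected` over fixed bin edges."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # The bin index is the number of edges the value has reached.
            counts[sum(1 for e in edges if v >= e)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]
    exp_shares, act_shares = shares(expected), shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_shares, act_shares))

train, oot = out_of_time_split(records, cutoff=date(2024, 1, 1))
train_scores = [score for _, score in train]
oot_scores = [score for _, score in oot]

drift = psi(train_scores, oot_scores, edges=[0.04, 0.10])
print(len(train), len(oot), round(drift, 3))
```

The split never lets a later application leak into training, and a PSI of zero means the two score distributions fall into the bins in identical proportions. Any threshold for "too much drift" is a policy choice, so I have left it out.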

3. “Who is riskier” isn’t enough — you need “exactly what percent”

Ordinary classification usually just needs the ranking to be right. Line people up by who is riskier, and AUC measures how well you did.

Credit can’t stop there. You need an absolute probability, a calibrated PD (probability of default). You need to be able to say “this customer’s default probability is 3.2%” and have that number hold up, in order to price for risk (risk-based pricing), to set aside provisions, and to compute expected loss. Ranking alone buys you none of that.

So a certain situation is quietly common in credit: a model with great AUC and wrong PDs. Discrimination and calibration are different axes, and you have to earn both. (I’ve set aside a whole piece just on calibration. People skip it more often than you’d think.)
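To see that the two axes really are separate, here is a small simulation, as a sketch only. The customers and their true PDs are simulated, and the "inflated" model is a hypothetical one that I build by taking the square root of the true PD, which keeps the ranking intact but overstates every probability.

```python
import math
import random

random.seed(1)

# Simulated customers with a true PD; outcomes are drawn from that PD.
true_pd = [random.uniform(0.01, 0.20) for _ in range(2000)]
y = [1 if random.random() < p else 0 for p in true_pd]

def auc(labels, scores):
    """Probability that a random defaulter outscores a random non-defaulter."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical miscalibrated model: same ordering, inflated probabilities.
inflated_pd = [math.sqrt(p) for p in true_pd]

auc_true = auc(y, true_pd)
auc_inflated = auc(y, inflated_pd)

observed_rate = sum(y) / len(y)
mean_true = sum(true_pd) / len(true_pd)
mean_inflated = sum(inflated_pd) / len(inflated_pd)

print(f"AUC, true PD:     {auc_true:.4f}")
print(f"AUC, inflated PD: {auc_inflated:.4f}")
print(f"observed default rate: {observed_rate:.3f}")
print(f"mean predicted, true PD:     {mean_true:.3f}")
print(f"mean predicted, inflated PD: {mean_inflated:.3f}")
```

The two AUCs match because AUC only looks at ordering, while the inflated model’s average prediction lands far above the observed default rate. Any pricing or provisioning built on it would be off by that same margin.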

4. Costs are asymmetric, arrive much later, and come in currency

Accuracy counts every error the same. In credit, the weight of an error is nothing like equal.

Approve one good customer and you earn a margin (a few thousand yen). One default costs you LGD × EAD (hundreds of thousands of yen), where LGD is loss given default and EAD is exposure at default. One side is tens to hundreds of times heavier. So what you optimize isn’t accuracy; it’s expected profit and expected loss.

expected profit = (1 − PD) × margin − PD × LGD × EAD

And the answer arrives much later. Whether a customer you approve today defaults or not is only settled 12 to 24 months down the line. Labels arriving this late clash hard with an ML mindset used to fast feedback. You have to keep stacking decisions without knowing the outcomes.
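To make the expected-profit formula above concrete, here is a rough worked example. The margin, LGD, and EAD values are hypothetical, picked only to match the orders of magnitude mentioned earlier. The break-even PD comes from setting expected profit to zero and solving for PD, which gives margin / (margin + LGD × EAD).

```python
def expected_profit(pd_, margin, lgd, ead):
    """Expected profit per approved customer: (1 - PD) * margin - PD * LGD * EAD."""
    return (1 - pd_) * margin - pd_ * lgd * ead

def break_even_pd(margin, lgd, ead):
    """PD at which expected profit is zero, from solving the formula for PD."""
    return margin / (margin + lgd * ead)

# Hypothetical numbers in yen, for illustration only.
MARGIN = 5_000
LGD = 0.8
EAD = 500_000

cutoff = break_even_pd(MARGIN, LGD, EAD)
for pd_ in (0.005, 0.01, 0.02, 0.05):
    profit = expected_profit(pd_, MARGIN, LGD, EAD)
    print(f"PD={pd_:.3f}  expected profit={profit:>10,.0f}")
print(f"break-even PD = {cutoff:.4f}")
```

With these made-up numbers the break-even PD is about 1.2%, so approving a customer with a 2% PD already loses money in expectation. That is how a model can look very accurate and still lose money.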

5. Stability beats peak performance

In an ML competition such as Kaggle, squeezing out another 0.001 of AUC is a virtue. In a production credit model, it is often a liability.

A model that turned unstable chasing one more drop of performance soon becomes a cost in operations: scores that lurch when the inputs barely move, results you can’t reproduce, weird stretches where “higher income lowers the score.” More often than not, operational stability, reproducibility, and monotonicity matter more than another decimal place of performance. It is one reason logistic regression survived as the scoring standard well into the GBM era.
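The “higher income lowers the score” failure is easy to test for. Below is a minimal sketch of such a check. Both scoring functions, the applicant, and the income grid are hypothetical, and holding the other features fixed while sweeping one is just one simple way to probe monotonicity.

```python
import math

def is_monotone_increasing(score_fn, base, feature, grid):
    """True if the score never drops as `feature` moves up through `grid`."""
    scores = [score_fn({**base, feature: v}) for v in grid]
    return all(later >= earlier for earlier, later in zip(scores, scores[1:]))

def linear_score(x):
    # Hypothetical scorecard-like model: higher income and lower utilization score better.
    return 600 + 20 * math.log(x["income"]) - 80 * x["utilization"]

def wiggly_score(x):
    # Hypothetical overfit model with a dip in the middle of the income range.
    return linear_score(x) - 30 * math.exp(-((x["income"] - 5.0) ** 2))

# Hypothetical applicant; income is in millions of yen.
base = {"income": 4.0, "utilization": 0.3}
income_grid = [2.0 + 0.25 * i for i in range(25)]  # 2.0 up to 8.0

print(is_monotone_increasing(linear_score, base, "income", income_grid))  # True
print(is_monotone_increasing(wiggly_score, base, "income", income_grid))  # False
```

The second model might well win on a validation metric, but it would still tell some customers that earning more made them riskier.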

6. Interpretability isn’t a choice — it’s a duty

In other fields, being able to explain “why did this prediction come out?” is a nice bonus. In credit, a model that can’t do so is often illegal to use, or simply un-deployable.

Adverse-action notices, explanations to regulators, internal governance: all of them demand that you explain “why this score.” So a black box isn’t cool; it is a risk in itself. That’s why practitioners favor structures where the reasons fall out naturally, like WOE (weight of evidence) and scorecards, and why, even when they reach for boosting, they lay down something like SHAP alongside it to pull the reasons back out.
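As a small illustration of why WOE reads so naturally, here is a sketch of the calculation. The income bands and the good/bad counts are invented, and I am assuming the convention WOE = ln(share of goods / share of bads). Some texts flip the sign, so check which one your team uses.

```python
import math

# Hypothetical good/bad counts per income band; the numbers are made up.
bins = {
    "low income": {"good": 200, "bad": 40},
    "mid income": {"good": 500, "bad": 35},
    "high income": {"good": 300, "bad": 5},
}

def weight_of_evidence(bins):
    """WOE per bin as ln(share of goods / share of bads). Assumes no empty cells."""
    total_good = sum(b["good"] for b in bins.values())
    total_bad = sum(b["bad"] for b in bins.values())
    return {
        name: math.log((b["good"] / total_good) / (b["bad"] / total_bad))
        for name, b in bins.items()
    }

woe = weight_of_evidence(bins)
for name, value in woe.items():
    print(f"{name:<12} WOE = {value:+.3f}")
```

Under this convention a negative WOE marks a band that holds more than its share of defaulters, so “your income band scored below average” falls straight out of the table as a reason.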

7. Regulatory and governance overhead is always there

Finally, you can’t just deploy a model freely.

Finishing the model isn’t the end. Model risk management (MRM), independent validation, documentation, and audit trails are part of the development process. Developers and validators are kept separate, and a new model usually runs in shadow mode alongside the old one for a good while before it ever touches a real decision. The startup instinct of “ship the high-performing model fast” doesn’t travel well here. There’s a reason things are slow: a single model flows all the way through to provisioning and capital calculations.

(Working in Japan makes this even more tangible. Card issuance and limits are bound by the “payment-capacity estimate” (支払可能見込額) duty under the Installment Sales Act (割賦販売法), so the model becomes a legal basis. I’ll cover this in the regulation piece.)

Isn’t AI going to do all of this for us?

I get this question a lot lately: with generative AI and agents advancing this fast, do you really need to learn this modeling stuff? My honest answer is that you need it more, not less — at least for now.

The seven things above aren’t a particular algorithm; they’re the structure of the problems in this field. Unobserved counterfactuals, data that flows in time order, asymmetric costs, absolute probabilities, stability, a duty to explain, regulation. Bolting an LLM onto that doesn’t make these problems disappear. If anything, you need someone who knows they are there, or an auto-generated model will be confidently wrong.

Points 6 and 7 are the crux. You have to give reasons for a rejection and validate the model independently, and the model’s output becomes the basis for provisioning and capital. A black-box model is structurally blocked by these requirements. That’s why generative AI can’t swallow credit underwriting whole; instead, someone who understands “why it has to be explainable and how you validate it” stays in the seat that judges what the AI produced.

Some things do change, of course. Repetitive coding and basic analysis increasingly become the AI’s job. So the center of gravity of the work shifts, from the ability to hand-craft models toward the judgment to frame a problem correctly, validate it, and oversee it. The latter is exactly what this series is about.

So, what does skill look like in this field

Squeeze the seven into a single line:

Finance data science isn’t a “prediction-accuracy contest.” It’s the work of estimating an unobserved counterfactual — in an environment where time flows and costs are asymmetric — in a way that stays explainable and stable.

Metrics and scorecards are like an entry ticket. The real difference in skill is decided in selection bias, causality, validation, and governance.

In this series I plan to dig into these seven, one at a time: how you actually solve reject inference, why everyone gets calibration wrong, why causal inference sits at the heart of underwriting, how to validate so a model survives in production. Let’s get started in the next piece.