[Deep Dive] Where do rejected applicants go? Reject inference and rejectkit

In Part 4 I touched on reject inference briefly. The point was that a model built on approved customers alone becomes biased once you apply it to the full applicant pool. This post is a record of implementing reject inference in code. I bundled the classic techniques together and, above all, built a library that also evaluates “whether this correction actually helps on my data.” It’s called rejectkit, and it’s now on PyPI and GitHub.

You will never know how the rejected applicants turned out

Picture a loan review. An applicant arrives, you look at their information, and you either approve or reject. For the ones you approve, you learn the outcome a few months later: pay well and it’s good, fall behind and it’s a default. But the ones you reject were never given a loan, so there’s nothing to repay. You will never know how they would have turned out.

That’s where the trouble starts. To build a model that predicts whether next year’s applicants will fall behind, you need labeled data, and the only labels you have belong to the people you approved. But approved customers are a skewed sample: they got through precisely because they looked fine in the first place. A model trained on them alone is fit to a distribution that’s out of step with the full set of applicants who actually show up at your door.

It’s the same as a doctor who concludes “my treatment works great” from the patients who came back. The ones it didn’t help never returned, so they never make it into the data. This is exactly the selection bias from Part 0, and it is the problem that the reject inference from Part 4 is meant to correct.
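To make the bias concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration (the single risk feature, the logistic default curve, the 60% approval rule); it is not data from this series. It only shows that the default rate among approved applicants understates the rate in the full pool.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical applicant pool: one observed risk score, higher = riskier.
risk = rng.normal(size=n)
p_default = 1.0 / (1.0 + np.exp(-(risk - 2.0)))  # assumed true default probability
default = rng.random(n) < p_default

# Illustrative policy: approve the 60% of applicants with the lowest risk.
approved = risk <= np.quantile(risk, 0.6)

rate_all = default.mean()                 # what you would like to model
rate_approved = default[approved].mean()  # what your labels actually show
print(f"default rate, all applicants: {rate_all:.3f}")
print(f"default rate, approved only : {rate_approved:.3f}")
```

The approved-only rate comes out lower because the policy removed the riskiest applicants before any outcome could be observed.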

And yet the Python tooling was empty

Reject inference has been a standard topic in credit risk for decades. And yet, when I went looking, the Python tooling was empty.

R has a package called scoringTools, but it isn’t even on CRAN — only on GitHub. Python’s scorecard libraries (scorecardpy, optbinning, toad) do the WOE/IV binning and logistic scorecards from Part 4 well, but they don’t touch reject inference at all. What was left was one-off research code written for papers.

So I built rejectkit. There were two goals. One was to bundle eight classic techniques behind a single scikit-learn-style API. The other, and the more important one, was to provide a benchmark that measures “whether this correction actually helps on my data.” The truth is that “the value of reject inference is doubtful, and no technique is always superior” has been the field’s long-standing conclusion. A paper literally titled “Can reject inference ever work?” (Hand and Henley) was already out in 1993. So the real message of this library isn’t “trust it and use it,” it’s “measure first.”

Eight techniques, and the assumption that decides everything

The ways of bringing rejects back into training fall into roughly three families.

First, methods that manufacture labels for the rejects and fill them in (the augmentation family). Score the rejects with the approval model and cut at a threshold to assign labels (simple); split each reject into two weighted rows, a good version and a bad version (fuzzy); multiply each score band’s default rate by a weight, which encodes as a number the practitioner’s assumption that “a reject is worse than an approved customer at the same score” (parcelling); or borrow the default rate of nearby approved customers with similar features (extrapolation).
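As a rough sketch of what two of these look like in code, the functions below build a fuzzy-augmented training set and draw parcelling-style labels using plain scikit-learn and NumPy. The helper names fuzzy_augment and parcel_labels, the number of score bands, and the toy data are hypothetical choices for illustration, not rejectkit's internals.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fuzzy_augment(X_accept, y_accept, X_reject):
    """Each reject becomes a bad row (weight p) and a good row (weight 1 - p)."""
    base = LogisticRegression(max_iter=1000).fit(X_accept, y_accept)
    p_bad = base.predict_proba(X_reject)[:, 1]  # approval model's guess
    n_rej = len(X_reject)
    X = np.vstack([X_accept, X_reject, X_reject])
    y = np.concatenate([y_accept, np.ones(n_rej, dtype=int), np.zeros(n_rej, dtype=int)])
    w = np.concatenate([np.ones(len(X_accept)), p_bad, 1.0 - p_bad])
    return X, y, w


def parcel_labels(score_accept, y_accept, score_reject, uplift=1.3, n_bands=5, seed=0):
    """Draw reject labels per score band at (approved default rate * uplift)."""
    rng = np.random.default_rng(seed)
    # Interior band edges taken from the approved scores.
    edges = np.quantile(score_accept, np.linspace(0, 1, n_bands + 1))[1:-1]
    band_accept = np.digitize(score_accept, edges)
    band_reject = np.digitize(score_reject, edges)
    y_reject = np.zeros(len(score_reject), dtype=int)
    for b in range(n_bands):
        in_band = band_accept == b
        rate = y_accept[in_band].mean() if in_band.any() else y_accept.mean()
        p = min(1.0, rate * uplift)  # assumed: rejects are worse by a fixed factor
        mask = band_reject == b
        y_reject[mask] = rng.random(mask.sum()) < p
    return y_reject


# Toy data (made up): approved rows with outcomes, rejects with features only.
rng = np.random.default_rng(0)
X_accept = rng.normal(size=(500, 3))
y_accept = (X_accept[:, 0] + rng.normal(size=500) > 1.0).astype(int)
X_reject = rng.normal(loc=0.5, size=(200, 3))

# Fuzzy: refit on the augmented rows using the fractional weights.
X_aug, y_aug, w_aug = fuzzy_augment(X_accept, y_accept, X_reject)
fuzzy_model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug, sample_weight=w_aug)

# Parcelling: score both groups with the approval model, then draw labels.
base = LogisticRegression(max_iter=1000).fit(X_accept, y_accept)
y_parcel = parcel_labels(
    base.predict_proba(X_accept)[:, 1], y_accept,
    base.predict_proba(X_reject)[:, 1], uplift=1.3,
)
```

Both helpers start from the approval model's own scores, so whatever bias that model carries flows straight into the manufactured labels.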

Second, methods that correct without manufacturing labels. Train a selection model that separates approvals from rejects and weight each approved customer by the inverse of their estimated approval probability (IPW reweighting, the same logic as the bias correction from Part 2), or take econometrics’ Heckman control function and fold it into the classification problem as an extra feature.
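The reweighting idea fits in a few lines of scikit-learn. The sketch below is a generic inverse-probability-weighting recipe, not rejectkit's code; the helper name ipw_fit, the clipping floor of 0.05, and the toy data are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def ipw_fit(X_accept, y_accept, X_reject, clip=0.05):
    """Fit an outcome model on approved rows, weighted by 1 / P(approved | x)."""
    X_all = np.vstack([X_accept, X_reject])
    was_approved = np.concatenate([np.ones(len(X_accept)), np.zeros(len(X_reject))])

    # Selection model: approved vs. rejected, from features only.
    selection = LogisticRegression(max_iter=1000).fit(X_all, was_approved)
    p_approve = selection.predict_proba(X_accept)[:, 1]

    # Clip tiny probabilities so a few rows cannot dominate the fit.
    weights = 1.0 / np.clip(p_approve, clip, 1.0)

    outcome = LogisticRegression(max_iter=1000)
    outcome.fit(X_accept, y_accept, sample_weight=weights)
    return outcome, weights


# Toy data (made up): approval depends on the first feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = (X[:, 0] + rng.normal(size=2000) > 1.0).astype(int)
approved = X[:, 0] + rng.normal(size=2000) < 0.5

model, weights = ipw_fit(X[approved], y[approved], X[~approved])
```

Approved customers who looked like rejects get the largest weights, which is how the approved sample is stretched to resemble the full applicant pool.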

Third, semi-supervised learning. Leave the rejects as unlabeled data, attach pseudo-labels only to the high-confidence ones, and repeat the retraining loop (self-training).
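A bare-bones version of that loop might look like the following. It is a sketch written for illustration, not rejectkit's implementation; the helper name self_train, the confidence threshold of 0.9, and the round limit are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def self_train(X_accept, y_accept, X_reject, threshold=0.9, max_rounds=5):
    """Pseudo-label confident rejects, add them to training, refit, repeat."""
    X_lab, y_lab = X_accept.copy(), y_accept.copy()
    X_unl = X_reject.copy()
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

    for _ in range(max_rounds):
        if len(X_unl) == 0:
            break
        proba = model.predict_proba(X_unl)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # nothing left that the model is sure about
        pseudo = model.classes_[proba.argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unl[confident]])
        y_lab = np.concatenate([y_lab, pseudo[confident]])
        X_unl = X_unl[~confident]
        model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

    n_pseudo = len(X_lab) - len(X_accept)
    return model, n_pseudo


# Toy data (made up).
rng = np.random.default_rng(0)
X_accept = rng.normal(size=(500, 3))
y_accept = (X_accept[:, 0] + rng.normal(size=500) > 1.0).astype(int)
X_reject = rng.normal(loc=0.5, size=(200, 3))

model, n_pseudo = self_train(X_accept, y_accept, X_reject)
print(f"rejects that received a pseudo-label: {n_pseudo} of {len(X_reject)}")
```

Rejects that never reach the confidence threshold simply stay out of training, so the threshold controls how much the loop trusts the model's own guesses.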

Which technique works comes down, in the end, to a single assumption: what the rejection depends on.

If rejection depends only on observed features (MAR), a well-built model actually isn’t all that biased.
If rejection also depends on the unobserved outcome (MNAR) — say a past reviewer screened out bad applicants using information that isn’t in the data — then a model that saw only approved customers is the most biased of all.

And here’s an important limitation. The augmentation family leans on the biased approval model to guess the rejects’ labels, so it can’t pull itself out of a strong MNAR situation on its own. Which is why the first question to ask isn’t “which technique should I use” but “does reject inference even help in my situation right now.”

How do you measure whether it helps

The fundamental difficulty of reject inference is this: rejects have no ground truth, so you can’t grade the correction directly.

rejectkit sidesteps this. It takes data where you know every outcome, deliberately picks some rows to play the role of rejects, and hides (masks) their labels. Then it grades how much of the lost performance each technique recovers, scored on an untouched test set. The “Masked” in MaskedRejectBenchmark refers to this label hiding.
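A stripped-down version of that masking idea, written with scikit-learn only, might look like the following. This is a sketch under stated assumptions, not rejectkit's implementation: the synthetic data, the outcome-dependent masking rule, and the 60% approval rate are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 20_000

# Fully labeled synthetic data (assumed): every outcome is known.
X = rng.normal(size=(n, 4))
logit = X @ np.array([1.0, -0.8, 0.5, 0.0]) - 1.5
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Hold out a test set that the masking never touches.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Masking rule (assumed): rejection depends on a feature AND the outcome.
latent = 0.5 * X_tr[:, 0] + 1.0 * y_tr + rng.normal(size=len(y_tr))
approved = latent <= np.quantile(latent, 0.6)  # ~60% play "approved"

# Oracle sees every training label; naive sees only the approved rows.
oracle = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
naive = LogisticRegression(max_iter=1000).fit(X_tr[approved], y_tr[approved])

auc_oracle = roc_auc_score(y_te, oracle.predict_proba(X_te)[:, 1])
auc_naive = roc_auc_score(y_te, naive.predict_proba(X_te)[:, 1])
print(f"oracle AUC: {auc_oracle:.3f}   naive AUC: {auc_naive:.3f}")
```

Any reject-inference technique can then be trained on the approved labels plus the masked rows' features and scored on the same test set. How far naive falls below oracle depends on how strongly the masking rule leans on the outcome.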

The key metric is auc_recovery. Zero means on par with the naive model that used approved customers only; 1 means recovery all the way to the oracle that used every label; negative means it made things worse.
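One plausible way to write that metric down is as a ratio of AUC gaps. The formula and the min_gap guard below are assumptions inferred from the description, so check the library source for the exact definition; the numbers in the usage lines are invented.

```python
import math


def auc_recovery(auc_method, auc_naive, auc_oracle, min_gap=1e-3):
    """Share of the naive-to-oracle AUC gap that a method wins back.

    0 -> no better than naive, 1 -> as good as oracle, negative -> worse
    than naive. Returns NaN when naive is already about as good as oracle,
    because there is no gap left to recover (min_gap is an assumed cutoff).
    """
    gap = auc_oracle - auc_naive
    if gap < min_gap:
        return math.nan
    return (auc_method - auc_naive) / gap


# Invented AUC values, purely to show the three regimes.
same_as_naive = auc_recovery(0.60, auc_naive=0.60, auc_oracle=0.80)   # 0.0
same_as_oracle = auc_recovery(0.80, auc_naive=0.60, auc_oracle=0.80)  # 1.0
worse = auc_recovery(0.55, auc_naive=0.60, auc_oracle=0.80)           # negative
no_room = auc_recovery(0.70, auc_naive=0.70, auc_oracle=0.7001)       # NaN
```

Normalizing by the gap makes runs comparable across datasets, at the cost of being undefined when the gap is close to zero.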

You pick how rejections are generated from three options: mar, which depends only on features; mnar, which also depends on the hidden outcome (the harshest); and cutoff, which approves in order of lowest predicted risk (closest to a real credit policy).
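The difference between the three is easiest to see side by side. The function below is a hypothetical stand-in, not rejectkit's generator: the name simulate_approval, the noise model, and the outcome weight of 2.0 are assumptions chosen to make the contrast visible.

```python
import numpy as np


def simulate_approval(risk_score, y, selection="mar", accept_rate=0.6, seed=0):
    """Return a boolean mask of rows that play the role of approved."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=len(y))
    if selection == "mar":
        latent = risk_score + noise            # features (plus noise) only
    elif selection == "mnar":
        latent = risk_score + 2.0 * y + noise  # also peeks at the outcome
    elif selection == "cutoff":
        latent = risk_score                    # strict ranking by risk
    else:
        raise ValueError(f"unknown selection: {selection!r}")
    # Approve the accept_rate fraction with the lowest latent value.
    return latent <= np.quantile(latent, accept_rate)


# Toy pool (made up): one risk score and a default flag that depends on it.
rng = np.random.default_rng(1)
risk = rng.normal(size=20_000)
y = (rng.random(20_000) < 1.0 / (1.0 + np.exp(-(risk - 1.5)))).astype(int)

masks = {name: simulate_approval(risk, y, name) for name in ("mar", "mnar", "cutoff")}
for name, mask in masks.items():
    print(f"{name:6s} approved={mask.mean():.2f} "
          f"default rate among approved={y[mask].mean():.3f}")
```

Under the mnar rule the approved group looks cleaner than its features justify, which is the situation a model trained only on approvals cannot detect.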

So does it help: synthetic data vs. real data

I ran it on clean synthetic data with MNAR. The oracle AUC is 0.820 and the naive AUC is 0.749, but the reject-inference techniques barely clear naive and several actually lose ground. fuzzy scrapes break-even at +0.001, parcelling degrades to -0.116, and only Heckman just about holds the naive line. When selection depends on the outcome (MNAR), it plays out exactly as theory predicts. Reject inference is no free lunch.

So let’s check on real data. I applied it as-is to the Kaggle Home Credit dataset (about 300,000 rows, 8% default rate).

Model	AUC	auc_recovery
oracle (all labels)	0.741	1.00
naive (approved only)	0.568	0.00
parcelling	0.582	+0.084
extrapolation	0.582	+0.079
heckman	0.580	+0.071

Here it’s the opposite. Ignore the rejects and you collapse from 0.74 to 0.57, and reject inference actually recovers 7–8% of that loss. What was nearly useless on synthetic data is worth having on real data. And even with the same dataset, switch to mar or cutoff and naive is already oracle-grade, so there’s no room left to recover (auc_recovery comes back as NaN).

The conclusion lands in one place. Whether reject inference helps depends on the data, which is why you have to confirm it with a benchmark before you use it.

How to use it

Fitting reject inference onto a single model goes like this.

from sklearn.linear_model import LogisticRegression
from rejectkit import RejectInferenceClassifier

# X_accept, y_accept: approved customers and outcomes (1=default) / X_reject: rejects (features only)
clf = RejectInferenceClassifier(
    estimator=LogisticRegression(max_iter=1000),
    method="parcelling",
    method_params={"uplift": 1.3},   # assume rejects are ~30% worse than the same score band
)
clf.fit(X_accept, y_accept, X_reject)
# X_new: new applicants to score (features only)
pd_bad = clf.predict_proba(X_new)[:, 1]   # predicted probability of default

Measuring whether it helps first takes only a couple of lines.

from rejectkit import MaskedRejectBenchmark

bench = MaskedRejectBenchmark(selection="mnar", accept_rate=0.6, random_state=0)
print(bench.compare(["fuzzy", "parcelling", "reweighting", "heckman"], X, y).round(4))

Inputs accept pandas, polars, and numpy alike. It also ships with diagnostics (per-feature PSI) and visualizations for seeing how far the approval and rejection distributions diverge.
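For reference, PSI itself is a short formula: bin a feature, compare the share of approved and rejected rows in each bin, and sum (a - e) * ln(a / e) over the bins. The sketch below uses the standard definition; the decile binning, the eps floor, and the function name psi are assumptions, and rejectkit's own diagnostics may bin differently.

```python
import numpy as np


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index of `actual` against `expected` for one feature."""
    # Interior bin edges from the expected (e.g. approved) sample's quantiles.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]
    e_frac = np.bincount(np.digitize(expected, edges), minlength=n_bins) / len(expected)
    a_frac = np.bincount(np.digitize(actual, edges), minlength=n_bins) / len(actual)
    # Floor empty bins so the logarithm stays finite.
    e_frac = np.clip(e_frac, eps, None)
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))


# Toy feature (made up): rejects are shifted one standard deviation upward.
rng = np.random.default_rng(0)
feature_approved = rng.normal(loc=0.0, size=5000)
feature_rejected = rng.normal(loc=1.0, size=3000)

psi_shifted = psi(feature_approved, feature_rejected)
psi_self = psi(feature_approved, feature_approved)
print(f"PSI approved vs rejected: {psi_shifted:.3f}")
```

A feature with a large PSI is one where approved and rejected applicants barely overlap, so any model trained on approvals is extrapolating there.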

Wrapping up

Train a credit model on approved-customer data alone and you get sample-selection bias. That’s the problem from Part 0 and Part 4.
rejectkit bundles eight classic techniques for correcting it behind one API.
The real differentiator is the benchmark that measures whether the correction helps on your data. Reject inference isn’t a cure-all and can even be harmful, so the core idea is to measure first rather than trust and use.
The same principle applies beyond credit, to any selection-bias setting where you only see outcomes for those who got through — hiring performance, or fraud detection where only the flagged cases get labeled.

And the most trustworthy answer to reject inference is, in fact, a different one entirely: an experiment that deliberately approves a small random slice to secure real outcomes. It hands you fact instead of an estimate. That thread continues into causal inference and experiments in Part 6.

Installing is one line.

pip install rejectkit

GitHub: github.com/HangilKim11/rejectkit
PyPI: pypi.org/project/rejectkit

Feedback, issues, and PRs welcome.