2026-02-05
7 min read
When exact matching fails, probabilistic record linkage weighs evidence like a chef recognizes a dish—not by a single ingredient, but by the whole picture.
Birria in Jalisco is goat, slow-braised in dried chiles and served in a bowl with consommé. In Tijuana, it's beef, crisped on a plancha and folded into a taco. At Chipotle, it's a menu item with queso. Three different presentations, three different proteins, three different contexts—but you recognize them as the same dish.
You're not matching on a single ingredient. You're weighing evidence across multiple dimensions: the chile base, the braising technique, the way it's served with its cooking liquid.
Entity resolution works the same way. When you have two database records—"José García, 123 Main St, DOB 1985-03-15" and "GARCIA, JOSE M., 123 Main Street, DOB 03/15/1985"—you need to determine if they represent the same person. No single field gives you certainty, but together, the evidence points strongly toward a match.
I solved this problem on a project in financial services: deduplicating customer records, linking transactions across systems, matching entities without shared keys. It's harder than it looks, and most engineers reach for exact matching first—which works until it doesn't.
The instinct is to write rules:
If name matches AND DOB matches AND address matches → same person
This is deterministic matching, and it breaks the moment it touches real-world data.
Consider what customer data actually looks like:
| Field | System A | System B |
|---|---|---|
| Name | José García | GARCIA, JOSE M. |
| Address | 123 Main St | 123 Main Street, Apt 2B |
| DOB | 1985-03-15 | 03/15/1985 |
| Email | jose.garcia@gmail.com | jgarcia85@yahoo.com |
Exact matching finds zero matches here. Every field differs in format, abbreviation, or completeness. But a human reviewer immediately recognizes this as likely the same person.
You could write transformation rules that normalize names to uppercase, expand abbreviations, and standardize date formats. This helps, but it's a game of whack-a-mole: every fix surfaces a new variant, like nicknames, transposed digits, or missing fields.
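A minimal sketch shows why normalization alone keeps breaking; the `normalize` helper here is hypothetical and deliberately crude:

```python
import re

def normalize(name: str) -> str:
    """Crude normalization: uppercase, strip non-letters, collapse spaces."""
    name = name.upper()
    name = re.sub(r"[^A-Z ]", "", name)   # drops accented characters entirely
    return re.sub(r" +", " ", name).strip()

# Casing is handled...
assert normalize("Jose Garcia") == normalize("JOSE GARCIA")
# ...but naive accent stripping mangles the name, and word order plus a
# middle initial still break equality:
print(normalize("José García"))       # "JOS GARCA"
print(normalize("GARCIA, JOSE M."))   # "GARCIA JOSE M"
```

Each of those failure modes can be patched individually, but the list of patches never ends.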
Deterministic matching forces a binary decision: match or no match. Reality is probabilistic—some pairs are definite matches, some definite non-matches, and a large gray zone requires weighing evidence.
Probabilistic record linkage flips the question. Instead of asking "do these records match?" it asks:
How likely is it that these records refer to the same entity, given the evidence?
Each field comparison contributes evidence:
| Comparison | Evidence |
|---|---|
| Exact name match | Strong positive |
| Fuzzy name match (Jaro-Winkler > 0.9) | Moderate positive |
| DOB matches | Strong positive |
| DOB off by one digit | Weak positive (likely typo) |
| Completely different DOB | Strong negative |
The accumulated evidence produces a match score.
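In code, accumulating evidence is just summing per-comparison weights. The weights below are made up for illustration; real systems estimate them from the data:

```python
# Hypothetical per-comparison weights (positive = evidence for a match,
# negative = evidence against). Real values are learned, not hand-picked.
WEIGHTS = {
    "name_exact":     6.0,   # strong positive
    "name_fuzzy":     3.0,   # moderate positive
    "dob_exact":      7.0,   # strong positive
    "dob_one_digit":  1.0,   # weak positive (likely typo)
    "dob_different": -8.0,   # strong negative
}

def match_score(evidence: list[str]) -> float:
    """Each field comparison adds or subtracts evidence; the sum is the score."""
    return sum(WEIGHTS[e] for e in evidence)

# Fuzzy name plus exact DOB accumulates to a high score...
print(match_score(["name_fuzzy", "dob_exact"]))      # 10.0
# ...while an exact name with a conflicting DOB nets out below zero:
print(match_score(["name_exact", "dob_different"]))  # -2.0
```

No single comparison decides the outcome; the total does.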
This mirrors how you recognize birria. If two recipes both have dried chiles, braised meat, and a rich cooking liquid served alongside, that's strong evidence they're the same dish. If one uses guajillo and the other uses ancho, that's weak variation—regional preference, not a different dish. If one serves it in a bowl and the other crisps it in a taco, that's presentation, not identity.
But if one is a quick stovetop stew without the slow braise? That's evidence you're looking at something else entirely—maybe carne guisada, but not birria.
The math behind probabilistic matching is the Fellegi-Sunter model, developed in 1969 and still the foundation of modern record linkage.
Fellegi-Sunter assigns each field comparison a weight based on two probabilities:
| Probability | Question |
|---|---|
| m-probability | If two records truly match, how often would this field comparison look like this? |
| u-probability | If two records don't match, how often would this field comparison look like this by chance? |
The ratio of these probabilities determines the weight. When a field agrees, its match weight is log2(m / u); when it disagrees, it's log2((1 - m) / (1 - u)). Summing these weights across fields gives the overall match score: positive weights push toward a match, negative weights away from one.
Here's the key insight: common values provide weaker evidence than rare values.
"María García" matching "María García" is less compelling than "Xiomara Tlapoyawa" matching "Xiomara Tlapoyawa." The first could happen by coincidence in any large dataset; the second almost certainly indicates the same person.
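You can see this directly in the weight formula. The m and u values below are invented for illustration; u is roughly the chance two non-matching records share the value by coincidence, which tracks how common the value is:

```python
from math import log2

def agreement_weight(m: float, u: float) -> float:
    """Fellegi-Sunter agreement weight: log2 of the m/u likelihood ratio."""
    return log2(m / u)

m = 0.95           # assumed: truly matching records agree on name 95% of the time
u_common = 0.01    # assumed: a common name appears in ~1% of records
u_rare = 0.00001   # assumed: a rare name appears in ~1 of 100,000 records

print(round(agreement_weight(m, u_common), 1))  # 6.6 bits of evidence
print(round(agreement_weight(m, u_rare), 1))    # 16.5 bits -- far stronger
```

Same m-probability, same agreement, but the rare name carries roughly ten more bits of evidence.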
Think of it like identifying a mole. If two recipes both contain chiles, that tells you almost nothing—every mole has chiles. But if both call for chocolate, banana, and chipotle in specific proportions, you're probably looking at variations of mole negro from Oaxaca.
The rare, specific ingredients carry more weight than the ubiquitous ones.
Splink handles all of this automatically. You define what fields to compare and how (exact match, fuzzy match, within a date range), and Splink estimates the m and u probabilities using unsupervised learning—no labeled training data required.
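To give a feel for what that configuration looks like, here is a sketch of a Splink-style settings object. The shape is illustrative only; the exact keys, comparison helpers, and blocking-rule syntax vary between Splink versions, so check the Splink docs before copying this:

```python
# Illustrative settings sketch (shape only, not a verified Splink config):
settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.dob = r.dob",            # only compare pairs that share a DOB
        "l.postcode = r.postcode",  # or a postcode
    ],
    "comparisons": [
        # e.g. a name comparison with exact and Jaro-Winkler levels,
        # and a DOB comparison with an "off by one digit" level
    ],
}
```

You declare what to compare and how; Splink's training step fills in the m and u probabilities.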
There's a computational catch. If you have a million records, comparing every pair means:
1,000,000 × 999,999 / 2 = 499,999,500,000 comparisons
That's 500 billion comparisons. Not feasible.
Blocking solves this by comparing only records that could plausibly match. You define rules like "same date of birth" or "same postcode," and only pairs that satisfy at least one rule are ever compared.
This is mise en place for data matching—organizing your workspace before you start cooking. A chef doesn't wander the entire kitchen looking for ingredients during service. They set up their station with everything they'll need in reach. Blocking sets up your comparison space so you're only looking where matches are likely to be found.
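A minimal sketch of the idea, with made-up records: group rows by a blocking key, then generate candidate pairs only within each group.

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "José García",     "dob": "1985-03-15"},
    {"id": 2, "name": "GARCIA, JOSE M.", "dob": "1985-03-15"},
    {"id": 3, "name": "Ana López",       "dob": "1990-07-02"},
    {"id": 4, "name": "Ana Lopez",       "dob": "1990-07-02"},
]

def candidate_pairs(records, key):
    """Group records by a blocking key; only pairs within a block are compared."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    for block in blocks.values():
        yield from combinations(block, 2)

# Blocking on DOB yields 2 candidate pairs instead of all 6:
pairs = list(candidate_pairs(records, key=lambda r: r["dob"]))
print(len(pairs))  # 2
```

At four records the savings are trivial; at a million records, a good blocking key is the difference between billions of comparisons and a few million.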
The art is in the balance:
| Problem | Cause |
|---|---|
| Missing matches | Blocking too aggressive—true pairs never compared |
| Slow performance | Blocking too loose—comparing pairs that obviously don't match |
You could implement Fellegi-Sunter from scratch, but Splink handles the hard parts:
| Feature | What it does |
|---|---|
| Unsupervised training | Estimates m/u probabilities without labeled data (Expectation-Maximization) |
| Multiple backends | DuckDB (laptop), Spark (cluster), AWS Athena (serverless) |
| Fuzzy matching | Jaro-Winkler, Levenshtein, phonetic matching built-in |
| Term frequency | Automatically down-weights common values like "José García" |
| Visualization | Inspect match weights, debug blocking rules, understand model behavior |
It's open source, built by the UK Ministry of Justice, and battle-tested on national-scale datasets including the 2021 Census.
Entity resolution isn't an academic exercise. In financial services, getting it wrong has real consequences:
| Use Case | Cost of Getting It Wrong |
|---|---|
| KYC/AML compliance | Fragmented risk profiles, regulatory failures |
| Fraud detection | Missed patterns across linked identities |
| M&A data migration | Duplicate customers in merged systems |
| Regulatory reporting | Incorrect unique customer counts |
The cost of false negatives (missed matches) is fragmented data and compliance risk. The cost of false positives (incorrect matches) is merged records for different people—a data quality nightmare.
Probabilistic matching lets you tune the tradeoff for your specific use case.
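One common way to tune it is two thresholds instead of one binary cutoff: auto-accept high scores, auto-reject low ones, and route the gray zone to human review. The cutoff values here are placeholders; in practice you set them against your false-positive and false-negative costs:

```python
def decide(score: float, upper: float = 10.0, lower: float = 0.0) -> str:
    """Three-way decision: auto-match, auto-reject, or send to human review.
    Thresholds are illustrative; tune them to your error costs."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "review"

print(decide(14.2))  # "match"
print(decide(5.1))   # "review" -- the gray zone
print(decide(-3.0))  # "non-match"
```

Raising the upper threshold trades missed matches for fewer wrongly merged records; lowering it trades the reverse.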