Multi-model AI scoring, explained.
Why three frontier models score every Locata candidate independently — and why disagreement between them is more informative than any single confidence number.
There's a temptation, when building on top of frontier AI models in 2026, to pick one. Pick the model that scores best on your benchmark, plug it in, run.
Single-vendor pipelines are simpler to reason about, easier to deploy, faster to invoice. They also throw away the most useful signal available to a serious decision-support system: the signal of disagreement between independent reasoners.
This piece is the case for ensemble scoring as a design choice — why Locata runs every candidate location through Claude, GPT, and Gemini independently, and why we treat the spread between their scores as a first-class output rather than noise to average away.
The single-model trap
A scoring model assigns a number to a candidate. The number comes with some sense of confidence — sometimes explicit, sometimes implicit. The buyer of the score, looking at a ranked list, treats a high-confidence, high-score result as the green light to proceed.
The trap is straightforward: the confidence number is a property of the model's internal state, not of the underlying decision.
Two failure modes follow. The first is systematic confidence on a wrong axis. A model trained on patterns that happen to correlate with a particular kind of urban geography will be confident on candidates that look like that geography, and less confident on candidates that don't — regardless of whether the geography is actually predictive of the decision you care about. You don't see the systematic bias; you see a ranked list with high-confidence scores at the top.
The second is overconfidence under data sparsity. When the enrichment data is thin — say, for a rural candidate where Street View is sparse and demographic granularity is coarse — a single model still has to produce a confidence. It tends to be confident anyway, because that's what training optimised for. You don't see the data gap; you see a score.
Both failure modes are invisible from inside one model. They become visible when you ask another model the same question.
What three models give you
Three independent scorers give you three things that one cannot.
Agreement as a corroboration signal. When Claude, GPT, and Gemini independently score a location at 89, 91, and 88, that agreement is meaningful in a way that no single model's "high confidence" can be. The models were trained on different data with different objectives. Agreement across all three on a numerical scoring task is a structural argument that the underlying signals are real.
Disagreement as a flag. When the same three models score a candidate at 84, 62, and 47, the spread tells you something the average obscures: the data underneath is ambiguous, the prompt missed a constraint, or one model is keying off something the others aren't. Whichever it is, that candidate needs a human read, not the average score plus an "acceptable" confidence.
Reasoning triangulation. Each model produces not just a score but the reasoning behind it: top three reasons, top two risks, citations to the enrichment data it used. When three models converge on similar reasoning, the rationale is robust. When they diverge on rationale even at similar scores, that's signal too — three different paths arrived at the same number, which tells you the candidate is overdetermined.
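To make the shape of this concrete, here is a minimal sketch of what one model's output might look like as a data structure. The `ModelScore` container, its field names, and the placeholder reason, risk, and citation strings are illustrative assumptions, not Locata's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ModelScore:
    """One model's independent read on one candidate (illustrative schema)."""
    model: str            # "claude", "gpt", or "gemini"
    score: int            # numerical score on the task's 0-100 scale
    reasons: list[str]    # top three reasons supporting the score
    risks: list[str]      # top two risks the model identified
    citations: list[str]  # enrichment-data fields the reasoning cites

# Three independent reads on the same candidate (scores from the example above;
# the reason, risk, and citation strings are placeholders, not real output):
reads = [
    ModelScore("claude", 89, ["placeholder reason"] * 3, ["placeholder risk"] * 2,
               ["demographics.density"]),
    ModelScore("gpt",    91, ["placeholder reason"] * 3, ["placeholder risk"] * 2,
               ["demographics.density"]),
    ModelScore("gemini", 88, ["placeholder reason"] * 3, ["placeholder risk"] * 2,
               ["streetview.coverage"]),
]
```

Comparing the entries in `reads` is the whole mechanism: same input, three independent outputs, and both the score spread and the rationale overlap fall out of the comparison.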
None of this works with one model. The whole point is comparing independent outputs against the same input.
Disagreement as a signal
The most counterintuitive part of multi-model scoring is treating disagreement as good news.
Start with the base rate: any sufficiently large scoring run will produce candidates the models disagree on. In a thousand-candidate run, perhaps fifty to a hundred candidates will show wide spreads. A traditional pipeline averages them away or filters them out. Locata surfaces them as a distinct output: these are the locations where your scoring methodology needs human review, and here is what each model said and why.
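A minimal sketch of that surfacing step, assuming a simple max-minus-min spread threshold. The threshold value, the function name, and the candidate IDs are illustrative, not Locata's actual logic.

```python
def needs_review(scores: list[float], max_spread: float = 15.0) -> bool:
    """Flag a candidate whose per-model spread exceeds the threshold,
    instead of averaging the disagreement away."""
    return max(scores) - min(scores) > max_spread

# Per-model scores for two candidates, reusing the examples above:
run = {"cand_0017": [89, 91, 88], "cand_0042": [84, 62, 47]}
review_queue = {cid: s for cid, s in run.items() if needs_review(s)}
# -> {"cand_0042": [84, 62, 47]}: surfaced for a human read,
#    alongside what each model said and why
```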
For a buyer running a screening process under time pressure, this changes what the system is for. It's no longer "the AI scored everything for you." It's "the AI gave you a confident ranking on the bulk of your candidates, and identified the ones where your judgement is the highest-leverage input." That's a more useful system than a higher-accuracy single-model ranker would be, because the candidates the single model would have ranked confidently but wrongly are now flagged for review.
There is a real cost: three model inferences per candidate, not one. The compute economics are roughly 3× per scoring run. For a thousand-candidate run that pencils out to a small overhead on top of a much larger engagement. For a fifty-thousand-candidate run we batch and pre-filter against hard constraints first, so the ensemble only runs against the ~5,000 that pass. The cost is paid in the right place.
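The arithmetic, as a sketch. The 10% pass rate is an assumption chosen to match the ~5,000 figure above; everything else follows from the numbers in this section.

```python
TOTAL_CANDIDATES = 50_000
HARD_CONSTRAINT_PASS_RATE = 0.10  # assumed: ~5,000 candidates survive pre-filtering
MODELS = 3

eligible = int(TOTAL_CANDIDATES * HARD_CONSTRAINT_PASS_RATE)
ensemble_calls = eligible * MODELS                 # 15,000 inferences
naive_ensemble_calls = TOTAL_CANDIDATES * MODELS   # 150,000 without pre-filtering

print(f"pre-filtered ensemble: {ensemble_calls:,} inferences")
print(f"naive ensemble:        {naive_ensemble_calls:,} inferences")
```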
What this doesn't fix
Ensemble scoring is not a fix for bad inputs. If the enrichment data is wrong, three models will be confidently wrong in agreement. If the scoring prompt is poorly specified, three models will produce reasonable-looking nonsense in unison. Garbage-in-garbage-out doesn't disappear when you triple the number of garbage-processors.
Ensemble scoring is also not a substitute for domain expertise. The prompts that drive the scoring are written with the customer's expert, not in isolation. The first half of any Locata engagement is prompt definition; the second half is the scoring run. The ensemble adds robustness on top of a well-defined task. It cannot create a well-defined task from a poorly-specified one.
What it does fix is the systematic-confidence-on-wrong-axis problem, the overconfidence-under-sparsity problem, and the absence-of-disagreement-signal problem. Those are three of the biggest pitfalls of single-model decision support — and they're three pitfalls that prompt-engineering tweaks alone cannot solve.
What it looks like in practice
A Locata report carries the spread alongside the rank. For each candidate:
- The composite score (a weighted aggregate of the three model outputs).
- The per-model scores, side by side.
- The agreement indicator (3/3, 2/3, 1/3), based on whether the per-model scores fall within a defined band (see the sketch after this list).
- The reasoning each model gave — overlapping bullets where they converged, divergent ones where they didn't.
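A minimal sketch of how the composite and the agreement indicator might be computed. The equal weights and the 10-point agreement band are assumptions for illustration; the actual weighting and band are not specified here.

```python
def composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted aggregate of the per-model scores."""
    total = sum(weights.values())
    return sum(scores[m] * weights[m] for m in scores) / total

def agreement(scores: dict[str, float], band: float = 10.0) -> str:
    """Count how many per-model scores fall within `band` points of the median."""
    vals = sorted(scores.values())
    median = vals[len(vals) // 2]
    within = sum(1 for v in vals if abs(v - median) <= band)
    return f"{within}/{len(vals)}"

per_model = {"claude": 84, "gpt": 62, "gemini": 47}   # the wide-spread example above
weights = {"claude": 1.0, "gpt": 1.0, "gemini": 1.0}  # assumed equal weighting

print(round(composite(per_model, weights), 1))  # 64.3
print(agreement(per_model))                     # "1/3"
```

Run the same two functions over the 89/91/88 example and you get a composite of 89.3 with a 3/3 indicator, which is the contrast the report is built to show.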
The convention in our reports is to colour 3/3-agreement candidates green and 1/3-agreement candidates yellow, with a short note explaining the spread. Reviewers learn quickly to focus their time on the yellow rows. The green rows have already been triangulated by three independent reasoners; the yellow rows are exactly where their attention is most valuable.
That redistribution of human attention — away from the bulk of obviously-good or obviously-bad candidates and toward the ambiguous middle — is the actual product of multi-model scoring. The accuracy improvement is real. The attention improvement is the bigger gain.
Single-model pipelines are simpler. They're also less informative on exactly the candidates that need the most informing. We picked the trade-off the other way.