There is a quiet assumption built into the way most search systems are evaluated. It says: the job of search is to find the right document, and a system that ranks the right document highly enough has done that job. The metrics — NDCG, MRR, recall at k — all reward this view. They are good metrics. They are also incomplete in a way that took us a long time to notice.

We built a search stack we are proud of. Lexical retrieval over the inverted index. Dense retrieval over a learned embedding space. Reciprocal rank fusion to combine the two without forcing their scores onto a single scale. A cross-encoder for reranking the top hundred candidates, trained on relevance pairs and run on GPU in roughly fifty milliseconds. A click-feedback loop with inverse propensity weighting to correct for the position bias that makes any naive CTR signal misleading. Each piece moved the headline metric by a few points. The composition was, by any standard ranking benchmark, very good.
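
The fusion step can be sketched in a few lines. This is a minimal illustration rather than the production code: the function name, the example document ids, and the constant k = 60 (a commonly used default) are all assumptions. It shows why no score normalization is needed, since only rank positions enter the sum.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids using rank positions only.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

lexical = ["d3", "d1", "d7"]  # hypothetical lexical (inverted index) order
dense = ["d1", "d9", "d3"]    # hypothetical dense (embedding) order
fused = reciprocal_rank_fusion([lexical, dense])
print(fused)
```

Documents that both retrievers rank highly rise to the top, and the raw lexical and embedding scores never have to be compared with each other.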

It was also, in a quieter and harder-to-measure way, often wrong. Not in the rankings, which were fine. Wrong in what the system did with the documents once it had them. The model would read three retrieved passages and write a confident summary in which it conflated two of them, or generalized beyond what any of them said, or added a small synthesizing claim that was not supported by anything we had retrieved. The retrieval was correct. The answer was not. None of our metrics caught this, because none of them measured the thing we actually cared about: whether the answer the user reads is true.

So we added one more step. After the model produced its draft, we ran a natural-language inference pass over each claim against each cited source. Not to decide whether the documents matched the query — that battle had already been fought — but to decide whether the claim being made was actually entailed by the cited evidence. Anything below an entailment threshold was either flagged for the user or stripped from the response.
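
A rough sketch of that verification pass, under stated assumptions: the `Claim` type, the 0.8 threshold, and the rule that a claim counts as supported when at least one cited source entails it are illustrative choices, not details given here. The scorer is passed in as a function. The toy scorer below is only a token-overlap placeholder so the example runs on its own. A real system would call an NLI model in its place.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Claim:
    text: str
    cited_sources: List[str]  # passages the draft cites for this claim

def verify_claims(
    claims: List[Claim],
    entail_score: Callable[[str, str], float],
    threshold: float = 0.8,  # assumed value, not taken from the article
) -> Tuple[List[Claim], List[Claim]]:
    """Split claims into (supported, unsupported).

    Assumption: a claim is supported if its best-scoring cited source
    reaches the threshold. A claim with no citations is unsupported.
    """
    supported, unsupported = [], []
    for claim in claims:
        best = max(
            (entail_score(src, claim.text) for src in claim.cited_sources),
            default=0.0,
        )
        (supported if best >= threshold else unsupported).append(claim)
    return supported, unsupported

def toy_entail_score(premise: str, hypothesis: str) -> float:
    """Placeholder scorer: share of hypothesis tokens found in the premise.

    This is NOT natural-language inference; it only stands in for a model.
    """
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    return sum(t in premise_tokens for t in hypothesis_tokens) / len(hypothesis_tokens)

source = "the index is rebuilt nightly at 2am"
claims = [
    Claim("the index is rebuilt nightly", [source]),
    Claim("the index is rebuilt every hour", [source]),
]
supported, unsupported = verify_claims(claims, toy_entail_score)
```

Whatever lands in `unsupported` is what would be flagged for the user or stripped from the response.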

The first time we ran this on our internal corpus, we found that something on the order of forty percent of synthesized claims were either weakly supported or unsupported by the documents the system had cited as sources. The retrieval had been right. The model had been wrong about what its own retrieval said. We were not running a search engine. We were running a confident hallucinator with a veneer of citations.
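
A measurement of this kind can be computed from per-claim entailment scores. The sketch below is illustrative only: the bucket boundaries (0.5 and 0.8) and the example scores are made up for the demonstration, and the function name is hypothetical.

```python
from collections import Counter

def support_breakdown(best_scores, weak_below=0.8, unsupported_below=0.5):
    """Return the share of claims in each support bucket.

    best_scores holds one number per claim: its best entailment score
    against the sources it cites. Both cutoffs are assumed values.
    """
    def bucket(score):
        if score < unsupported_below:
            return "unsupported"
        if score < weak_below:
            return "weak"
        return "supported"

    counts = Counter(bucket(s) for s in best_scores)
    n = len(best_scores)
    return {
        name: (counts[name] / n if n else 0.0)
        for name in ("supported", "weak", "unsupported")
    }

scores = [0.95, 0.9, 0.6, 0.3, 0.85]  # made-up scores, not real results
breakdown = support_breakdown(scores)
print(breakdown)
```

Unlike NDCG or MRR, this number is computed over the generated answer, not over the ranked list that fed it.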

The reframe

This shifted what we believed search was for.

The conventional view treats search as a coverage problem: given a query, surface the documents most likely to contain the answer. The implicit assumption is that the user — or a downstream LLM — will read those documents and form a judgment. The system's responsibility ends at the document.

The view we now hold treats search as a calibration problem: given a query, surface the documents and an honest accounting of what those documents actually establish. The system's responsibility extends to whether the answer it constructs is one a careful reader of those documents would also produce. Coverage is necessary. It is no longer sufficient.

The technical implications are unsubtle. You need verification, not only retrieval. You need a model that can decline to answer when its sources do not justify an answer. You need to make abstention a first-class action rather than a failure of recall. You need to be willing to tell the user "I found these documents but I could not write a confident summary of them," and to mean it.
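
One way to make abstention a first-class action is to give it its own value in the response type, so that it is not signalled by an empty result or an error. The sketch below is an assumption-laden illustration: the `Action` and `Response` types, the `respond` function, and the 0.9 cutoff are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class Action(Enum):
    ANSWER = "answer"
    ABSTAIN = "abstain"

@dataclass
class Response:
    action: Action
    text: str
    documents: List[str] = field(default_factory=list)

ABSTAIN_MESSAGE = (
    "I found these documents but I could not write a confident summary of them."
)

def respond(supported_claims, total_claims, documents, min_supported=0.9):
    """Answer only when enough of the draft survived verification.

    min_supported is an assumed cutoff on the share of supported claims.
    The retrieved documents are returned either way.
    """
    if total_claims == 0 or len(supported_claims) / total_claims < min_supported:
        return Response(Action.ABSTAIN, ABSTAIN_MESSAGE, list(documents))
    return Response(Action.ANSWER, " ".join(supported_claims), list(documents))

declined = respond(["The index is rebuilt nightly."], 3, ["doc1"])
answered = respond(["A holds.", "B holds."], 2, ["doc1", "doc2"])
```

Callers can then branch on `action` explicitly, and an abstention still carries the documents that were found.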

What this costs

This stack is slower. A query that would resolve in fifty milliseconds against a vanilla BM25 endpoint typically takes us between eight hundred milliseconds and a couple of seconds, depending on how many candidates survive into the verification stage. We do dense retrieval, sparse retrieval, fusion, cross-encoder rerank, generation, and entailment-checking. None of those steps is cheap.

We accept this. The alternative, fast and confidently wrong, is worse for users and, we believe, worse for the field. A system that spends an extra second and returns an answer its sources actually support has not made a compromise. It has made the trade you should have been making all along.

What it earns you

There is a category of question, one that includes most of the interesting questions, where the work is not in finding documents but in deciding what those documents support. Whether two studies actually agree. Whether a source from 2019 has been superseded. Whether a confident-sounding claim is rebutted in the next paragraph nobody read. Calibrated search is what makes these questions answerable, because it checks what the sources establish instead of only surfacing them.

The frame we now use, internally, is this: finding documents is easy; deserving the user's trust is the actual job. Every component of the search stack (ranking, fusion, reranking, verification) is in service of that second clause. The first, by itself, is no longer enough.