Why LLM hallucinations have an architectural floor
By Leonardo Leenen · April 26, 2026
In June 2023, a New York lawyer was sanctioned for filing a brief that cited six judicial decisions which did not exist. His tool of choice was ChatGPT. The story was reported widely. Nearly three years later, the equivalent of that brief is being generated daily — by internal compliance assistants, customer-facing chatbots, contract review tools, and analytical agents — across many of the large enterprises that have begun pilot programs with large language models (LLMs).
The lawyer's case was visible because it reached a courtroom. Most enterprise hallucinations do not. They land in internal reports, draft policies, customer responses, and decision support materials. Some are caught in review. Many are not.
A common reading of these incidents treats hallucinations as a quality problem — something that better prompts, better data, or the next model release will fix. There is a different reading, supported both by the empirical literature on commercial LLM products and by recent theoretical work, that points elsewhere: hallucinations are an architectural property of the current generation of LLMs, and they have a floor that cannot be reduced to zero by engineering alone.
This article explains why. It is written for technical and governance leaders who need to make architectural decisions about LLM-based systems with the actual constraints in view, rather than the marketing version of those constraints.
What "architectural" means here
Saying that hallucinations are architectural is a specific claim. It means they emerge from the design of the model itself — from the training objective, from the way knowledge is stored, from the absence of an internal mechanism for declared uncertainty — and that no fix at the input or output level can fully eliminate them within the current generation of LLMs.
The implication for an enterprise architect is concrete. A system designed under the assumption that hallucinations will be eliminated by the next model is structurally fragile. A system designed under the assumption that hallucinations will occur at a measurable rate, and the architecture's job is to detect and contain them, is governable.
Five mechanisms underpin the claim.
1. The training objective rewards plausibility
Next-token prediction asks the model to assign high probability to the continuation that actually appears in the training corpus. The loss function rewards outputs that match the statistical distribution of that corpus. Truth, in any principled sense, is not part of the optimization target.
A statement can be plausible (high probability under the training distribution) and false. A statement can also be true but rare in the training corpus, and therefore receive low probability. The model has no principled way to distinguish these cases. The optimization target produces a system that generates fluent, contextually appropriate sequences. The fluency is independent of factual correctness.
This is the foundation. Every other mechanism builds on it.
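To make the point concrete, here is a minimal sketch of the pretraining loss, with toy probabilities chosen for illustration. The only quantity the objective sees is the probability the model assigned to the token that actually appears in the corpus; nothing in the computation represents whether that token makes the statement true.

```python
import math

# Toy next-token setting: the model has produced a probability distribution over
# the vocabulary for the position after "The capital of France is".
model_probs = {"Paris": 0.90, "Lyon": 0.07, "Berlin": 0.03}

def next_token_loss(corpus_token: str) -> float:
    """Cross-entropy contribution of one position: -log p(token observed in the corpus)."""
    return -math.log(model_probs[corpus_token])

# Training minimizes this loss against whatever the corpus contains. If a training
# document asserts something false, the model is still rewarded for matching it;
# if a true statement is rare in the corpus, assigning it low probability costs little.
print(round(next_token_loss("Paris"), 3))   # 0.105 -> matches the common continuation
print(round(next_token_loss("Berlin"), 3))  # 3.507 -> penalized only for mismatching the corpus
```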
2. There is no native representation of declared uncertainty
When the statistical pattern in the training data is weak, sparse, or contradictory, the model continues to generate. It does not pause. It does not abstain. The output reads with the same surface confidence whether the model is recalling a well-attested fact or interpolating across thin evidence.
Work from Anthropic on calibration (Kadavath et al., 2022) showed that LLMs do have some internal sense of when their answers are uncertain — at the level of their probability distributions over tokens. The post-training process, particularly RLHF, then optimizes the model to produce assertive, helpful responses. Users tend to prefer answers over admissions of uncertainty, so the training signal pushes the model toward assertion. The internal calibration exists. The verbal output does not reflect it.
For an enterprise system, this means the LLM cannot be relied on to flag its own uncertainty. The signal that an answer is unreliable has to come from outside the model.
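One way to build that outside signal is to read the model's own token-level probabilities and treat low-probability spans as candidates for verification. The sketch below assumes a hypothetical response format in which each generated token arrives with its log-probability; the threshold and the example values are illustrative, not calibrated.

```python
from dataclasses import dataclass

@dataclass
class TokenLogprob:
    token: str
    logprob: float  # natural-log probability the model assigned to the sampled token

def flag_low_confidence(tokens: list[TokenLogprob], threshold: float = -2.5) -> list[str]:
    """Return the tokens the model itself assigned low probability to.

    The verbal output will not mention this uncertainty; the signal has to be
    read out of the distribution and surfaced by the surrounding system.
    """
    return [t.token for t in tokens if t.logprob < threshold]

# Made-up values: the case name is the low-confidence span in an otherwise fluent answer.
answer = [
    TokenLogprob("The", -0.1), TokenLogprob("ruling", -0.4), TokenLogprob("in", -0.2),
    TokenLogprob("Smith", -3.8), TokenLogprob("v.", -0.3), TokenLogprob("Jones", -3.1),
]
print(flag_low_confidence(answer))  # ['Smith', 'Jones'] -> route to citation verification
```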
3. Compression is lossy by definition
A modern LLM compresses trillions of training tokens into billions of parameters. The ratio is several orders of magnitude. Compression at that scale is irreversible — the model cannot reconstruct its training data faithfully on demand.
When asked about a specific fact, the model generates the answer by reconstruction from its compressed parametric representation. The retrieval analogy does not apply at the parameter level — there are no records to look up, only weights to evaluate. For well-attested facts repeated across many documents, the reconstruction tends to be accurate. For sparse facts, edge cases, or compositions of facts that did not appear together in training, the reconstruction fills the gap with statistical interpolation. That interpolation is the mechanism of hallucination. The same process produces both the accurate answers and the fabricated ones; the distinction is only visible after the fact.
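The scale of the loss can be made concrete with rough arithmetic. The figures below are assumed round numbers for illustration, not the specification of any particular model.

```python
# Illustrative values only: token count, bytes per token, parameter count, and
# weight precision are all assumptions chosen to show the order of magnitude.
training_tokens = 15e12      # ~15 trillion training tokens
bytes_per_token = 4          # a token is roughly four characters of English text
corpus_bytes = training_tokens * bytes_per_token    # ~60 TB of raw text

parameters = 70e9            # a 70-billion-parameter model
bytes_per_parameter = 2      # 16-bit weights
model_bytes = parameters * bytes_per_parameter      # ~140 GB of weights

print(f"~{corpus_bytes / model_bytes:.0f}:1")  # ~429:1 -- the corpus cannot fit in the weights
```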
4. There is a formal impossibility result
In January 2024, Xu, Jain, and Kankanhalli published Hallucination is Inevitable: An Innate Limitation of Large Language Models. They construct a formal argument: for any computable language model, there exist problems that the model cannot learn, which guarantees the existence of inputs on which the model will hallucinate.
The result is theoretical and worth treating with appropriate scope. The claim is bounded — no model within the current generation can be made hallucination-free across all inputs. The result does not predict that every output will be wrong, and it does not preclude future architectural changes that escape the assumptions of the proof. Within the current generation of LLMs, hallucination mitigation is a question of degree rather than of elimination.
For governance purposes, this changes the conversation. A vendor claim of "zero hallucinations" is a claim that contradicts a published impossibility result. The burden of evidence on such claims should be set accordingly.
5. Training incentives reward guessing
In a paper from September 2025, Why Language Models Hallucinate, OpenAI researchers argue that the persistence of hallucinations is partly explained by how models are evaluated and trained. Most benchmarks use binary grading: full credit for a correct answer, zero for anything else, with no separate credit for abstention. A model that says "I don't know" gets the same score as a model that gives a wrong answer, and a model that guesses correctly is rewarded as much as a model that knows the answer.
Under this incentive structure, a well-optimized model learns that guessing pays. The behavior that emerges in deployment is consistent with the training reward: when the model is unsure, it produces an answer with the surface form of certainty.
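The incentive can be shown with a two-line expected-value calculation. The grading scheme below (one point for a correct answer, zero for anything else) matches the binary scoring described above; the guess-success probability is an assumed illustrative value.

```python
# Binary benchmark grading: 1 point if correct, 0 for a wrong answer, 0 for "I don't know".
p_lucky_guess = 0.2   # assumed: when unsure, a guess happens to be right 20% of the time

expected_score_abstain = 0.0
expected_score_guess = p_lucky_guess * 1.0 + (1 - p_lucky_guess) * 0.0

print(expected_score_guess > expected_score_abstain)  # True
# As long as a wrong answer is not scored below an abstention, any nonzero chance
# of a lucky guess makes guessing the optimal policy for the model being trained.
```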
The implication is structural at a different layer. The fix lives in two places — in model architecture, and in how the industry evaluates progress. Until benchmarks systematically reward abstention, models will continue to be trained to guess.
The RAG ceiling
The most common proposed mitigation for hallucinations in enterprise contexts is retrieval-augmented generation (RAG): retrieve relevant documents, pass them to the model as context, and have the model generate its answer grounded in that retrieved content. Several vendors have marketed RAG-based systems as "hallucination-free" or as having "eliminated" the problem.
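For readers who have not built one, the pattern reduces to a few lines. The sketch below uses hypothetical `retrieve` and `generate` stubs standing in for a search index and an LLM API call; it exists only to make the data flow explicit and to mark where the mechanisms above still operate.

```python
def retrieve(query: str, k: int = 5) -> list[str]:
    """Placeholder for vector or keyword search over the document store."""
    return ["<passage 1>", "<passage 2>"]  # stub: canned passages so the sketch runs

def generate(prompt: str) -> str:
    """Placeholder for the LLM API call."""
    return "<model answer>"  # stub

def rag_answer(query: str) -> str:
    passages = retrieve(query)
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    # The five mechanisms above still operate at this step: the model can misread,
    # ignore, or extrapolate beyond the retrieved passages, and the retrieval step
    # can return irrelevant or incomplete context in the first place.
    return generate(prompt)

print(rag_answer("What notice period does the 2023 master services agreement require?"))
```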
The most rigorous empirical evaluation of these claims comes from a Stanford and Yale team — Magesh, Surani, Dahl, Suzgun, Manning, and Ho — in Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. The study preregistered over 200 legal queries and ran them against the leading commercial AI legal research tools (LexisNexis's Lexis+ AI and Thomson Reuters's Ask Practical Law AI), with GPT-4 as a baseline.
The reported hallucination rate for the commercial tools was above 17%. Thomson Reuters's tool returned incomplete answers more than 60% of the time. These are tools built on bounded domains, on curated authoritative data accumulated over decades, by engineering teams with substantial NLP expertise.
The reading from this evidence: RAG reduces hallucinations relative to a bare LLM, and it does so meaningfully. The reduction has a measurable floor above zero. The five mechanisms described above remain in operation regardless of whether the context window contains retrieved documents. A separate article will examine why RAG fails in specific ways and what architectural choices mitigate those failures.
Implications for governance and architecture
If the floor is real, the engineering question changes. Detection, containment, and audit evidence become the design center of gravity, replacing prevention as the primary focus.
This shift moves the design effort from prompt engineering and model selection to verification, monitoring, and post-generation control. It treats the LLM as a probabilistic component whose outputs need to be checked by other systems — and in some cases by humans — before reaching a decision-relevant context.
For European enterprises, this framing aligns with what the EU AI Act requires for high-risk AI systems. Articles 9 through 15 specify obligations around risk management, data governance, transparency, human oversight, and post-market monitoring. Each of these obligations requires operational evidence, generated continuously. The evidence emerges from the system's behavior over time and from the controls placed around it.
What to do about it
For technical and governance leaders making decisions today, the analysis above points to these actions:
- Design with a non-zero hallucination assumption. Treat the hallucination rate as a parameter that needs to be measured, bounded, and reported. An architecture that depends on the LLM being right on its own does not survive contact with production traffic.
- Invest in detection more than in prevention. Verification systems — citation checking, factual cross-validation, consistency checks across multiple generations — return more value per engineering hour than additional prompt iteration; a sketch of the cross-generation consistency check appears after this list.
- Audit continuously in production. Hallucinations appear in the long tail of real query distributions. Pilot evaluations describe pilot conditions; a 100-prompt evaluation set says very little about behavior at scale.
- Build adversarial test sets. Include false-premise queries, out-of-distribution inputs, and compositional edge cases. The commercial tools studied by Stanford failed precisely on these categories.
- Track abstention rate as a primary metric. A system that says "I don't know" when appropriate is healthier than one that always answers. Refusal capacity should be measured and reported alongside accuracy.
- Calibrate human-in-the-loop coverage to risk level. High-stakes outputs require human review. Lower-risk outputs can pass through with monitoring only. Define the threshold explicitly, document it, and make it auditable.
- Question vendor "hallucination-free" claims. Ask for the evaluation methodology, the dataset, and the reported error rate. If those artifacts do not exist, the claim should be treated as unsupported.
- Generate AI Act–ready documentation from day one. Risk management files, post-market monitoring records, and human oversight evidence are easier to build incrementally than to reconstruct under regulatory pressure.
- Make uncertainty visible in the user interface. If the system cannot reliably know its own confidence, the user should be told that explicitly. A disclaimer in fine print does not satisfy this.
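As a concrete example of the detection investment recommended above, the sketch below implements a consistency check across multiple generations. `ask_model` is a hypothetical LLM call, exact-string matching of answers is a simplification (a production system would compare answers semantically), and the thresholds are illustrative.

```python
from collections import Counter

def ask_model(question: str) -> str:
    """Placeholder for an LLM API call returning a short answer string."""
    raise NotImplementedError("wire up your model client here")

def consistency_check(question: str, n: int = 5, min_agreement: float = 0.8) -> tuple[str, bool]:
    """Sample the same question n times; accept the majority answer only if agreement is high.

    Low agreement is a signal to abstain or escalate to human review, not proof of a
    hallucination; high agreement is not proof of correctness. The metric is cheap,
    model-agnostic, and does not depend on the model reporting its own uncertainty.
    """
    answers = [ask_model(question) for _ in range(n)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, (count / n) >= min_agreement
```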
The reading of the current evidence is that the industry has reached a point where the architectural limits of LLM-based systems are visible, both empirically and theoretically. Decisions made on the basis of those limits will produce more durable systems than decisions made on the basis of vendor promises that contradict them.
References
- Magesh, V., Surani, F., Dahl, M., Suzgun, M., Manning, C. D., & Ho, D. E. (2024). Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. Stanford University & Yale University. arXiv preprint.
- Xu, Z., Jain, S., & Kankanhalli, M. (2024). Hallucination is Inevitable: An Innate Limitation of Large Language Models. arXiv:2401.11817.
- Kadavath, S., et al. (2022). Language Models (Mostly) Know What They Know. Anthropic. arXiv:2207.05221.
- Kalai, A. T., et al. (2025). Why Language Models Hallucinate. OpenAI.
- European Parliament and Council. (2024). Regulation (EU) 2024/1689 — Artificial Intelligence Act. Articles 9–15.
- Weiser, B. (2023). Here's What Happens When Your Lawyer Uses ChatGPT. The New York Times, May 27, 2023.