You Can't Validate What Never Gives the Same Answer Twice
Your model risk team wants to validate your LLM. They run the same test twice. They get different answers. Meeting adjourned.
This is not a hypothetical scenario. It’s what happens when you try to apply traditional model validation frameworks to language models. The core assumption that makes validation possible—that the same input produces the same output—doesn’t hold.
How Traditional Validation Works
Model Risk Management validation isn’t complicated in concept. You have a model. You want to know if it works correctly and safely. So you:
- Run the model on a test dataset with known outcomes
- Measure accuracy, precision, recall, whatever metrics matter for your use case
- Document the results
- Have independent validators reproduce those results
- Track performance over time to detect drift
- Regression test when you make changes
This all assumes something obvious: same input → same output. If you feed the model the same data, it gives you the same answer. That’s what makes measurement meaningful. That’s what makes independent validation possible. That’s what makes before/after comparisons valid.
LLMs break this assumption.
Non-Determinism Isn’t a Bug
When your credit scoring model gives different answers to the same application, you have a bug. When your LLM gives different answers to the same prompt, you have… a language model working as designed.
Non-determinism comes from multiple sources. Temperature settings introduce randomness to make outputs more creative and less repetitive. Sampling strategies (top-p, top-k) select probabilistically from candidate tokens. Even at temperature zero, GPU parallelism can introduce variance—different execution orders produce slightly different floating-point results, which cascade through the attention layers.
The result: same prompt, same model, same day → potentially different output. Same prompt, same model, different day → almost certainly different output.
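You can see this for yourself by replaying an identical request a handful of times and diffing the completions. A minimal sketch, assuming an OpenAI-compatible chat API in Python; the model name and prompt below are placeholders, not a recommendation:

```python
# Replay the identical request several times and compare the completions.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Should this loan application be approved? "
    "Answer APPROVE or DECLINE with one sentence of reasoning."
)

outputs = []
for _ in range(5):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,  # even at temperature 0, outputs can differ across runs
    )
    outputs.append(response.choices[0].message.content)

# If the model were deterministic, this Counter would have exactly one entry.
print(Counter(outputs))
```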
This isn’t something you can configure away without significant trade-offs. The randomness is part of why the outputs are coherent and natural. Remove it entirely and you get repetitive, mechanical text that fails at the actual job.
The Practical Consequences
Let’s make this concrete with a scenario every bank AI team will recognize.
Scenario 1: Validating a credit decisioning LLM
You’re building a system that reads loan applications and outputs approval recommendations. Your validation team runs 1,000 test cases. The model achieves 94% accuracy—solid performance.
One week later, they re-run the same 1,000 test cases to verify. The model achieves 91% accuracy.
What happened? Did the model get worse? Is there a bug in your pipeline? Or is this just sampling variance from non-deterministic outputs?
You don’t know. More importantly, you can’t know, because you can’t distinguish signal from noise. The independent validators can’t reproduce your results. The auditor asks “what’s the error rate?” and you have to say “somewhere between 91% and 94%, depending on when you measure.”
That’s not a defensible answer in a regulatory examination.
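One way to at least quantify the problem is to treat the evaluation itself as a random variable: repeat the full test suite several times against the unchanged model and measure the spread. A rough sketch, where `run_eval_suite` is a hypothetical function that scores all 1,000 cases once and returns accuracy as a float:

```python
import statistics

def estimate_noise_floor(run_eval_suite, n_runs: int = 10) -> tuple[float, float]:
    """Return an approximate 95% band for suite accuracy under repetition."""
    scores = [run_eval_suite() for _ in range(n_runs)]
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)
    # Rough normal approximation: anything inside this band is indistinguishable
    # from sampling variance, so a single 94% or 91% reading carries very little
    # information on its own.
    return mean - 2 * stdev, mean + 2 * stdev
```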
Scenario 2: Regression testing after a prompt update
You’ve refined the system prompt to handle edge cases better. Time to verify the change worked as intended.
You run your test suite before and after. The results are… different. Some cases improved, some got worse, the overall numbers moved a bit.
Is that the prompt change? Or just noise? You can’t isolate the effect of your modification from the baseline variance. Every comparison is confounded by non-determinism.
This makes iterative improvement essentially guesswork. You’re trying to navigate by a compass that spins randomly.
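If you are willing to pay for repeated runs, you can at least ask whether a before/after difference clears the noise. A sketch of a simple permutation test, where `before` and `after` are hypothetical lists of accuracy scores from several repeated runs of the same suite on each prompt version:

```python
import random

def permutation_p_value(before: list[float], after: list[float],
                        n_perm: int = 10_000) -> float:
    """p-value for 'the prompt change moved mean accuracy' vs. baseline noise."""
    observed = abs(sum(after) / len(after) - sum(before) / len(before))
    pooled = before + after
    hits = 0
    for _ in range(n_perm):
        # Shuffle the pooled scores and split them into two arms at random;
        # how often does chance alone produce a gap as large as the one observed?
        random.shuffle(pooled)
        a, b = pooled[: len(before)], pooled[len(before):]
        diff = abs(sum(b) / len(b) - sum(a) / len(a))
        if diff >= observed:
            hits += 1
    return hits / n_perm
```

A large p-value means the apparent improvement is indistinguishable from baseline variance; with only a single run per prompt version, the question cannot be answered at all.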
What SR 11-7 Actually Requires
Let’s look at what US bank regulators expect for model validation. SR 11-7 requires:
Conceptual soundness: Can you explain why the model should work? Does it make theoretical sense? For LLMs, this is already challenging—but at least it’s a one-time analysis.
Outcomes analysis: Does the model perform well on actual data? This requires measuring performance, which requires stable measurements. Non-determinism breaks this.
Independent validation: Can a third party reproduce your results? Non-determinism breaks this.
Ongoing monitoring: Can you detect when the model’s behavior changes? This requires distinguishing real drift from sampling variance. Non-determinism breaks this.
Statistical testing: Are your performance claims statistically meaningful? Non-determinism inflates your confidence intervals to the point of uselessness.
Four of the five core requirements become impossible or severely compromised.
The Monitoring Problem
Even after you somehow get the model validated and deployed, you face ongoing monitoring. You need to detect drift—when the model’s behavior changes in ways that matter.
With traditional models, drift detection is straightforward. You track key metrics over time. When they move outside expected bounds, you investigate.
With non-deterministic LLMs, what counts as drift? If your daily accuracy measurement varies from 90% to 94% due to sampling variance, a drop to 89% might be noise or might be signal. You’ve lost the ability to detect real degradation until it becomes so severe that it breaks through the noise floor.
You’re trying to spot a pattern in static. By the time the signal is clear, the damage is done.
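The best you can do is widen "expected bounds" to reflect the measured noise floor and accept the loss of sensitivity that follows. A sketch of a control-chart style check, where `baseline_scores` is a hypothetical list of accuracy scores from repeated validation runs:

```python
import statistics

def looks_like_drift(daily_accuracy: float, baseline_scores: list[float],
                     k: float = 3.0) -> bool:
    """Flag a daily reading only when it falls k standard deviations below baseline."""
    mean = statistics.mean(baseline_scores)
    stdev = statistics.stdev(baseline_scores)
    # With a wide noise floor, real degradation can sit inside the band for a
    # long time before this ever fires; that lag is the monitoring problem.
    return daily_accuracy < mean - k * stdev
```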
Why This Matters
Banks aren’t avoiding LLMs for Tier 1 use cases because they’re overly cautious or technologically conservative. They’re avoiding them because the fundamental validation framework doesn’t work.
SR 11-7 was written for a world where models behave consistently. It assumes you can measure, reproduce, and compare. When those assumptions fail, the entire framework breaks down.
This isn’t a problem you can solve with more documentation or better test cases or larger sample sizes. It’s a structural incompatibility between how LLMs work and how model validation works.
The previous post in this series covered the commercial stakes—the billions in value locked behind the regulatory wall. This post explains one of the technical barriers creating that wall.
The next post covers another barrier: the opacity of proprietary models. Even if you had deterministic outputs, you’d still struggle with SR 11-7’s documentation requirements when the model vendor won’t tell you what’s inside.
Agent Router Enterprise provides AI Guardrails and behavioral metrics for continuous agent supervision—but meaningful measurement still requires outputs you can actually measure consistently. The infrastructure layer can enforce policies and capture telemetry, but the validation problem is upstream: you need deterministic behavior before you can validate reliably. Learn more here ›