The Tier 3 Problem: Why Banks Can't Use LLMs for Real Decisions
Banks are leaving $920bn in operational efficiency gains on the table because they can't get LLMs past their risk teams. The issue isn't caution—it's that current LLMs can't satisfy SR 11-7 requirements.
Morgan Stanley estimates that generative AI could unlock around $920 billion in annual operational efficiency gains for financial services. Banks are capturing approximately none of that.
Not because they’re not trying. Every major bank has an AI strategy, a Center of Excellence, probably a Chief AI Officer by now. They’ve got pilots. They’ve got proofs of concept. What they don’t have is LLMs making real decisions in production.
The Tier 3 Ghetto
If you’ve spent time in a bank’s AI governance process, you’ve encountered the tiering system. It goes something like this:
Tier 3: Low-risk, generative use cases. Chatbots for internal IT help desks. Document summarisation. Code assistance for developers. Meeting note generators. Nice to have, low stakes.
Tier 2: Decision-support use cases. Systems that help humans make decisions but don’t make decisions themselves. An analyst reviews the output before anything happens.
Tier 1: Automated decision-making. The system makes or directly drives decisions without human review. Credit adjudication. Risk classification. Compliance determinations.
The economic value lives in Tier 1 and Tier 2. That’s where you replace manual processes, reduce cycle times, and scale decisions that currently require expensive human judgment.
Most bank LLM deployments are stuck in Tier 3.
The chatbots are fine. The summarisation tools are genuinely useful. But these are productivity enhancements, not transformation. The gap between “our developers have GitHub Copilot” and “our credit decisions are automated” is measured in billions of dollars—and banks can’t close it.
What Banks Actually Want to Do
The use cases sitting in the “we’d love to, but we can’t” pile are substantial:
Automated credit decisioning. Not “assist the analyst”—actual yes/no adjudication. Structured credit underwriting with deterministic extraction, scoring, and narrative justification. SME and commercial credit assessment based on document analysis.
KYC/AML/Fraud classification. Stable risk signals. Repeatable decisions that produce the same answer when you run them twice. Systems that can be audited because they behave consistently.
Document intelligence in regulated workflows. Extraction that feeds downstream scoring models. Income verification, employment confirmation, risk-relevant feature extraction from unstructured documents. Not as an internal convenience tool—as part of the actual decision pipeline.
Compliance automation. Consistent interpretation of regulatory rules. Deterministic classification of transactions, communications, or client activities.
Supervisory reporting. Reliable data extraction from documents that feeds regulatory reporting pipelines. The kind of thing you really don’t want to get wrong.
Each of these represents meaningful operational leverage. Each is squarely within what LLMs can technically do. And each is effectively off-limits under current governance frameworks.
The SR 11-7 Wall
The obstacle isn’t that banks are excessively cautious (though they are, and as a customer I appreciate it). The obstacle is regulatory.
In the US, the governing framework is SR 11-7, the Federal Reserve's supervisory guidance on model risk management, issued in 2011 out of the wreckage of 2008. It turns out that when half your banks are running critical risk calculations in Excel spreadsheets that get emailed around weekly, you might want some standards.
In practice, SR 11-7 reaches any method, system, or approach that processes input data to produce estimates, scores, classifications, or decisions. That's broad. Critically, banks' model risk frameworks extend it to non-numeric outputs used to make or support decisions. Your LLM doesn't have to output a number to fall in scope; it just has to influence something that matters.
The regulation itself is sensible. It requires things like:
- A clear definition of the model and how it works
- Comprehensive validation including independent review
- Understanding of the training data and its limitations
- Stable, reproducible behaviour that can be tested
- Documentation sufficient for auditors and examiners
These are reasonable requirements for systems that make consequential decisions. The problem is that current LLMs can’t satisfy them.
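SR 11-7 also expects every in-scope model to sit in a firm-wide inventory with its lineage, limitations, and validation history attached. As a rough illustration of what that looks like in engineering terms, here is a minimal sketch of such a record; the field names are ours, not prescribed by the guidance, and the example values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ModelRecord:
    """Illustrative model-inventory entry. Field names are ours, not SR 11-7's."""
    model_id: str                  # stable identifier in the bank's model inventory
    version: str                   # the model/weights version actually deployed
    weights_hash: str              # checksum of the exact artefact that was validated
    intended_use: str              # the decision the model is approved to support
    training_data_lineage: str     # where the training data came from and its known gaps
    known_limitations: list[str] = field(default_factory=list)
    validation_report: str = ""    # reference to the independent validation document
    last_validated: str = ""       # date of the last full validation
    owner: str = ""                # accountable model owner

# Hypothetical entry for a document-extraction model in a credit workflow.
record = ModelRecord(
    model_id="doc-extract-001",
    version="2025.01",
    weights_hash="sha256:d41d8c...",
    intended_use="Income verification feature extraction for SME credit",
    training_data_lineage="Internal data-lineage register entry DL-4821",
    known_limitations=["Untested on handwritten documents"],
    validation_report="VAL-2025-017",
    last_validated="2025-01-15",
    owner="Model Risk / Retail Credit",
)
```

For a traditional scorecard, every field in that record is routine to produce. The next section is about why, for current LLMs, several of them can't be filled in honestly.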
The Gap
Here’s where banks are stuck:
You can’t validate what you can’t reproduce. LLMs as typically served are non-deterministic: run the same prompt twice and you can get different outputs, by design when sampling and, even at temperature 0, through batching and floating-point effects in the serving stack. Traditional model validation assumes you can measure accuracy, track drift, and regression-test changes. Non-determinism breaks all of that.
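To make that concrete, here is a minimal sketch of the kind of reproducibility check a validator would want to run. `call_model` is a placeholder for whatever inference endpoint the bank actually uses; nothing here is specific to one vendor.

```python
import hashlib

def call_model(prompt: str) -> str:
    """Placeholder for the bank's actual LLM inference call."""
    raise NotImplementedError

def reproducibility_check(prompt: str, runs: int = 20) -> bool:
    """True only if every run of the same prompt yields byte-identical output."""
    digests = set()
    for _ in range(runs):
        output = call_model(prompt)
        digests.add(hashlib.sha256(output.encode("utf-8")).hexdigest())
    return len(digests) == 1
```

Against a typical sampling-based serving stack, this check fails, and with it any regression test built on exact output comparison.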
You can’t document what you can’t see. Proprietary foundation models don’t disclose their training data. “Trust me, the data is representative” doesn’t satisfy examiners who want curated data lineage and documented limitations.
You can’t control what vendors change without telling you. Foundation model providers update models silently. Your January validation might be irrelevant by March. The thing you tested isn’t the thing running in production.
You can’t explain what you don’t understand. SR 11-7 implicitly requires explainability—a defensible theory of why the model behaves as it does. LLMs can’t provide causal explanations for their outputs.
Banks aren’t being difficult. They’re looking at the regulatory requirements, looking at the capabilities of current LLMs, and correctly concluding that the gap is unbridgeable. So they deploy chatbots and wait.
There’s a Way Through
This is the first in a series of posts exploring how that gap might close.
The short version: recent research has demonstrated deterministic LLM inference under controlled conditions. Determinism doesn’t solve every SR 11-7 challenge, but it solves the reproducibility problem—which unlocks validation, monitoring, and change management.
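Concretely, once inference is bit-reproducible, change management starts to look like ordinary regression testing against golden outputs. A minimal sketch, with a hypothetical `call_model_pinned` standing in for a deterministic, version-pinned inference endpoint:

```python
import hashlib

def call_model_pinned(prompt: str) -> str:
    """Placeholder for a deterministic, version-pinned inference endpoint."""
    raise NotImplementedError

def _digest(prompt: str) -> str:
    return hashlib.sha256(call_model_pinned(prompt).encode("utf-8")).hexdigest()

def baseline(prompts: list[str]) -> dict[str, str]:
    """Record golden output hashes for the validated prompt suite."""
    return {p: _digest(p) for p in prompts}

def regression_test(golden: dict[str, str]) -> list[str]:
    """Re-run the suite; return the prompts whose behaviour has changed."""
    return [p for p, expected in golden.items() if _digest(p) != expected]
```

Record the baseline at validation time; before any model, prompt, or serving change goes live, require `regression_test(golden) == []`. Any divergence is a detected change that goes back through validation, which is exactly the property traditional Model Risk Management assumes.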
Combine determinism with transparency (open-weight models with documented training data), and suddenly LLMs start looking like systems that can sit inside traditional Model Risk Management frameworks.
In the next post, we’ll dig into why non-determinism specifically breaks MRM validation. After that, we’ll cover the transparency problem. And finally, we’ll look at the specific use cases that become tractable when you have both determinism and transparency.
The Tier 3 ghetto isn’t permanent. But escaping it requires solving real technical and governance problems—not just waiting for regulators to get comfortable.
Agent Router Enterprise helps teams graduate AI agents from prototype to production with centralized LLM routing, AI Guardrails for consistent policy enforcement, and continuous supervision through behavioral metrics. When you’re ready to move beyond Tier 3, the infrastructure matters. Learn more here ›