Not Everything Needs an LLM: When to Remove the AI from Your AI Agent
We built an agent to sync compliance data. Then we built a version without the LLM that runs faster, costs less, and produces identical results. Knowing when to remove the AI is an underrated skill.
One of our agents syncs compliance findings from a governance platform into our dashboard. The first version used an LLM to orchestrate the sync: fetch the failing monitors, expand them into individual resource findings, save each one to the database. The agent worked well, and the LLM did a perfectly competent job of calling tools in the right order.
Then we built a second version that does the exact same thing without the LLM. Direct API calls, a loop, and some status-preservation logic. It runs faster, costs nothing in inference, and produces identical results.
The LLM version still exists. We use it for complex operations where the agent needs to reason about what it’s seeing. But for the daily sync, the straight-through path is better in every measurable dimension. This experience reshaped how we think about where LLMs belong in agent architectures.
The Compliance Sync: Two Implementations
The task is straightforward. A compliance platform tracks whether our infrastructure meets various security standards: are VPC flow logs enabled, are storage buckets encrypted, are firewall rules properly restrictive, and so on. Failing monitors produce findings. We want those findings in our dashboard.
Version 1 (LLM-orchestrated): A Pydantic AI agent with tools. The LLM receives a prompt saying “sync findings from the compliance platform to the database.” It calls a tool to fetch all failing monitors, calls another tool to expand monitors into individual resource findings, and calls a save tool for each finding. The LLM decides the order, handles edge cases, and produces a summary at the end.
Version 2 (direct sync): No LLM. The code calls the compliance platform API, iterates over the results, and writes each finding to Firestore. Status preservation (don’t overwrite a finding that’s been acknowledged, in progress, resolved, or dismissed) is handled by checking the existing record before writing.
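The direct path can be sketched in a few lines. This is a minimal, hypothetical version using an in-memory dict in place of Firestore; the field names (`id`, `status`) and the set of protected statuses are assumptions for illustration, not the actual schema:

```python
# Statuses set by a human reviewer; the sync must never overwrite these.
PROTECTED_STATUSES = {"acknowledged", "in_progress", "resolved", "dismissed"}

def sync_findings(platform_findings: list[dict], store: dict[str, dict]) -> int:
    """Write each finding to the store, preserving human-triaged statuses.

    Returns the number of findings written.
    """
    written = 0
    for finding in platform_findings:
        key = finding["id"]
        existing = store.get(key)
        if existing and existing.get("status") in PROTECTED_STATUSES:
            continue  # a human already triaged this finding; don't clobber it
        store[key] = {**finding, "status": finding.get("status", "open")}
        written += 1
    return written
```

The same check-before-write pattern applies with a real Firestore client; the only difference is that `store.get` becomes a document read.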
Version 2 runs in a fraction of the time. It uses zero inference tokens. The results are byte-for-byte identical to what the LLM version produces, because the LLM wasn’t actually making any decisions. It was calling tools in a fixed sequence, which is what a for loop does.
Where the LLM Was Wasting Its Talent
When we looked closely at what the LLM was doing in the compliance sync, the answer was: orchestrating a deterministic workflow. Fetch data, transform it, save it. There were no judgment calls, no ambiguity, no cases where the LLM’s reasoning led to a different outcome than the deterministic path would have. The LLM was an expensive, slower, probabilistic replacement for sequential function calls.
This pattern is more common than people want to admit. It’s tempting to build everything as an “agent” because agents feel sophisticated and LLMs are genuinely impressive. But running a deterministic task through an LLM is like hiring a chess grandmaster to sort your mail. They’ll do it, and they’ll probably do it correctly, but you’re paying for capabilities you’re not using.
The question to ask is not “can the LLM do this?” (the answer is almost always yes) but “does the LLM’s judgment change the outcome?” If the answer is no, you don’t need an LLM.
Where the LLM Actually Earns Its Keep
Removing the LLM from the compliance sync didn’t mean removing it from the compliance agent entirely. There are tasks in the same domain where the LLM genuinely adds value:
Remediation reasoning. When the agent generates remediation commands for failing compliance checks, it needs to consider the specific infrastructure context (GCP vs AWS, the resource type, the current configuration) and produce commands with appropriate warnings about risks. A recommendation to restrict SSH firewall rules needs to account for whether IAP is set up, whether there are bastion hosts, and what the blast radius of a misconfiguration would be. This requires judgment, not just template filling.
Alert triage. We run a separate agent that triages vulnerability alerts from a security scanner. The triage decision is genuinely one that benefits from LLM reasoning: a vulnerability rated “Critical” by the scanner might actually be low risk if it’s on an internal-only resource with no network exposure. This requires correlating the vulnerability data with infrastructure context and making a judgment call about actual exploitability. Our benchmarks showed meaningful differences between models here. One model was fast but tended toward extremes (everything was either informational or critical). A more capable model produced nuanced assessments that correctly accounted for exposure and context, at roughly 15 times the per-alert cost.
Interactive investigation. We built a chatbot that lets engineers drill into cost data through natural language. Users ask questions like “what’s driving costs in this GCP project?” and the chatbot queries billing data, checks resource utilization, and synthesizes an answer. This is a genuine LLM use case: the questions are open-ended, the analysis requires combining multiple data sources, and the natural language response is the whole point.
The Decision Framework
After building agents across cost optimization, compliance, and security triage, a pattern emerged for when to use (and not use) LLMs:
Remove the LLM when:
- The task is a fixed sequence of API calls with no branching logic
- The “decisions” the LLM makes would be identical to what a deterministic code path would produce
- The data transformation is structural (map fields, filter records, format output)
- You need the task to run on a tight schedule with predictable costs
- Consistency between runs matters more than flexibility within a run
Keep the LLM when:
- The task requires correlating information across multiple sources to reach a judgment
- Context changes the right answer (the same finding might be critical in production and ignorable in dev)
- The output is natural language that needs to be coherent and contextually appropriate
- The problem space is open-ended enough that you can’t enumerate all the cases
- You’re still discovering what the agent should do (the LLM as exploration tool during development)
The last point is worth emphasizing. LLMs are excellent for prototyping workflows. Build the first version as an LLM-orchestrated agent, watch what it does, learn from the patterns, and then harden the deterministic parts into code. The compliance sync started as an LLM agent partly because we didn’t know exactly what the sync logic needed to look like until we watched the LLM figure it out. Once the pattern was clear, we extracted it into deterministic code.
The Meta-Question: What Does Your Agent Cost to Run?
There’s a practical dimension to this that’s easy to overlook when you’re building: inference costs add up.
We track LLM costs per agent through dedicated API keys routed through a centralized gateway. Each agent has its own key, so we can see exactly how much the cost agent spends versus the compliance agent versus the security triage agent. This per-agent attribution made the compliance sync decision obvious. The LLM version had a measurable per-run cost for zero additional value over the direct path.
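The attribution itself is simple once each agent has its own key. A rough sketch, assuming hypothetical gateway usage records with `api_key` and `cost_usd` fields (your gateway's export format will differ):

```python
from collections import defaultdict

def cost_per_agent(usage_records: list[dict],
                   key_to_agent: dict[str, str]) -> dict[str, float]:
    """Roll up gateway usage records into a per-agent inference spend."""
    totals: dict[str, float] = defaultdict(float)
    for record in usage_records:
        # Keys not in the mapping surface as "unattributed" -- a useful
        # signal that an agent is calling the gateway with a shared key.
        agent = key_to_agent.get(record["api_key"], "unattributed")
        totals[agent] += record["cost_usd"]
    return dict(totals)
```

A report like this, run per day, is what makes the "zero additional value" cases visible.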
For the interactive chatbot, the cost calculation is different. Every user question triggers at least one LLM inference call, often with multiple tool calls. The chatbot is the most expensive component per interaction. But the value proposition is also clear: it lets engineers investigate cost patterns in natural language instead of writing BigQuery queries, and it can correlate billing data with live infrastructure checks in a way that no dashboard can.
The interesting question is whether the chatbot’s investigations actually lead to savings that exceed its operating cost. We’re building agents to reduce cloud spend. The chatbot exists to help humans find and act on savings. If the chatbot’s inference costs grow faster than the savings it helps identify, you’ve created a system where the cost of finding the cost savings is larger than the cost savings. That’s a failure mode worth watching for, and it’s the kind of thing that per-agent cost attribution makes visible.
Building the Hybrid
The architecture we’ve converged on is hybrid by default. Each agent has a direct path for deterministic operations and an LLM path for reasoning-heavy operations. The compliance agent uses direct sync for the daily data pull and the LLM agent for remediation analysis. The cost agent uses deterministic discovery for GCP (as covered in a previous post) and LLM-orchestrated analysis for AWS, where the contextual reasoning adds value.
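One way to structure the hybrid is a routing shim: deterministic operations register a direct handler, and anything unregistered falls through to the LLM path. This is an illustrative sketch, not our actual dispatcher; the operation names and the `llm_run` callable are placeholders:

```python
from typing import Callable

# Registry of operations that never need an LLM.
DETERMINISTIC_OPS: dict[str, Callable[[dict], dict]] = {}

def deterministic(name: str):
    """Register a function as the direct (no-LLM) handler for an operation."""
    def register(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        DETERMINISTIC_OPS[name] = fn
        return fn
    return register

@deterministic("daily_sync")
def daily_sync(params: dict) -> dict:
    # Fixed sequence: fetch, transform, save. Placeholder body here.
    return {"op": "daily_sync", "synced": len(params.get("findings", []))}

def run_operation(name: str, params: dict,
                  llm_run: Callable[[str, dict], dict]) -> dict:
    if name in DETERMINISTIC_OPS:
        return DETERMINISTIC_OPS[name](params)  # zero inference tokens
    return llm_run(name, params)  # reasoning-heavy: hand to the agent
```

The registry makes the decision explicit in code review: moving an operation from the LLM path to the direct path is a one-decorator change, and the diff forces the "would the outcome differ?" conversation.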
This hybrid approach has a nice property: it makes you articulate, for each part of each workflow, whether the LLM is earning its cost. Not “is the LLM involved” but “would the outcome be different without it.” If the answer is no, extract it. If the answer is yes, keep it, and make sure you’re measuring whether that judgment is actually good.
The best agents, in our experience, use less LLM than you’d expect. The LLM handles the parts that genuinely require reasoning, and everything else is just code. The skill isn’t in knowing how to add AI to a workflow. It’s in knowing when to take it out.
Agent Router Enterprise helps teams build hybrid agent architectures with confidence. Per-agent API keys provide cost attribution so you can measure what each agent actually spends on inference. Behavioral metrics compare LLM-driven and direct paths objectively. And the LLM Gateway ensures that when you do use inference, it’s routed, monitored, and governed from the infrastructure layer. Learn more here ›