Two Ways to Build a Cost Agent (And Why We Use Both)

We built two fundamentally different architectures for our cost optimization agents. One lets the LLM drive. The other relegates it to a single call. Both have their place.

In the first post in this series, we introduced the cost optimization agents we built to scan our AWS, GCP, and Azure environments for waste. In the second, we covered the humbling gap analysis that showed our agent was catching 2.4% of available savings. This post is about the architectural question that came next: how much should the LLM actually control?

We ended up with two fundamentally different patterns across our agents, and the journey between them taught us more about agent architecture than any conference talk ever could.


Pattern A: Let the LLM Drive

Our AWS agent is what most people picture when they think “AI agent.” It’s a Pydantic AI agent with about 16 tools, a system prompt that describes the cost optimization task, and a simple instruction: analyze these accounts, find waste, save findings.

The LLM decides which accounts to check first. It calls the tool to pull billing data, looks at the spending breakdown, decides which resource categories to investigate, calls the appropriate analysis tools, interprets the results, and creates findings. If it encounters an account it can’t access (role assumption failed), it notes the error and moves on. The entire workflow is emergent from the LLM’s reasoning about the tools available to it.

This works. The agent genuinely does reason about where to look. When it sees that 60% of an account’s spend is in “EC2-Other,” it will investigate NAT gateways and data transfer costs. When it finds a stopped RDS instance in a development account, it considers the account context (human-provided labels about each account’s purpose) before deciding whether that’s a real finding or expected behavior. The judgment calls are good.

But the pattern has three significant problems.

It’s expensive per run. Analyzing 20+ AWS accounts means 150+ tool calls, each one requiring the LLM to process the full conversation history. The context window fills up with tool call results, and by the end of a long run the model is spending most of its input tokens re-reading the findings from accounts it analyzed an hour ago. Every new tool call costs more than the last one.
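Back-of-envelope, the growth is quadratic: if each tool result appends roughly a fixed number of tokens and the full history is re-sent on every call, total input tokens over a run look like this (the 2,000-token prompt and 500-token result figures are illustrative assumptions, not measurements):

```python
def total_input_tokens(n_calls: int, tokens_per_result: int = 500,
                       base_prompt: int = 2000) -> int:
    """Total input tokens across a run in which each call re-reads the
    whole conversation and each result appends tokens_per_result."""
    total, history = 0, base_prompt
    for _ in range(n_calls):
        total += history          # this call re-reads everything so far
        history += tokens_per_result
    return total

# At 150 calls the run re-reads ~5.9M input tokens in total,
# versus ~375K if each call saw only the prompt plus one fresh result.
```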

It’s inconsistent. Run the agent twice on the same environment and you’ll get slightly different coverage. The LLM might prioritize different accounts, investigate different resource types, or decide to skip an area it covered thoroughly last time. For a weekly automated scan, that inconsistency matters. You want to know that every account gets the same thorough analysis every time.

It doesn’t scale linearly. Adding more accounts doesn’t just increase runtime proportionally. The growing context window means each additional account is more expensive than the last. And there’s a hard ceiling: eventually the conversation hits the model’s context limit and the agent has to stop, whether or not it’s finished. We set a request limit of 150 tool calls, which works for 20-25 accounts, but is arbitrary and fragile.

Pattern B: Discovery First, Then Assess

When we built the GCP agent, we made a deliberate choice to do things differently. Instead of handing the LLM the keys and saying “figure it out,” we split the work into two distinct phases.

Phase 1 is pure Python. No LLM involvement whatsoever. A GCPDiscoveryEngine enumerates the top-spending projects using billing data, then launches parallel threads (via ThreadPoolExecutor with 10 workers) to analyze each project simultaneously. Each thread checks for idle VMs, stopped instances, unattached disks, unused IPs, idle Cloud SQL databases, idle NAT gateways, idle or overprovisioned GKE clusters, and buckets without lifecycle policies.

The output of Phase 1 is a list of RawFinding objects: simple dataclasses with a finding type, resource ID, estimated cost, and a few details. No natural language, no severity assessments, no recommendations. Just structured data about what was discovered.
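Put together, Phase 1 is little more than a dataclass and a thread pool. A condensed sketch under stated assumptions (the field names and the two check functions are illustrative stand-ins; the real engine runs the full list of checks above against the GCP APIs):

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

@dataclass
class RawFinding:
    finding_type: str              # e.g. "unattached_disk"
    resource_id: str
    estimated_monthly_cost: float
    details: dict = field(default_factory=dict)

def check_unattached_disks(project: str) -> list[RawFinding]:
    # Real code calls the GCP Compute API; stubbed here.
    return [RawFinding("unattached_disk", f"{project}/disk-1", 12.0)]

def check_idle_vms(project: str) -> list[RawFinding]:
    return []  # stub: no idle VMs found

CHECKS = [check_unattached_disks, check_idle_vms]

def analyze_project(project: str) -> list[RawFinding]:
    findings: list[RawFinding] = []
    for check in CHECKS:
        findings.extend(check(project))
    return findings

def discover(projects: list[str]) -> list[RawFinding]:
    # The checks are I/O-bound API calls, so threads give
    # near-linear speedup across projects.
    with ThreadPoolExecutor(max_workers=10) as pool:
        per_project = list(pool.map(analyze_project, projects))
    return [f for findings in per_project for f in findings]
```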

Phase 2 is a single LLM call. All the raw findings from Phase 1 are batched together and sent to the LLM in one request. The model’s job is assessment: assign severity, validate that the finding makes sense, provide recommendations, and prioritize. If the LLM call fails for any reason, there’s a fallback path that converts raw findings directly into cost findings using deterministic rules.
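The assessment phase is then one batched call with a deterministic safety net. A hedged sketch: `assess_with_llm` stands in for the real model request (here it raises, to exercise the fallback), and the cost-based severity rule in `fallback_assess` is illustrative, not the production rule:

```python
import json
from dataclasses import dataclass

@dataclass
class RawFinding:  # minimal stand-in for the Phase 1 dataclass
    finding_type: str
    resource_id: str
    estimated_monthly_cost: float

def assess_with_llm(raw_findings: list[RawFinding]) -> list[dict]:
    """Stand-in for the single batched LLM request. The real call sends
    all findings in one prompt and asks for severity, validation, and
    recommendations."""
    payload = json.dumps([f.__dict__ for f in raw_findings])
    raise RuntimeError("LLM unavailable")  # force the fallback in this sketch

def fallback_assess(raw_findings: list[RawFinding]) -> list[dict]:
    # Deterministic rules: no prose polish, but usable findings.
    return [
        {
            "resource_id": f.resource_id,
            "severity": "high" if f.estimated_monthly_cost >= 100 else "low",
            "recommendation": f"Review {f.finding_type}",
        }
        for f in raw_findings
    ]

def assess(raw_findings: list[RawFinding]) -> list[dict]:
    try:
        return assess_with_llm(raw_findings)  # one call for the whole batch
    except Exception:
        return fallback_assess(raw_findings)
```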

The difference in practice is significant.

Phase 1 can scan 15 projects in parallel in about the time it takes to scan one project sequentially. Adding more projects is nearly free because the threads are I/O-bound on GCP API calls, not token-bound on LLM context. Discovery across the entire GCP organization takes a few minutes regardless of how many projects exist.

Phase 2 costs one LLM call. Not 150. One. The entire set of findings, often dozens or hundreds of items, gets assessed in a single pass. If the LLM flakes out, the fallback produces perfectly usable findings without any natural language polish.

Why Not Just Use Pattern B Everywhere?

If two-phase is cheaper, faster, and more predictable, why do we still use Pattern A for AWS?

Because Pattern A is genuinely better at analysis that requires judgment.

Consider a NAT gateway data transfer analysis. The question isn’t “is this NAT gateway idle?” (a binary check that Pattern B handles easily). The question is: “Is the traffic flowing through this gateway mostly S3-bound? Would a VPC Gateway Endpoint eliminate most of the cost? What does the traffic pattern look like, and is this a steady baseline or a spike from a batch job?”

That analysis requires pulling multiple data sources, reasoning about them in context, and making a judgment call that depends on account-level knowledge. The LLM is good at this. Pattern B’s deterministic discovery can flag “this NAT gateway processes a lot of traffic” but can’t reason about whether the traffic is avoidable.

Similarly, the AWS agent’s ability to read account context and adjust its analysis accordingly is something that emerges naturally from LLM-orchestrated tool calling. The agent knows that a development account’s idle resources are probably fine to terminate, while a production account’s quiet-looking instance might be a standby for failover. This contextual reasoning is hard to encode in deterministic rules.

The Honest Trade-off Matrix

After running both patterns in production for several months, the trade-offs are clear:

Pattern A (LLM-Orchestrated) wins when:

  • The analysis requires contextual reasoning across multiple data sources
  • Account-level context (purpose, team, criticality) changes what counts as a finding
  • The problem space is open-ended enough that you can’t enumerate all the checks in advance
  • You’re building the first version and still discovering what the agent should look for

Pattern B (Two-Phase) wins when:

  • Discovery is enumerable: you know what resource types to check and what “idle” means for each
  • The environment is large enough that per-item LLM costs add up (dozens of projects or accounts)
  • You need consistent, reproducible coverage across every run
  • The assessment step is relatively mechanical once you have the raw data

The key insight is that these patterns aren’t competing philosophies. They’re tools for different parts of the problem. We use Pattern A where the LLM’s judgment is the value, and Pattern B where the LLM’s judgment is a nice-to-have on top of a fundamentally enumerable task.

The Evolution Was the Learning

Looking back, the most valuable thing about having both patterns isn’t the cost savings or the performance improvement of Pattern B. It’s that building Pattern B forced us to articulate exactly what the LLM was doing in Pattern A.

When you hand an LLM a set of tools and say “find waste,” it’s easy to fool yourself into thinking the intelligence is in the LLM. But when you build Pattern B and discover that 80% of what the LLM was doing in Pattern A can be done with a for loop and some API calls, you get a much sharper picture of where the LLM actually adds value.

The LLM isn’t adding value when it calls list_accounts(). It’s adding value when it decides that the NAT gateway in the production account warrants deeper investigation based on traffic patterns, or when it correlates the account context (“this is a shared services account”) with the resource analysis (“these EC2 instances are used by other teams”) to avoid a false positive.

Separating the mechanical work from the judgment work made both halves better. Pattern B’s discovery engine finds things the LLM would have skipped or deprioritized in Pattern A. Pattern A’s contextual reasoning produces higher-quality findings than Pattern B’s deterministic assessment ever could.

What This Means for Your Agent

If you’re building an agent that operates at scale, ask yourself two questions:

First: can I enumerate the discovery work? If you can write down all the things to check and define what “interesting” means for each one, that’s a candidate for Phase 1 deterministic discovery. Let the LLM assess and prioritize, not discover.

Second: where does the LLM’s judgment actually change the outcome? Not where it’s involved, but where it’s making decisions that you couldn’t hard-code. That’s where Pattern A’s flexibility earns its cost. Everywhere else, you’re paying for a very expensive for loop.

We’ll cover more on this theme in a later post about when to remove the LLM entirely. But first, the next post in the series covers the reliability mechanisms that both patterns depend on: auto-save, stable IDs, and the context window problem.


Agent Router Enterprise provides the infrastructure layer that makes multi-pattern architectures practical: centralized LLM routing so both patterns use the same gateway, per-agent cost attribution so you can actually measure Pattern A vs Pattern B costs, and continuous supervision to ensure your agents maintain quality as your environment changes. Learn more here ›
