We Built an AI Agent to Cut Our Cloud Bill in Half
Our cloud bill was attracting board-level attention. Instead of hiring a FinOps team, we built AI agents that scan AWS, GCP, and Azure weekly. Here's what we learned.
Our cloud bill had grown to a point where it was attracting board-level attention. The target: cut it in half within a year. That’s not a vague aspiration — it’s a line item in a spreadsheet that someone is accountable for.
We didn’t hire a FinOps team. We built an AI agent.
This is the first in a series of posts about what we learned building cost optimization agents that operate across AWS, GCP, and Azure. Not the polished conference version — the actual messy reality of building agents that need to find real money in production infrastructure.
The Problem
Cloud cost optimization sounds straightforward until you try to do it at scale. We have dozens of AWS accounts, multiple GCP projects, and Azure subscriptions. Resources spin up for demos, experiments, and customer engagements, and some of them never spin down. People move teams. Projects get deprioritized. The infrastructure stays.
The manual approach — logging into each cloud console, checking utilization dashboards, filing tickets to delete things — doesn’t work when you have this many accounts. An engineer might spend a day auditing one account, find a few hundred dollars in savings, and then never do it again because they have actual work to do.
We needed something that could scan every account, every week, automatically. Something that could reason about whether a resource was actually idle versus just quiet over the weekend. Something that could prioritize findings by impact and track whether humans had already reviewed and dismissed them.
So we built agents.
What We Built
The system is three cloud-specific agents — one each for AWS, GCP, and Azure — plus a shared dashboard and persistence layer. Each agent runs weekly on a schedule, scans its cloud environment, and produces findings: specific resources that are idle, over-provisioned, or misconfigured, with severity ratings and estimated monthly savings.
The agents run on Modal (a serverless compute platform), persist findings to Firestore, and route all LLM calls through a centralized gateway that handles model selection, cost tracking, and observability.
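To make that concrete, here is a minimal sketch of how one of these weekly runs could be wired up: a Modal function on a cron schedule that writes findings to Firestore. The app name, schedule, secret name, and the placeholder agent loop are illustrative, not our production code.

```python
# A minimal sketch (hypothetical names, not our production code) of one agent's
# weekly run: a Modal function on a cron schedule that writes findings to Firestore.
import modal

app = modal.App("aws-cost-agent")  # placeholder app name

image = modal.Image.debian_slim().pip_install("boto3", "google-cloud-firestore")


@app.function(
    image=image,
    schedule=modal.Cron("0 6 * * 1"),  # Mondays at 06:00 UTC
    secrets=[modal.Secret.from_name("gcp-credentials")],  # placeholder secret name
    timeout=3600,
)
def weekly_aws_scan() -> None:
    from google.cloud import firestore  # installed in the image above

    db = firestore.Client()

    # Stand-in for the real agent loop, which calls the LLM gateway and the AWS tools.
    def run_agent() -> list[dict]:
        return []

    for finding in run_agent():
        # Upsert by a stable document ID so weekly runs update findings
        # instead of duplicating them.
        db.collection("findings").document(finding["id"]).set(finding, merge=True)
```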
For AWS, the agent has about 16 tools: it can list accounts, pull billing data, analyze EC2 utilization for rightsizing, and check for unattached EBS volumes, idle RDS databases, unused Elastic IPs, idle NAT gateways (including data transfer analysis), load balancers with no healthy targets, orphaned ENIs, and over-provisioned EKS clusters. It also reads human-provided context about each account — what it’s for, which team owns it, what’s expected to be running there — so it doesn’t flag intentional infrastructure as waste.
GCP and Azure have equivalent capabilities, adapted for each cloud’s APIs and resource types.
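To give a sense of what a single tool looks like, here is a rough sketch of the unattached-EBS-volume check using boto3. The price constant and recommendation text are illustrative placeholders rather than what our tool actually uses.

```python
# Rough sketch of one AWS tool: flag unattached ("available") EBS volumes.
# The per-GB price is a placeholder; real pricing depends on region and volume type.
import boto3

PRICE_PER_GB_MONTH = 0.08  # illustrative gp3 estimate only


def find_unattached_ebs_volumes(region: str = "us-east-1") -> list[dict]:
    ec2 = boto3.client("ec2", region_name=region)
    findings = []
    paginator = ec2.get_paginator("describe_volumes")
    # Volumes in the "available" state are not attached to any instance.
    pages = paginator.paginate(Filters=[{"Name": "status", "Values": ["available"]}])
    for page in pages:
        for volume in page["Volumes"]:
            findings.append({
                "resource": volume["VolumeId"],
                "size_gb": volume["Size"],
                "estimated_monthly_savings": round(volume["Size"] * PRICE_PER_GB_MONTH, 2),
                "recommendation": "Snapshot and delete if no longer needed",
            })
    return findings
```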
The Four Design Challenges That Shaped Everything
Building the agents was the easy part. Making them reliable, accurate, and useful enough that people actually trusted the output — that was harder. Four problems dominated our thinking:
1. How much should the LLM control?
Our first instinct was to let the LLM orchestrate everything. Give it tools, give it a system prompt, let it figure out which accounts to check, which resources to analyze, and what to report. This worked, but it had problems: the LLM would sometimes skip accounts, over-focus on one area, or make inconsistent prioritization decisions across runs.
We ended up with two different architectural patterns across our agents, and the tension between them taught us a lot about where LLMs add value versus where deterministic code is more reliable. (We’ll cover this in a future post.)
2. Context windows are finite, but cloud environments aren’t
An agent analyzing 20+ AWS accounts generates a lot of data. Tool call results pile up. The context window fills. We needed findings to survive even if the agent hit its token limit or timed out mid-run.
This led us to an auto-save pattern where findings are persisted to the database as they’re discovered, not collected and saved at the end. It also meant using stable, deterministic IDs for findings (based on cloud provider + account + resource) so that the same finding doesn’t get duplicated across weekly runs, and human decisions to dismiss a finding are preserved. These small design choices turned out to be critical for making agents work in production. (More on this in a future post.)
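Here is a minimal sketch of that idea, with hypothetical field names: a deterministic ID derived from provider, account, and resource, plus an upsert that respects a prior dismissal.

```python
# Sketch of stable finding IDs plus save-as-you-go persistence.
# Field names and the hashing scheme are illustrative, not our exact schema.
import hashlib

from google.cloud import firestore


def finding_id(provider: str, account: str, resource: str) -> str:
    # Deterministic ID: the same resource maps to the same document every week,
    # so a re-run updates the existing finding instead of creating a duplicate.
    return hashlib.sha256(f"{provider}:{account}:{resource}".encode()).hexdigest()[:20]


def save_finding(db: firestore.Client, finding: dict) -> None:
    doc_id = finding_id(finding["provider"], finding["account"], finding["resource"])
    ref = db.collection("findings").document(doc_id)
    existing = ref.get()
    # Respect human decisions: a dismissed finding stays dismissed across runs.
    if existing.exists and existing.to_dict().get("status") == "dismissed":
        return
    # merge=True persists each finding as soon as it's discovered, so a timeout
    # or a full context window mid-run doesn't lose what's already been found.
    ref.set(finding, merge=True)
```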
3. Not everything needs an LLM
We built another agent for compliance monitoring that syncs with a governance platform. The first version used an LLM to orchestrate the sync. Then we built a direct sync path that bypasses the LLM entirely — and it runs faster, costs less, and produces identical results.
The lesson: LLMs are great at judgment calls (is this resource really idle, or just quiet?), prioritization (which findings matter most?), and natural language (writing recommendations). They’re wasteful for data fetching, transformation, and CRUD operations. Knowing when to remove the AI from your AI agent is an underrated skill. (Future post.)
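A simplified way to picture that split, with hypothetical names and thresholds: deterministic code handles the clear-cut cases, and only the ambiguous middle is escalated to the model.

```python
# Sketch of the split: deterministic code for clear-cut cases, the LLM only for
# judgment calls. Thresholds and the ask_llm callable are illustrative stand-ins.
from typing import Callable

IDLE_CPU_PCT = 2.0     # below this (and no traffic): idle, no model call needed
ACTIVE_CPU_PCT = 20.0  # above this: clearly active, no model call needed


def classify_instance(instance: dict, ask_llm: Callable[[str, dict], str]) -> str:
    cpu = instance["avg_cpu_14d"]
    if cpu >= ACTIVE_CPU_PCT:
        return "active"
    if cpu < IDLE_CPU_PCT and instance.get("network_bytes_14d", 0) == 0:
        return "idle"
    # Only the ambiguous middle (quiet, but maybe just over a weekend) is
    # escalated to the model, along with the human-provided account context.
    return ask_llm(
        "Is this instance idle or just temporarily quiet? Answer 'idle' or 'active'.",
        instance,
    )
```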
4. Where should intelligence live?
Some capabilities belong in the agent: domain logic, tool orchestration, context management. Others belong in the infrastructure layer: LLM routing, API key management, cost tracking, rate limiting, PII detection. We learned this the hard way by initially building capabilities in the wrong layer and then migrating them.
We now route all LLM calls through a centralized gateway with per-agent API keys, which gives us cost attribution across agents without any agent-level code for tracking spend. The agent just calls the model; the infrastructure handles the rest. (We’ll cover the full agent-vs-middleware framework in a future post.)
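From the agent's side, that looks roughly like the sketch below: a standard OpenAI-compatible client pointed at the gateway with a per-agent key. The base URL, environment variable names, and model alias are placeholders.

```python
# Sketch of an agent calling a model through the central gateway. The URL, env
# var names, and model alias are placeholders; cost attribution comes from the
# per-agent API key, so the agent carries no spend-tracking code of its own.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ["LLM_GATEWAY_URL"],        # e.g. an internal gateway endpoint
    api_key=os.environ["AWS_COST_AGENT_API_KEY"],  # per-agent key issued by the gateway
)

response = client.chat.completions.create(
    model="cost-analysis",  # a gateway-side alias; routing and model choice live in the gateway
    messages=[{"role": "user", "content": "Summarize this week's top cost findings."}],
)
print(response.choices[0].message.content)
```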
The Humbling Part
After building all of this, we ran the agent, reviewed the findings, and celebrated. It was finding idle resources, flagging waste, generating actionable recommendations.
Then we did the math.
The agent was identifying a few thousand dollars per month in savings. Against a six-figure monthly AWS bill, that’s a 2.4% catch rate.
The agent worked. It just wasn’t working hard enough. The gap between “agent runs successfully” and “agent finds meaningful savings” turned out to be enormous. What followed was a systematic gap analysis that reshaped the entire system — which is the subject of the next post.
What’s Coming
This series will cover the design decisions, architectural trade-offs, and lessons learned from building these agents. Each post focuses on one specific challenge and how we solved it (or didn’t):
- Next up: The gap analysis — what the agent was missing and what we changed
- Two architectural patterns: LLM-orchestrated vs. two-phase discovery, and why we use both
- Reliability in production: Auto-save, stable IDs, and respecting human decisions
- Not everything needs an LLM: When to remove AI from your AI agent
- Agent vs. middleware: A framework for deciding where intelligence should live
If you’re building agents for operational automation — cost optimization, compliance, security, infrastructure management — the problems we hit are the same ones you’ll hit. Hopefully our mistakes save you some time.
Agent Router Enterprise provides the infrastructure layer we use to manage these agents in production: centralized LLM routing with per-agent cost attribution through the LLM Gateway, governed tool connectivity through the MCP Gateway, and continuous supervision through AI Guardrails. When your agent portfolio grows beyond one or two experiments, the infrastructure matters. Learn more here ›