Making Agents Reliable: Auto-Save, Stable IDs, and the Context Window Problem
When your agent crashes at tool call 142 out of 150, you'd better hope the first 141 findings aren't lost. Here are the patterns that made our cost agents production-ready.
There’s a specific kind of frustration reserved for watching an agent run for 45 minutes, analyze 18 AWS accounts, discover dozens of cost optimization opportunities, and then crash on account 19 because it hit the context window limit. All the findings from the first 18 accounts? Gone. Stored in the conversation history that just got truncated.
This happened to us more than once before we got serious about reliability.
Most agent tutorials focus on the exciting parts: tool design, prompt engineering, choosing the right model. The boring-but-critical parts, like what happens when the agent fails mid-run, how to prevent duplicate findings across weekly scans, and how to respect human decisions made between runs, rarely get mentioned. These are the patterns that actually determine whether your agent is a demo or a production system.
The Context Window Is a Ticking Clock
Our AWS cost agent analyzes 20+ accounts, each requiring multiple tool calls. List the accounts, pull billing data, get EC2 utilization, check EBS volumes, analyze NAT gateways, look at load balancers. For a thorough analysis, each account might generate 6-8 tool calls, each returning structured data about resources, costs, and utilization metrics.
The problem is that every tool call result stays in the conversation. The LLM needs this context to make informed decisions about later accounts (it might notice patterns across accounts, or learn from access errors to adjust its approach). But the context window has a hard limit, and tool call results are verbose. By account 15, most of the input tokens are being spent on findings from accounts the agent has already finished analyzing.
This turns the run into a race against the clock. The agent is racing to finish its work before the conversation gets so large that either (a) it hits the context window limit and stops, or (b) the cost of each subsequent tool call becomes absurd because the LLM is reprocessing the entire history. We set a request limit of 150 tool calls, which is a practical ceiling, not a theoretical one.
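A hard ceiling like this can be enforced with a trivial budget guard in the agent loop. This is a minimal sketch, not the production implementation; the class and method names are illustrative:

```python
class ToolCallBudget:
    """Hard ceiling on tool calls per agent run (150 in our setup)."""

    def __init__(self, limit: int = 150):
        self.limit = limit
        self.used = 0

    def try_spend(self) -> bool:
        """Consume one tool call; returns False once the budget is exhausted."""
        if self.used >= self.limit:
            return False
        self.used += 1
        return True
```

The agent loop checks `try_spend()` before each tool invocation and winds down gracefully instead of crashing when the answer is `False`.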
The fix is conceptually simple: don’t wait until the end to save your work.
Auto-Save: Findings as They’re Found
Instead of having the agent collect all findings in memory and then persist them in a batch at the end (the natural pattern when you’re writing a script), we built auto-save directly into the tool layer. Every tool that discovers a finding saves it to Firestore immediately, before returning the result to the LLM.
The implementation lives in a helper function that the analysis tools call. When the EC2 analysis tool finds an idle instance, it doesn’t just return a message saying “found idle instance, $X/month.” It creates a CostFinding object with full details, saves it to Firestore, adds it to the session’s finding list, and then returns a summary to the LLM.
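In outline, the helper looks something like this. It is a minimal sketch with an in-memory stand-in for the Firestore client; `save_finding` and the field names are illustrative, not our production schema:

```python
from dataclasses import dataclass


@dataclass
class CostFinding:
    account_id: str
    resource_id: str
    monthly_cost: float
    description: str


class InMemoryStore:
    """Stand-in for the Firestore client: save() keyed by document ID."""

    def __init__(self):
        self.docs: dict[str, dict] = {}

    def save(self, doc_id: str, data: dict) -> None:
        self.docs[doc_id] = data


def save_finding(store: InMemoryStore, session_findings: list, finding: CostFinding) -> str:
    # Side effect first: persist before returning anything to the LLM, so the
    # finding survives even if the run dies on a later tool call.
    store.save(finding.resource_id, vars(finding))
    session_findings.append(finding)
    # The LLM only sees a compact summary, not the full finding payload.
    return f"found {finding.description}, ${finding.monthly_cost:.0f}/month (saved)"
```

The key property is the ordering: the database write happens before the tool returns, so the LLM never holds the only copy of a finding.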
This means that if the agent crashes at tool call 142, the findings from tools 1 through 141 are already persisted. The agent’s session state is transient, but the findings are durable.
The pattern sounds obvious in retrospect, but it changes how you think about agent design. The tools aren’t just providing information to the LLM; they’re performing side effects (database writes) as part of their execution. The LLM doesn’t need to “decide” to save findings or “remember” to persist them at the end. Saving happens as a natural consequence of analysis.
This also decouples the agent’s success criteria from its completion. A partial run that analyzes 15 out of 20 accounts still produces 15 accounts’ worth of findings. Next week’s run picks up where this one left off (more on that below). You don’t get the summary message at the end, but you get the actual value: findings in the database.
Stable IDs: The Same Resource Should Always Be the Same Finding
Here’s a problem that doesn’t show up in demos but shows up on the first weekly run: duplicate findings.
If your agent analyzes the same AWS account two weeks in a row and finds the same idle NAT gateway both times, you need the second run to update the existing finding, not create a new one. Otherwise your dashboard fills up with duplicates, and anyone who dismissed a finding last week has to dismiss it again.
The solution is deterministic IDs. Each finding gets an ID that’s derived from its identity, not from when it was created:
```
SHA-256(cloud_provider + ":" + account_id + ":" + resource_id)[:32]
```
An idle NAT gateway in a particular AWS account always generates the same 32-character ID, regardless of when the agent runs or which model version it uses. This ID becomes the Firestore document ID, so a save() on an existing finding is an update, not a create.
The implementation uses a Pydantic model validator that fires automatically when a CostFinding is instantiated. If the ID looks like a random UUID (the default from the base class), it gets replaced with the stable hash. If the ID is already stable (e.g., loaded from Firestore), it’s left alone.
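Our implementation uses a Pydantic validator; the same logic can be sketched with a plain dataclass and `__post_init__`, which is what this illustrative version does (the UUID-detection regex and field names are assumptions for the sketch):

```python
import hashlib
import re
import uuid
from dataclasses import dataclass, field

# Matches the canonical string form of a random UUID default.
_UUID4 = re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$")


def stable_finding_id(cloud_provider: str, account_id: str, resource_id: str) -> str:
    """SHA-256 over the identity fields, truncated to 32 hex characters."""
    key = f"{cloud_provider}:{account_id}:{resource_id}"
    return hashlib.sha256(key.encode()).hexdigest()[:32]


@dataclass
class CostFinding:
    cloud_provider: str
    account_id: str
    resource_id: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def __post_init__(self):
        # A random default UUID gets replaced with the deterministic hash;
        # an id already loaded from Firestore passes through untouched.
        if _UUID4.match(self.id):
            self.id = stable_finding_id(
                self.cloud_provider, self.account_id, self.resource_id
            )
```

Two instantiations for the same resource always produce the same `id`, which is exactly what makes the Firestore write an upsert.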
This is a small detail with outsized impact. Without stable IDs, every agent run creates a fresh set of findings, and any tracking of finding lifecycle (who reviewed it, whether it was dismissed, whether it’s been resolved) gets lost. With stable IDs, findings have continuity across runs, and the database becomes a living inventory of cost optimization opportunities rather than a pile of disconnected snapshots.
Respecting Human Decisions
Stable IDs solve the duplicate problem, but they introduce a new one: what happens when a human dismisses a finding and then the agent re-discovers it?
Consider this scenario. The agent finds an idle-looking EC2 instance and saves a finding. A human reviews it and marks it as dismissed with the note “standby for DR, leave running.” Next week, the agent runs again, finds the same instance (still idle), and cheerfully overwrites the dismissed finding with a new “active” one. The human’s decision is gone.
This is surprisingly common in agent systems that interact with human workflows. The agent’s view of the world is stateless per run, but the human’s decisions accumulate between runs. If the agent doesn’t respect those accumulated decisions, people stop trusting the agent’s output and stop reviewing findings at all.
Our solution is a save_if_not_dismissed function that checks the existing status before writing:
If the finding already exists with a status of “dismissed” or “resolved,” the save is skipped entirely. The existing finding is returned with a reason of “skipped_dismissed” or “skipped_resolved” for logging.
If the finding exists with a status of “acknowledged” or “in_progress,” the finding’s data is updated (costs might have changed, utilization might be different) but the status is preserved. Someone who’s acknowledged a finding and is working on it shouldn’t have their status reset to “new.”
If the finding doesn’t exist yet, it’s created with status “new.”
The return value is a tuple of (document, reason) so we can log what happened and track how many findings are being re-discovered versus how many are genuinely new. Over time, this gives you a useful signal about whether your environment is getting cleaner (more skipped findings, fewer new ones) or whether waste is accumulating (lots of new findings every week).
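Put together, the logic reads roughly like this. It is a sketch using a plain dict in place of Firestore; the status names follow the article, everything else is illustrative:

```python
def save_if_not_dismissed(store: dict, finding: dict) -> tuple[dict, str]:
    """Save a (re-)discovered finding without clobbering human decisions.

    `store` maps stable finding IDs to documents. Returns (document, reason)
    so callers can log re-discoveries versus genuinely new findings.
    """
    existing = store.get(finding["id"])
    if existing is None:
        finding["status"] = "new"
        store[finding["id"]] = finding
        return finding, "created"
    if existing["status"] in ("dismissed", "resolved"):
        # A human closed this out: skip the write entirely.
        return existing, f"skipped_{existing['status']}"
    # Refresh the data (costs change week to week) but keep the status, so
    # an acknowledged or in-progress finding isn't reset to "new".
    finding["status"] = existing["status"]
    store[finding["id"]] = finding
    return finding, "updated"
```

Counting the `reason` values over time is what gives you the "getting cleaner versus accumulating waste" signal mentioned above.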
Access Errors Are Findings Too
Another reliability pattern we learned the hard way: when the agent can’t access an account, that’s information, not just an error to swallow.
Our AWS agent uses role chaining to access accounts. It starts with an IAM user, assumes an organization-level role, then assumes a member role in each individual account. If any step in that chain fails (role doesn’t exist, permissions insufficient, trust policy misconfigured), the agent can’t analyze that account.
Early versions just logged the error and moved on. The problem was that nobody noticed. An account would lose its cross-account role during a security review, and the agent would silently skip it for weeks until someone wondered why there were no findings for that account.
Now, access errors are tracked as first-class objects. Each failed account gets an AccountAccessError with the account ID, name, error code, and timestamp. These are returned alongside the findings in the agent’s result, displayed in the dashboard, and included in the summary. “Analyzed 18/20 accounts, 2 access errors” is a much more honest summary than “analysis complete.”
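The shape of that record is simple. The field names below follow the article; the summary helper is an illustrative assumption:

```python
from dataclasses import dataclass


@dataclass
class AccountAccessError:
    """First-class record of a failed account, returned alongside findings."""
    account_id: str
    account_name: str
    error_code: str
    timestamp: str


def run_summary(analyzed: int, total: int, errors: list[AccountAccessError]) -> str:
    # "Analyzed 18/20 accounts, 2 access errors" beats "analysis complete".
    return f"Analyzed {analyzed}/{total} accounts, {len(errors)} access errors"
```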
We also cache the failures per session. If the agent fails to assume a role into account X during billing analysis, it doesn’t try again when it gets to EC2 analysis for the same account. The failure is deterministic: if role assumption fails once, it’ll fail again, so there’s no point burning API calls on retries within the same run.
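The per-session cache is little more than a set, sketched here with hypothetical names:

```python
class SessionAccessCache:
    """Per-run memo of accounts where role assumption already failed."""

    def __init__(self) -> None:
        self._failed: set[str] = set()

    def record_failure(self, account_id: str) -> None:
        self._failed.add(account_id)

    def should_skip(self, account_id: str) -> bool:
        # Role-assumption failures are deterministic within a run: if the
        # billing step couldn't assume the role, the EC2 step won't either.
        return account_id in self._failed
```

Each analysis tool checks `should_skip()` before attempting the role chain, so one failed account costs one failed API call per run instead of one per tool.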
The Compound Effect
None of these patterns is individually complex. Auto-save is just “write to the database early.” Stable IDs are just “hash the identity fields.” Status preservation is just “check before you write.” Error tracking is just “collect failures alongside successes.”
But together they transform the agent from something that occasionally produces useful output into something you can run on a schedule and trust. The findings accumulate over time. Human decisions persist. Access problems get surfaced. Partial runs still produce value. The database becomes a reliable, evolving picture of your cost optimization landscape rather than a weekly dump of whatever the agent happened to find.
If you’re building agents that need to operate in production, start with these patterns. They’re less exciting than prompt engineering and model selection, but they’re the difference between an agent that works in a demo and one that works when nobody is watching.
Agent Router Enterprise provides the infrastructure layer for agents that need to work reliably in production. Per-agent cost attribution tracks what each run actually costs, behavioral metrics measure agent quality over time, and drift detection alerts you when your agent’s performance changes. Because reliability isn’t just about uptime, it’s about consistent, trustworthy output. Learn more here ›