Is Your AI Spend Producing Anything? A Manager's Guide
Your AI budget doubled. Someone is going to ask you whether it's producing anything. Most managers don't have a great answer. Here are the two levers you actually have, and a three-question Monday-morning diagnostic.
Your AI budget has roughly doubled in the last six months. Your CFO has noticed. Sometime in the next quarter, someone is going to ask you whether that spend is producing anything. If your honest answer is “I think so?” — you have company. Most engineering managers don’t have a better one yet.
Key takeaways
- AI ROI decomposes into two levers: efficiency (is each dollar buying as much capability as it should?) and effectiveness (is the capability producing output?).
- The first audit to run is tokens-per-dollar by agent. Our own range across top keys was about 20x. Some of that gap is genuine; some of it is “we picked the most expensive model six months ago and never revisited.”
- Effectiveness is best measured via variance, not absolutes. Two engineers on similar work with very different spend is the signal worth investigating.
- A weekly ten-minute diagnostic — top keys, model fit, user variance — catches the obvious problems before they reach the budget review.
The good news is that the question decomposes cleanly into two levers, and both are manageable with discipline you already apply to other engineering decisions. The bad news is that almost no cost dashboard makes either lever easy to pull.
The two levers are efficiency — is each dollar buying as much capability as it should be? — and effectiveness — is the capability actually producing output? They are different questions, they have different answers, and you need to look at them separately.
Here’s how to do that on your own team.
Lever 1: AI cost efficiency. Are you on the right model?
Almost every team running AI in production is overspending on model selection. Some workloads need the most capable model available; most don’t. The gap between those two states is where most easy savings live.
The diagnostic to run first is tokens per dollar by agent. It’s a crude proxy for “what model is this workload actually using, and how well does it cache?” — but it lights up the differences fast.
In our own data last week, across about thirty engineers running agents through Agent Router, the tokens-per-dollar range across our top API keys was about 20x. Our heaviest CI workload was getting around 3.5 million tokens per dollar — Sonnet with aggressive prompt caching. Our reasoning-heavy TARS agent was getting around 200,000 tokens per dollar. Same gateway, same product, same week, twentyfold difference in cost per token.
[SCREENSHOT: Tokens-per-dollar by agent across top keys, redacted]
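As a rough sketch, the tokens-per-dollar audit amounts to one division per key and a sort. The `KeyUsage` record, the key names, and the sample figures below are assumptions made for illustration; they are chosen to reproduce the ratios quoted above and are not an export format of any particular gateway.

```python
from dataclasses import dataclass


@dataclass
class KeyUsage:
    """One week of usage for a single API key (hypothetical export format)."""
    key: str
    tokens: int       # total tokens processed in the period
    cost_usd: float   # total billed cost for the same period


def tokens_per_dollar(usage: KeyUsage) -> float:
    """Tokens bought per dollar; a higher value means cheaper tokens."""
    if usage.cost_usd <= 0:
        raise ValueError(f"{usage.key}: cost must be positive")
    return usage.tokens / usage.cost_usd


def efficiency_report(rows: list[KeyUsage]) -> tuple[list[tuple[str, float]], float]:
    """Rank keys from cheapest to most expensive tokens and return the spread."""
    if not rows:
        raise ValueError("need at least one key")
    ranked = sorted(
        ((row.key, tokens_per_dollar(row)) for row in rows),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Spread is the ratio between the best and worst tokens-per-dollar.
    spread = ranked[0][1] / ranked[-1][1]
    return ranked, spread


# Illustrative numbers only: 3.5M and 200k tokens per dollar.
rows = [
    KeyUsage("ci-summarizer", tokens=350_000_000, cost_usd=100.0),
    KeyUsage("tars-reasoning", tokens=20_000_000, cost_usd=100.0),
]
ranked, spread = efficiency_report(rows)
for key, tpd in ranked:
    print(f"{key}: {tpd:,.0f} tokens per dollar")
print(f"spread: {spread:.1f}x")
```

The spread on its own does not say which workloads are overpaying; it only shows where to look first.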
Some of that 20x gap is genuine. TARS does reasoning work that doesn’t cache well; CI does bulk read-and-summarize that caches beautifully. But some of it isn’t. Some of it is “we picked Opus when we built this workload six months ago and nobody has gone back to ask whether Sonnet would be fine.”
We had exactly this situation with our alert-triage agent. It ran on Opus by default because the original prototype needed the reasoning headroom. When we revisited the decision against a specific quality bar — does the triage produce the right severity rating? — we found we could move to a model roughly 15x cheaper with no measurable drop in output quality. Fifteen times cheaper, same answer. The audit took an afternoon.
The practical version, for your team:
- List your top three to five workloads by spend. A large share of your AI budget, often around half, is likely to sit in the top few keys. Focus there.
- For each one, write down the quality bar it actually has to clear. “Catches all severity-1 alerts” is a quality bar. “Produces good output” is not.
- Run the workload through a cheaper model on the same inputs. Score the outputs against the quality bar.
- If the cheaper model passes, move the workload. If it doesn’t, you now know why the expensive model is justified — which is itself useful next time the CFO asks.
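The four steps above can be sketched as a small comparison harness. Everything named here is hypothetical: the `Model` and `QualityBar` callables stand in for however you invoke your models and score their outputs, and the stub models and alert strings exist only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable

Model = Callable[[str], str]              # input -> model output
QualityBar = Callable[[str, str], bool]   # (input, output) -> clears the bar?


@dataclass
class AuditResult:
    current_rate: float
    candidate_rate: float
    switch: bool


def pass_rate(model: Model, inputs: list[str], clears_bar: QualityBar) -> float:
    """Fraction of recorded inputs on which the model clears the quality bar."""
    if not inputs:
        raise ValueError("need at least one recorded input")
    passed = sum(1 for item in inputs if clears_bar(item, model(item)))
    return passed / len(inputs)


def audit(current: Model, candidate: Model, inputs: list[str],
          clears_bar: QualityBar, tolerance: float = 0.0) -> AuditResult:
    """Score both models on the same inputs; switch if the candidate keeps up."""
    current_rate = pass_rate(current, inputs, clears_bar)
    candidate_rate = pass_rate(candidate, inputs, clears_bar)
    return AuditResult(current_rate, candidate_rate,
                       candidate_rate >= current_rate - tolerance)


# Stub data: recorded alerts with the severity a human reviewer assigned.
EXPECTED = {
    "primary database unreachable": "sev1",
    "checkout error rate 40 percent": "sev1",
    "disk at 70 percent": "sev3",
}


def expensive_model(alert: str) -> str:
    return EXPECTED[alert]  # stand-in for a call to the current model


def cheap_model(alert: str) -> str:
    # Stand-in for a call to the cheaper candidate model.
    urgent = "unreachable" in alert or "error rate" in alert
    return "sev1" if urgent else "sev3"


def right_severity(alert: str, output: str) -> bool:
    return output == EXPECTED[alert]


result = audit(expensive_model, cheap_model, list(EXPECTED), right_severity)
print(result)
```

The `tolerance` parameter is a design choice: leaving it at zero means the cheaper model must match the current one on the recorded inputs, while a small positive value accepts a slight drop in exchange for the savings.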
This is not glamorous work. It’s the AI equivalent of right-sizing your cloud instances, and it has the same shape: a small effort that produces a ten-to-twentyfold cost reduction on the workloads where it pays off.
Lever 2: AI effectiveness. Is the spend producing output?
The harder question — the one most managers genuinely don’t know how to answer — is whether the AI usage their team is generating is producing proportional output.
Early in agent adoption, the metric you care about is usage. Are people trying it? Is it part of the default workflow? In that phase, a developer spending $800 on coding-agent tokens in a week is a good sign — they’re engaged.
A year later, the question flips. The same $800 has a different meaning. Is it producing five PRs and three completed features, or is it producing two PRs and a lot of “let me try that again with a different prompt”? Is the developer reviewing what the agent produces, or are they running the agent because not running it would look bad in the AI-adoption metrics their VP is tracking?
You will not get a clean answer to this. There are no perfect ROI metrics for AI-assisted engineering work. There are weak signals that, used carefully, point in the right direction:
- Cost per merged PR. Imperfect — a PR isn’t a fixed unit of output — but the variance across developers is informative.
- Cost per closed ticket. Same logic. Works better in support-style workloads than in feature work.
- Tokens per session. A developer whose sessions consistently burn 10x the tokens of their peers is either solving harder problems or stuck in a loop with their agent. Either way it’s a conversation to have.
The single most useful frame here is variance, not absolutes. Don’t try to set a target cost-per-PR; the number is too noisy and the activity too varied. Instead, look at the distribution across your team. Two engineers doing similar work, one spending 3x what the other spends — that’s worth thirty minutes of attention. It might mean one is being more ambitious. It might mean one is stuck. The cost dashboard won’t tell you which. It just has to make the variance visible enough that you can ask the question.
[SCREENSHOT: Spend variance across users on similar workload, redacted]
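One way to make that variance visible is to compare each person's spend with the median of peers doing similar work and flag anyone at or beyond roughly 3x. This is a minimal sketch under stated assumptions: the developer names and dollar figures are invented, and using the median as the baseline is a choice made here, not something the dashboard prescribes.

```python
from statistics import median


def flag_spend_outliers(spend_usd: dict[str, float],
                        threshold: float = 3.0) -> dict[str, float]:
    """Return {developer: ratio to peer median} for ratios at or above threshold.

    All developers passed in are assumed to be doing comparable work.
    """
    if not spend_usd:
        raise ValueError("need at least one developer")
    baseline = median(spend_usd.values())
    if baseline <= 0:
        raise ValueError("median spend must be positive")
    return {
        name: spend / baseline
        for name, spend in spend_usd.items()
        if spend / baseline >= threshold
    }


# Invented weekly spend for four developers on similar work.
weekly_spend = {"dev_a": 200.0, "dev_b": 250.0, "dev_c": 800.0, "dev_d": 220.0}
flagged = flag_spend_outliers(weekly_spend)
for name, ratio in flagged.items():
    print(f"{name}: {ratio:.1f}x the peer median, worth a conversation")
```

A flag is a prompt to ask a question, not a verdict; the same ratio can come from more ambitious work or from an agent stuck in a loop.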
The Monday-morning AI cost diagnostic
If you have ten minutes a week to look at your AI cost dashboard, spend it on these three questions:
- Where is most of the money going? Look at the top three to five keys. Together they probably account for most of your spend — in our case, four keys account for around 50% of last week’s total, ten keys for around 80%. If that concentration shifts week-over-week, something has changed. Find out what.
- Is the top workload on the right model? Tokens-per-dollar should sit in a band you expect for the type of work. If it’s anomalously low for the workload type, you have a model-selection question to ask.
- Is there user variance worth a conversation? Find the spread between your heaviest and lightest users on similar work. If it’s beyond about 3x, ask why. Don’t assume the answer; ask.
That’s the whole discipline. Ten minutes, three questions, every week. It will not turn AI cost management into a precise science — nothing will, yet — but it will catch the obvious problems before they show up in the quarterly budget review.
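The first of the three questions, where most of the money is going, reduces to a concentration share and its week-over-week change. The sketch below assumes you can export spend per key as a simple mapping; the key names and amounts are made up so that four keys hold half the spend in the first week.

```python
def top_share(spend_by_key: dict[str, float], n: int) -> float:
    """Fraction of total spend accounted for by the n highest-spend keys."""
    total = sum(spend_by_key.values())
    if total <= 0:
        raise ValueError("total spend must be positive")
    top = sorted(spend_by_key.values(), reverse=True)[:n]
    return sum(top) / total


def concentration_shift(last_week: dict[str, float],
                        this_week: dict[str, float], n: int = 4) -> float:
    """Change in the top-n share between two weeks, in share points."""
    return top_share(this_week, n) - top_share(last_week, n)


# Made-up data: four large keys plus ten small ones.
small_keys = {f"small-{i}": 50.0 for i in range(10)}
last_week = {"a": 200.0, "b": 150.0, "c": 100.0, "d": 50.0, **small_keys}
this_week = {**last_week, "a": 400.0}   # one key doubled its spend

share_before = top_share(last_week, 4)
shift = concentration_shift(last_week, this_week)
print(f"top-4 share last week: {share_before:.0%}")
print(f"week-over-week shift: {shift:+.1%}")
```

What counts as a shift worth investigating is left to you; the point is to look at the same number every week so that a change stands out.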
Tetrate believes the cost dashboard for AI tooling should answer manager-level questions, not just produce invoices. Tetrate Agent Router Enterprise routes traffic from many agents through a single governed boundary, captures attribution at the point of key issuance, and surfaces the efficiency and variance signals managers actually need to make decisions. It is built on the battle-hardened Envoy AI Gateway. If you’re trying to figure out whether your AI spend is producing anything for your own team, let’s talk.
Frequently asked questions
How do I measure ROI on AI spending?
There’s no clean ROI formula for AI-assisted engineering work yet — the activity is too varied and the output too qualitative. The most practical approach is to separate the question into two levers: efficiency (is each dollar buying as much capability as possible?) and effectiveness (is that capability producing output?). Efficiency is measurable via tokens-per-dollar and periodic model-selection audits. Effectiveness is best measured via proxy metrics — cost per merged PR, cost per closed ticket — interpreted as variance signals rather than absolute targets.
When is a cheaper model good enough for my use case?
A cheaper model is good enough when it meets the specific quality bar your workload actually requires. “Good enough” is workload-dependent: an alert-triage agent might be fine on a model 15x cheaper than the one it started on; a reasoning-heavy research agent might genuinely need the most capable model available. The audit is straightforward: define the quality bar, run the workload through cheaper models on the same inputs, score against the bar, and switch if the cheaper model passes. Many workloads have headroom, and the only way to know which ones is to test.
How do I tell if developers are using AI productively?
You can’t directly. What you can do is surface variance: two developers on similar work with very different spend is a signal worth investigating. Use proxy metrics like cost per merged PR or cost per closed ticket as variance detectors, not as targets. The goal is not to grade people on cost efficiency; it’s to spot the cases where someone is burning tokens without making progress and to have a conversation before it becomes a pattern.
What metrics should I track for AI agent ROI?
At a minimum: total spend by agent (efficiency anchor), tokens-per-dollar by agent (model-selection signal), spend by user on similar workloads (variance / productivity signal), and at least one workload-output proxy (cost per PR for coding agents, cost per ticket for support agents, etc.). The exact metrics matter less than the discipline of looking at them weekly and asking the same three questions every time.
How often should we re-audit our model selection?
Quarterly is a reasonable cadence for high-spend workloads, plus an extra pass whenever a new model in the relevant family is released. The cost of running the audit is low (an afternoon of evaluating outputs against a quality bar), and the savings can be large on workloads where a cheaper model passes. Don’t treat the original model choice as load-bearing; it was a best guess based on what was available when the workload was built.