Envoy AI Gateway & Tetrate Agent Router Performance Benchmarks (2026)
Envoy AI Gateway & Tetrate Agent Router Performance Benchmarks
Last updated: June 2026
What this page is
This is a living reference for the published, reproducible performance benchmarks of Envoy AI Gateway — the open-source data plane Tetrate co-created and maintains — and Tetrate Agent Router, the product built on it. Where we cite a number, we link the methodology so you can reproduce it. Where a benchmark is still in progress, we say so rather than substitute marketing claims.
We also summarize competitors’ published benchmarks, attributed to their own sources, so you can compare like for like. Our consistent recommendation: benchmark any gateway under your own traffic and policy load before deciding. Vendor headline numbers — including ours — are measured under specific conditions that may not match yours.
Benchmark 1 — MCP tool-call overhead
Question: How much latency does Envoy AI Gateway add to an MCP (Model Context Protocol) tool call?
Published result (December 2025): In a benchmark of a simple MCP “echo” tool interaction, Envoy AI Gateway added roughly 160–390ms over direct, unproxied calls to the upstream MCP server — comparable to other cloud-native solutions tested. The average difference between Envoy AI Gateway and a competing implementation was approximately 0.2ms.
Why this is negligible in practice: MCP tool calls are one step inside a larger LLM conversation. Because LLM reasoning typically takes several seconds, sub-millisecond differences in MCP routing overhead are immaterial to end-to-end user experience.
Methodology: Run as a standalone single process with 100 key-derivation iterations for the encryption setup, using Go benchmarks against an echo tool. The cryptographic settings that add slight latency are explicitly configurable — internal or low-latency environments can tune for raw speed.
Source: Envoy AI Gateway MCP Performance and The Reality and Performance of MCP Traffic Routing.
Benchmark 2 — Control-plane scaling to 2,000 routes
Question: How many AIGatewayRoute resources can Envoy AI Gateway’s control plane handle?
Published result (February 2026): Envoy AI Gateway was tested to 2,000 AIGatewayRoutes with consistent route-readiness latency, linear and predictable CPU/memory growth, and zero routing failures. Every route was confirmed actively routing inference traffic to the correct backend.
The one tuning required: As route count grew, the xDS configuration payload between the Envoy Gateway control plane and the AI Gateway extension server exceeded the default 4MB gRPC message size. Raising it to 25MB (in both envoy-gateway-values.yaml and the AI Gateway controller values.yaml) resolved it cleanly.
Reproducible by design: The test used a Mock Cassette Server that replays static responses in place of a real LLM provider, so anyone can reproduce it without incurring provider API bills. The harness is a Go CLI that provisions the route CRDs, then validates each by sending an inference request.
Operator note: Runs were performed without headerMutation, which adds per-backend configuration; the effective route ceiling depends on per-route configuration complexity and the ~1MB Kubernetes object size limit.
Source: Benchmarking Envoy AI Gateway Control Plane Scaling.
Benchmark 3 — Sustained-load stability vs. Python proxies
Context: Many early GenAI adopters used Python-based gateways. Python’s global interpreter lock makes it difficult to handle many concurrent connections, which surfaces as instability under sustained load.
Documented behavior (LiteLLM): In documented cases, LiteLLM begins to bottleneck beyond approximately 300 RPS, with latency degrading from around 200ms to over 12 seconds under load — and adding instances did not resolve it. Under sustained traffic, LiteLLM containers have been reported to grow to 12GB and crash.
Architectural contrast: Tetrate Agent Router is built on Envoy Proxy, designed for efficient concurrent request handling, which gives it a more deterministic memory profile and improved runtime stability for long-lived deployments — fewer forced restarts, OOM events, and performance drift. Logging stays off the critical request path (written asynchronously), and policies and pricing are cached in memory so requests do not block on database calls.
These figures describe documented LiteLLM behavior, not a Tetrate-run head-to-head. For a controlled, reproducible Agent Router throughput comparison, see “Data-plane throughput & latency” below.
Source: Tetrate Agent Router vs. LiteLLM.
Data-plane throughput & latency — methodology in progress
We are preparing a controlled, reproducible benchmark of Tetrate Agent Router data-plane throughput and latency (RPS at fixed p50/p99/p99.9 latency targets, overhead vs. direct calls, and memory under sustained load), with a published harness and mock backends so it can be independently reproduced. We will publish the methodology and results here rather than cite unverifiable headline figures.
If you need throughput data for an active evaluation now, contact us and our team will share current numbers and run a test against your workload.
How competitors report performance
For an honest comparison, here is how other gateways present their numbers — cite these to their sources, and note the test conditions:
- Kong AI Gateway publishes a benchmark showing large latency advantages over Portkey and LiteLLM. Per Kong’s own public methodology, those figures use mock LLM backends under default gateway configurations to isolate proxy overhead — and Kong’s own docs recommend benchmarking with your real workload rather than relying on synthetic figures.
- Bifrost (Maxim AI) publishes a figure of approximately 11 microseconds of gateway overhead at 5,000 RPS for its self-hosted Go binary, under Maxim’s test conditions.
None of these include the other vendors’ governance policies enabled at load. A gateway’s overhead with auth, attribution, guardrails, and rate limiting active is the number that matters for production — and it is rarely the number in a headline.
Methodology principles
Every benchmark on this page follows the same principles, and we recommend you hold any vendor to them:
- State the setup — hardware, configuration, and load profile.
- Use reproducible backends — mock/replay servers so anyone can run it without provider bills.
- Report the distribution — not just averages; p99 and p99.9 reveal worst-case behavior.
- Test with policies enabled — governance-inclusive latency, not bare-proxy overhead.
- Date the results — performance changes across releases.
Now Available
Frequently asked questions
How much latency does Envoy AI Gateway add? For MCP tool calls, roughly 160–390ms over direct calls in a December 2025 benchmark, with negligible difference versus comparable cloud-native solutions — and immaterial relative to multi-second LLM reasoning time. Data-plane throughput figures for Tetrate Agent Router are being prepared with a reproducible methodology.
Are these benchmarks reproducible? Yes. The control-plane scaling benchmark uses a Mock Cassette Server and a published Go harness; the MCP benchmark uses Go benchmarks against an echo tool. Both are documented at the linked sources.
Why don’t you publish a single “requests per second” number for Agent Router? Because a single RPS number without stated hardware, configuration, and active policies is marketing, not measurement. We are publishing a reproducible data-plane benchmark rather than an unverifiable headline figure.
Related: Tetrate Agent Router · Tetrate vs. self-hosting Envoy AI Gateway · Best Enterprise AI Gateways 2026 · Who created Envoy AI Gateway?
MCP Catalog with verified first-party servers, profile-based configuration, and OpenInference observability are now generally available in Tetrate Agent Router Service . Start building production AI agents today.