Envoy AI Gateway & Tetrate Agent Router Performance Benchmarks (2026)

Q: How much latency does Envoy AI Gateway add?

For MCP tool calls, roughly 160 to 390 milliseconds over direct calls in a December 2025 benchmark, with negligible difference versus comparable cloud-native solutions and immaterial relative to multi-second LLM reasoning time. For LLM streaming data-plane overhead, an independent Broadcom benchmark measured approximately 2 milliseconds (about 0.01%) under enterprise load, and the overhead stayed flat at peak saturation.

Q: Why is there no single requests-per-second number for Tetrate Agent Router?

A single RPS number without stated hardware, configuration, and active governance policies is marketing, not measurement. Independent third-party latency validation and reproducible Tetrate harnesses are published instead of an unverifiable headline RPS figure.

Last updated: July 2026

What this page is

This is a living reference for the published, reproducible performance benchmarks of Envoy AI Gateway — the open-source data plane Tetrate co-created and maintains — and Tetrate Agent Router, the product built on it. Where we cite a number, we link the methodology so you can reproduce it. Where a benchmark is still in progress, we say so rather than substitute marketing claims.

We also summarize competitors’ published benchmarks, attributed to their own sources, so you can compare like for like. Our consistent recommendation: benchmark any gateway under your own traffic and policy load before deciding. Vendor headline numbers — including ours — are measured under specific conditions that may not match yours.

Benchmark 1 — MCP tool-call overhead

Question: How much latency does Envoy AI Gateway add to an MCP (Model Context Protocol) tool call?

Published result (December 2025): In a benchmark of a simple MCP “echo” tool interaction, Envoy AI Gateway added roughly 160–390ms over direct, unproxied calls to the upstream MCP server — comparable to other cloud-native solutions tested. The average difference between Envoy AI Gateway and a competing implementation was approximately 0.2ms.

Why this is negligible in practice: MCP tool calls are one step inside a larger LLM conversation. Because LLM reasoning typically takes several seconds, sub-millisecond differences in MCP routing overhead are immaterial to end-to-end user experience.

Methodology: Run as a standalone single process with 100 key-derivation iterations for the encryption setup, using Go benchmarks against an echo tool. The cryptographic settings that add slight latency are explicitly configurable — internal or low-latency environments can tune for raw speed.

Source: Envoy AI Gateway MCP Performance and The Reality and Performance of MCP Traffic Routing.

Benchmark 2 — Control-plane scaling to 2,000 routes

Question: How many AIGatewayRoute resources can Envoy AI Gateway’s control plane handle?

Published result (February 2026): Envoy AI Gateway was tested to 2,000 AIGatewayRoutes with consistent route-readiness latency, linear and predictable CPU/memory growth, and zero routing failures. Every route was confirmed actively routing inference traffic to the correct backend.

The one tuning required: As route count grew, the xDS configuration payload between the Envoy Gateway control plane and the AI Gateway extension server exceeded the default 4MB gRPC message size. Raising it to 25MB (in both envoy-gateway-values.yaml and the AI Gateway controller values.yaml) resolved it cleanly.

Reproducible by design: The test used a Mock Cassette Server that replays static responses in place of a real LLM provider, so anyone can reproduce it without incurring provider API bills. The harness is a Go CLI that provisions the route CRDs, then validates each by sending an inference request.

Operator note: Runs were performed without headerMutation, which adds per-backend configuration; the effective route ceiling depends on per-route configuration complexity and the ~1MB Kubernetes object size limit.

Source: Benchmarking Envoy AI Gateway Control Plane Scaling.

Benchmark 3 — Sustained-load stability vs. Python proxies

Context: Many early GenAI adopters used Python-based gateways. Python’s global interpreter lock makes it difficult to handle many concurrent connections, which surfaces as instability under sustained load.

Documented behavior (LiteLLM): In documented cases, LiteLLM begins to bottleneck beyond approximately 300 RPS, with latency degrading from around 200ms to over 12 seconds under load — and adding instances did not resolve it. Under sustained traffic, LiteLLM containers have been reported to grow to 12GB and crash.

Architectural contrast: Tetrate Agent Router is built on Envoy Proxy, designed for efficient concurrent request handling, which gives it a more deterministic memory profile and improved runtime stability for long-lived deployments — fewer forced restarts, OOM events, and performance drift. Logging stays off the critical request path (written asynchronously), and policies and pricing are cached in memory so requests do not block on database calls.

These figures describe documented LiteLLM behavior, not a Tetrate-run head-to-head. For independent data-plane latency under enterprise LLM load, see Benchmark 4 below.

Source: Tetrate Agent Router vs. LiteLLM.

Benchmark 4 — Data-plane latency (independent Broadcom validation)

Question: How much latency does the Envoy AI Gateway data plane add under realistic enterprise LLM traffic?

Published result (July 2026): An independent benchmark by Broadcom’s VMware Cloud Foundation team measured roughly 2 milliseconds of gateway overhead — about 0.01% of end-to-end latency — under sustained enterprise LLM load. The overhead stayed flat even at peak saturation. The Envoy AI Gateway is the open-source data plane behind Tetrate Agent Router Enterprise, so the result is a high-fidelity indicator of the latency that foundation adds in production.

Interactive latency under load: In a three-hour endurance run at 190 concurrent users, average time-to-first-token (TTFT) was 0.103 s, with time-per-output-token around 0.035 s. The system saturated at 224 concurrent users when TTFT climbed sharply — a GPU compute ceiling on the four-H100 test cluster, not a gateway limit. KV cache preemption stayed at zero.

Operator note: Early runs showed multi-second apparent latency from a Linux CFS throttling trap when Envoy worker threads exceeded the container CPU quota. Pinning Envoy --concurrency to allocated CPU limits eliminated that overhead. Teams running Envoy-based AI gateways on Kubernetes should align concurrency with container limits.

What this is (and is not): This is third-party validation of data-plane overhead under production-like LLM streaming traffic. It is not a Tetrate-run head-to-head with every governance policy enabled, and it does not replace benchmarking under your own traffic and policy load.

Sources: Envoy AI Gateway Latency Benchmark · Beyond Benchmarks (VMware Cloud Foundation).

Does gateway latency actually affect user experience?

Some AI gateways market single-digit microseconds against a mock backend under steady load. Those can be strong engineering results, but a mock-backend test and a production-like validation are not the same measurement. The VMware run used real GPU inference, bursty agentic and human traffic, and a high-cardinality dataset — the useful question is whether the overhead gap is one a user can perceive.

In this benchmark, full responses averaged around 14 seconds and TTFT was 0.103 seconds. Gateway overhead is a rounding error against that budget:

Component of a request	Typical magnitude
Model generation (full response)	seconds
Time-to-first-token	~100 milliseconds
Gateway overhead (Envoy AI Gateway, this benchmark)	~2 milliseconds
Gateway overhead (microsecond-class gateways)	fractions of a millisecond

A user cannot feel 2 ms versus 0.05 ms when the model takes 14,000 ms. Microsecond efficiency matters mainly for extremely high-throughput pipelines or non-LLM proxy hops. For enterprise LLM and agent workloads, once overhead is below the perception threshold — and ~2 ms is — what shapes experience is flat latency under load, clean failover, and security and cost controls on the same data path. The benchmark’s most useful finding is not that overhead is small; it is that it stays flat and predictable to the compute ceiling.

How competitors report performance

For an honest comparison, here is how other gateways present their numbers — cite these to their sources, and note the test conditions:

Kong AI Gateway publishes a benchmark showing large latency advantages over Portkey and LiteLLM. Per Kong’s own public methodology, those figures use mock LLM backends under default gateway configurations to isolate proxy overhead — and Kong’s own docs recommend benchmarking with your real workload rather than relying on synthetic figures.
Bifrost (Maxim AI) publishes a figure of approximately 11 microseconds of gateway overhead at 5,000 RPS for its self-hosted Go binary, under Maxim’s test conditions.

None of these include the other vendors’ governance policies enabled at load. A gateway’s overhead with auth, attribution, guardrails, and rate limiting active is the number that matters for production — and it is rarely the number in a headline.

Methodology principles

Every benchmark on this page follows the same principles, and we recommend you hold any vendor to them:

State the setup — hardware, configuration, and load profile.
Use reproducible backends — mock/replay servers so anyone can run it without provider bills.
Report the distribution — not just averages; p99 and p99.9 reveal worst-case behavior.
Test with policies enabled — governance-inclusive latency, not bare-proxy overhead.
Date the results — performance changes across releases.

Now Available

MCP Catalog with verified first-party servers, profile-based configuration, and OpenInference observability are now generally available in Tetrate Agent Router Service. Start building production AI agents today with $5 free credit.

Frequently asked questions

How much latency does Envoy AI Gateway add? For MCP tool calls, roughly 160–390ms over direct calls in a December 2025 benchmark, with negligible difference versus comparable cloud-native solutions — and immaterial relative to multi-second LLM reasoning time. For LLM streaming data-plane overhead, an independent Broadcom benchmark measured approximately 2 milliseconds (~0.01%) under enterprise load, flat at peak saturation.

Are these benchmarks reproducible? Yes. The control-plane scaling benchmark uses a Mock Cassette Server and a published Go harness; the MCP benchmark uses Go benchmarks against an echo tool. The data-plane latency results are documented in the Broadcom/VMware methodology and summarized in our latency benchmark post. Both Tetrate-published harnesses are documented at the linked sources.

Why don’t you publish a single “requests per second” number for Agent Router? Because a single RPS number without stated hardware, configuration, and active policies is marketing, not measurement. Independent third-party latency validation and reproducible Tetrate harnesses are published here instead of an unverifiable headline RPS figure.

Does a few microseconds of gateway overhead matter for user experience? Rarely, for LLM workloads. Model generation takes seconds and time-to-first-token is around 100 milliseconds, so the difference between a 2-millisecond gateway and a microsecond-class one falls far below human perception. What affects experience more is whether overhead stays flat under load, fails over cleanly, and enforces security and cost controls without a second data path. Raw microsecond overhead matters mainly in extremely high-throughput or non-LLM proxy scenarios.