Tags: benchmarks, latency, proxy. Published June 11, 2026. 5 min read.

We Benchmarked Our LLM Proxy Latency: 31ms Overhead on Real API Calls


TL;DR: We ran 10 identical LLM calls directly against DeepSeek and 10 through the Tokonomics proxy. Median overhead: 8.4ms. Average overhead: 31.2ms (3.6%). On every call the proxy records token usage, calculates cost, checks budget caps, and evaluates alerts, all for about the latency of a typical DNS lookup.


Why We Published This Data

Every cost-tracking proxy makes the same claim: "negligible overhead." Few publish the numbers. We decided to run the benchmark ourselves, on our production server, with real API calls, and publish every data point.

If you're evaluating whether a metering proxy is worth the latency cost, this is the data you need to make that decision.


Test Setup

We ran this benchmark on June 11, 2026, from our production server.

| Parameter | Value |
| --- | --- |
| Server | Hetzner CPX22 (Falkenstein, Germany) |
| Provider tested | DeepSeek (deepseek-chat) |
| Prompt | "Say hello and nothing else" (8 tokens) |
| Max output tokens | 5 |
| Iterations | 10 per test (direct + proxy) |
| Measurement | PHP hrtime() (nanosecond precision) |
| Proxy work per call | Auth lookup, rate limit check, budget cap check, request forwarding, usage recording, cost calculation, alert evaluation |

The prompt was intentionally minimal. We wanted to isolate proxy overhead from model inference time. A short prompt keeps model processing fast, so the proxy overhead makes up a larger share of the total. That makes this a worst-case test for the overhead ratio.
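The measurement approach can be sketched in a few lines. The benchmark itself uses PHP's hrtime(); this Python sketch substitutes time.perf_counter_ns() as a comparable monotonic clock. The names time_calls and fake_llm_call are hypothetical and are not taken from the benchmark script, and the placeholder call sleeps instead of hitting a real API so the sketch runs offline.

```python
import time
from typing import Callable, List


def time_calls(call: Callable[[], object], iterations: int = 10) -> List[float]:
    """Run `call` repeatedly and return each wall-clock duration in milliseconds."""
    durations_ms = []
    for _ in range(iterations):
        start_ns = time.perf_counter_ns()  # monotonic clock, nanosecond units
        call()
        elapsed_ns = time.perf_counter_ns() - start_ns
        durations_ms.append(elapsed_ns / 1_000_000)  # ns -> ms
    return durations_ms


def fake_llm_call() -> None:
    """Placeholder for a real HTTP request (assumption: replace with your client)."""
    time.sleep(0.005)


if __name__ == "__main__":
    samples = time_calls(fake_llm_call, iterations=10)
    print(f"{len(samples)} samples, fastest {min(samples):.1f}ms")
```

Timing the direct call and the proxied call with the same wrapper keeps the comparison fair, since both paths pay the same measurement cost.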


Raw Results

Direct calls (server → DeepSeek → server)

| Metric | Value |
| --- | --- |
| Average | 873.3ms |
| Median | 862.0ms |
| P95 | 1,038.1ms |
| Min | ~780ms |
| Max | ~1,100ms |

Proxy calls (server → Tokonomics → DeepSeek → server)

| Metric | Value |
| --- | --- |
| Average | 904.5ms |
| Median | 870.4ms |
| P95 | 1,152.6ms |

Overhead

| Metric | Value |
| --- | --- |
| Average overhead | 31.2ms |
| Overhead as % of total | 3.6% |
| Median overhead | 8.4ms |
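The overhead figures follow directly from the two result tables. As a quick check, the arithmetic can be reproduced with the published numbers. The 3.6% figure matches the average overhead divided by the direct average; the variable names here are illustrative only.

```python
# Figures copied from the direct and proxy result tables (milliseconds).
direct = {"average": 873.3, "median": 862.0, "p95": 1038.1}
proxy = {"average": 904.5, "median": 870.4, "p95": 1152.6}

# Overhead is the proxied figure minus the direct figure for the same metric.
overhead = {metric: round(proxy[metric] - direct[metric], 1) for metric in direct}

# Relative overhead, expressed against the direct average.
overhead_pct = round(overhead["average"] / direct["average"] * 100, 1)

print(overhead)      # {'average': 31.2, 'median': 8.4, 'p95': 114.5}
print(overhead_pct)  # 3.6
```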

What the Proxy Does in Those 31 Milliseconds

That 31ms average is not empty network hop time. During each proxied call, Tokonomics performs seven operations:

  1. Authentication — SHA-256 hash the Bearer token, look up the API key, join to the tenant record
  2. Rate limit check — Redis sliding window counter per API key per minute
  3. Budget cap check — Redis counter for current month spend vs hard cap
  4. Request forwarding — cURL to DeepSeek with streaming passthrough
  5. Usage extraction — parse the response for token counts (input, output, cache)
  6. Cost calculation — multiply tokens by the model's per-token rate from the pricing table
  7. Alert evaluation — check if current spend crosses any configured threshold

All seven steps run synchronously on the hot path. The one piece of per-call work that does not is usage recording: the database INSERT runs after the response has been returned to the caller, so it adds nothing to user-facing latency.
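To make the hot-path checks concrete, here is a minimal sketch of the auth hash, sliding-window rate limit, budget cap check, cost calculation, and alert evaluation. It is not the Tokonomics implementation: plain in-memory dictionaries stand in for the database and Redis, every name is hypothetical, and the limit and per-token prices are made-up values.

```python
import hashlib
from collections import defaultdict, deque
from typing import List, Optional

# Hypothetical in-memory stand-ins for the database and Redis.
API_KEYS = {}                     # sha256(token) -> tenant id
REQUEST_LOG = defaultdict(deque)  # key hash -> timestamps of recent requests
MONTH_SPEND = defaultdict(float)  # tenant id -> spend so far this month (USD)

RATE_LIMIT_PER_MINUTE = 60                                 # assumed limit
PRICE_PER_TOKEN = {"input": 0.000001, "output": 0.000002}  # made-up rates


def hash_token(token: str) -> str:
    """SHA-256 the Bearer token so the raw key is never stored or compared."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def authenticate(token: str) -> Optional[str]:
    """Return the tenant id for a token, or None if the key is unknown."""
    return API_KEYS.get(hash_token(token))


def within_rate_limit(key_hash: str, now: float) -> bool:
    """Sliding window: allow the request if fewer than the limit hit in the last 60s."""
    window = REQUEST_LOG[key_hash]
    while window and now - window[0] >= 60:
        window.popleft()  # drop timestamps that fell out of the window
    if len(window) >= RATE_LIMIT_PER_MINUTE:
        return False
    window.append(now)
    return True


def within_budget(tenant: str, hard_cap: float) -> bool:
    """Compare current month spend against the tenant's hard cap."""
    return MONTH_SPEND[tenant] < hard_cap


def calculate_cost(input_tokens: int, output_tokens: int) -> float:
    """Multiply token counts by the per-token rates."""
    return (input_tokens * PRICE_PER_TOKEN["input"]
            + output_tokens * PRICE_PER_TOKEN["output"])


def crossed_thresholds(before: float, after: float, thresholds: List[float]) -> List[float]:
    """Return the alert thresholds that this call's spend pushed the tenant past."""
    return [t for t in thresholds if before < t <= after]
```

Each function is a dictionary lookup or a short loop, which is consistent with the claim that these checks cost little next to the upstream call.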


Why Median Matters More Than Average

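The average overhead (31.2ms) is higher than the median (8.4ms) because a small number of slow requests pull the average up. The median is the better guide to what a typical request experiences, and an 8ms addition to an 862ms LLM call is not something a user will notice.

The gap between the two P95 figures (1,152.6ms - 1,038.1ms = 114.5ms) is much larger than the median overhead. With only 10 samples per test, P95 is set by the slowest one or two requests, so we read this gap as occasional network jitter between the proxy and the upstream provider rather than a consistent cost. We expect P95 overhead to settle closer to the median over thousands of production calls, though this benchmark is too small to confirm that.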
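A small example shows how a single outlier separates the mean from the median. The numbers below are synthetic and chosen for illustration; they are not the benchmark's raw data.

```python
import statistics

# Synthetic overheads in ms (NOT the benchmark's raw data):
# nine typical requests plus one slow outlier.
overheads_ms = [5, 6, 7, 8, 9, 10, 11, 12, 13, 219]

mean_ms = statistics.mean(overheads_ms)      # pulled upward by the outlier
median_ms = statistics.median(overheads_ms)  # barely affected by it

# The same data with the outlier removed, for comparison.
mean_without_outlier = statistics.mean(overheads_ms[:-1])

print(mean_ms, median_ms, mean_without_outlier)  # 30 9.5 9
```

One slow request out of ten more than triples the mean while the median stays near the typical value, which is the same pattern seen in the 31.2ms average and 8.4ms median.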


How This Compares

For context, here are common latency costs that every API call already includes:

| Source | Typical latency |
| --- | --- |
| DNS resolution | 20-120ms |
| TLS handshake | 30-50ms |
| Geographic routing (US→EU) | 80-150ms |
| Tokonomics proxy overhead | 8-31ms |
| CDN edge lookup | 5-20ms |

At 8-31ms, the proxy overhead sits at the low end of a DNS lookup and roughly at or below the cost of a TLS handshake. For LLM calls that typically take 500ms-3,000ms, that is within normal network variance.


What About Longer Prompts?

Our benchmark used an 8-token prompt. Real production prompts range from 500 to 10,000+ tokens. Longer prompts mean longer model inference time (1-5 seconds), which makes the proxy overhead percentage even smaller.

| Scenario | Model latency | Proxy overhead | Overhead % |
| --- | --- | --- | --- |
| Short prompt (8 tokens) | 862ms | 31ms | 3.6% |
| Medium prompt (2,000 tokens) | ~2,500ms | ~31ms | ~1.2% |
| Long prompt (8,000 tokens) | ~5,000ms | ~31ms | ~0.6% |

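We expect the overhead to stay roughly constant regardless of prompt length, because it is dominated by the auth lookup and Redis checks rather than by payload size. The medium and long rows are projections under that assumption, not measurements. If it holds, the longer your prompts, the lower the percentage cost.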
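The percentages in the table are simply the overhead divided by the model latency. A short sketch, assuming a fixed 31ms overhead and the latencies listed in the table (the function name is illustrative):

```python
def overhead_percent(overhead_ms: float, model_latency_ms: float) -> float:
    """Proxy overhead as a percentage of model latency."""
    return overhead_ms / model_latency_ms * 100


# Model latencies in ms from the table; the medium and long values are estimates.
scenarios = {"short": 862, "medium": 2500, "long": 5000}

for name, latency_ms in scenarios.items():
    print(f"{name}: {overhead_percent(31, latency_ms):.1f}%")
# short: 3.6%
# medium: 1.2%
# long: 0.6%
```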


Reproducing This Benchmark

The benchmark script is open source. Run it against your own setup:

export DEEPSEEK_API_KEY="sk-your-key"
export TOKONOMICS_API_KEY="mk_your-key"

php scripts/benchmark_latency.php

The script outputs a JSON file with raw latency data for every request, so you can run your own statistical analysis.
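For your own analysis, the summary statistics can be computed with Python's standard library. The layout of the JSON file is not described here, so this sketch takes a plain list of latencies in milliseconds and leaves loading to you. The summarize name is hypothetical, and the P95 uses the inclusive interpolation method of statistics.quantiles, which may differ from the method the benchmark script uses.

```python
import statistics
from typing import Dict, List


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Return average, median, and P95 for a list of latencies in ms."""
    # n=20 yields 19 cut points at 5% steps; index 18 is the 95th percentile.
    p95 = statistics.quantiles(latencies_ms, n=20, method="inclusive")[18]
    return {
        "average": statistics.mean(latencies_ms),
        "median": statistics.median(latencies_ms),
        "p95": p95,
    }


if __name__ == "__main__":
    # Placeholder values, not benchmark data.
    example = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
    print(summarize(example))
```

With only 10 samples, the P95 value depends heavily on the interpolation method, so compare like with like when checking your numbers against the published ones.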


What This Means for Your Decision

If you're deciding whether to add a cost-tracking proxy to your LLM stack, here's the tradeoff:

You pay: 8-31ms per call (unnoticeable to end users on typical LLM latencies).

You get: Real-time cost tracking, per-feature attribution, budget alerts before overspend, hard spending caps, and rate limiting — across every provider, every model, every call.

For teams spending $500+/month on LLM APIs, the cost visibility alone typically saves 20-40% through model optimization and zombie endpoint detection. At $500/month, that's $100-200/month in savings in exchange for 31ms of average latency.


Frequently Asked Questions

Does the proxy add latency to streaming responses?

The proxy streams chunks back as they arrive from the provider — it does not buffer the full response. The initial time-to-first-token includes the proxy overhead, but subsequent chunks flow through with no additional delay.
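The passthrough pattern can be sketched as a generator that forwards each chunk as soon as it arrives and does its bookkeeping only after the stream ends. This is an illustration of the idea, not the proxy's code; stream_passthrough and on_complete are hypothetical names, and the byte count stands in for the real usage extraction.

```python
from typing import Callable, Iterable, Iterator


def stream_passthrough(
    upstream: Iterable[bytes],
    on_complete: Callable[[int], None],
) -> Iterator[bytes]:
    """Yield upstream chunks immediately; report the total size once the stream ends."""
    total_bytes = 0
    for chunk in upstream:
        total_bytes += len(chunk)
        yield chunk  # forwarded right away, nothing is buffered
    on_complete(total_bytes)  # runs only after the last chunk was delivered


if __name__ == "__main__":
    for piece in stream_passthrough([b"Hel", b"lo"], lambda n: print(f"\n{n} bytes")):
        print(piece.decode(), end="")
```

Because the generator yields before it reads the next chunk, the caller sees the first chunk without waiting for the rest of the response.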

Will latency increase as Tokonomics scales?

The proxy is stateless per request. Auth lookups hit a database index (SHA-256 hash), and rate limit/budget checks use Redis (sub-millisecond). Neither operation scales with the number of tenants or historical data volume.

Can I run the proxy closer to my servers to reduce overhead?

Currently Tokonomics runs on Hetzner in Germany. If your servers are also in Europe, overhead will be minimal (as shown). US-based servers may see an additional 80-100ms from geographic routing — but this is network latency, not proxy processing time.


Benchmark conducted June 11, 2026 on Hetzner CPX22 (Falkenstein, DE). Raw data is available in the JSON output of the benchmark script. All measurements use PHP hrtime() with nanosecond precision. Source: Tokonomics internal testing.

About the author
Founder & CTO at Tokonomics. Built the proxy after a $47,000 LLM invoice blindsided his team. Benchmarks every release for latency regression.