When you decide to add LLM cost tracking to your app, you have two architectural choices: intercept calls at the HTTP level (proxy layer) or instrument them at the code level (SDK wrapper). Both work. They have different tradeoffs.
This guide gives you the honest comparison so you can choose the right one for your stack.
Bottom line: SDK instrumentation is faster to start and easier to reason about. Proxy-layer tracking works across all languages and services with a one-time base URL change instead of per-call instrumentation, and it is the only approach that reliably covers multi-service stacks and hard budget enforcement.
This post is part of our LLM Cost Monitoring Tools guide.
The Core Tradeoff
| Dimension | SDK-Based | Proxy-Layer |
|---|---|---|
| Setup time | 30 min (single service) | 1 hour (one-time, covers everything) |
| Language support | Per-SDK (Python, JS, etc.) | Any HTTP client |
| Multi-service coverage | Each service re-implements | Universal — one config |
| Bypass risk | High — new services skip it | Zero — all traffic routed |
| Budget enforcement | No shared state | Redis counters shared across all callers |
| Agentic loop protection | None (per-call only) | Yes — cumulative spend tracked |
| Response content access | Full (in-process) | Streaming-compatible (intercept before forward) |
| Added latency | ~0ms (in-process) | ~0.5–2ms (co-located proxy hop) |
| Vendor lock-in | Tied to provider SDK | Provider-agnostic |
SDK-Based Tracking: How It Works
You wrap your LLM API calls with a function that records the cost after each response:
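import openai  # assumes the openai>=1.0 SDK with OPENAI_API_KEY set

def chat_with_tracking(messages, feature, tenant_id):
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    # gpt-4o-mini rates in USD per token ($0.15 input / $0.60 output per 1M tokens)
    cost = (
        response.usage.prompt_tokens * 0.00000015
        + response.usage.completion_tokens * 0.0000006
    )
    # `db` is your application's database handle (not shown here)
    db.execute(
        "INSERT INTO llm_costs (feature, tenant_id, cost) VALUES (?, ?, ?)",
        [feature, tenant_id, cost],
    )
    return response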
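Hard-coded rates get awkward as soon as a second model shows up. One way to handle that, as a minimal sketch, is a small price table keyed by model name. The `ModelPrice` and `estimate_cost` names are hypothetical, and only the gpt-4o-mini rates come from the wrapper above; you would add your own models and keep the rates current.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    input_per_million: float   # USD per 1M prompt tokens
    output_per_million: float  # USD per 1M completion tokens


# Only the gpt-4o-mini entry mirrors the rates used in the wrapper above.
# Any other model must be added by you (this table is an illustration).
PRICES = {
    "gpt-4o-mini": ModelPrice(input_per_million=0.15, output_per_million=0.60),
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Return the USD cost of one call, or raise if the model has no price."""
    price = PRICES.get(model)
    if price is None:
        # Failing loudly is safer than silently logging a cost of zero.
        raise KeyError(f"no price configured for model {model!r}")
    total = (
        prompt_tokens * price.input_per_million
        + completion_tokens * price.output_per_million
    )
    return total / 1_000_000
```

Raising on an unknown model is a deliberate choice here: an unpriced model that logs as zero cost is exactly the kind of blind spot cost tracking is meant to remove.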
When SDK tracking is right:
- Single-language, single-service app
- Early stage with one or two features
- You need access to response content for logging (not just tokens)
- Your team is comfortable maintaining per-feature instrumentation
When it breaks down:
- Multiple microservices each need to re-implement the wrapper
- A new service ships without implementing it — you have a blind spot
- You need cross-service budget enforcement (SDK counters can't share state reliably across services)
- You're using multiple LLM providers — different SDKs, different response formats
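To make the last point concrete: OpenAI's chat completions report usage as `prompt_tokens` and `completion_tokens`, while Anthropic's Messages API reports `input_tokens` and `output_tokens`. An SDK wrapper has to normalize these per provider. A rough sketch, where the `normalize_usage` helper and the provider labels are hypothetical:

```python
def normalize_usage(provider: str, usage: dict) -> dict:
    """Map a provider-specific usage payload onto one common shape."""
    if provider == "openai":
        return {
            "input_tokens": usage["prompt_tokens"],
            "output_tokens": usage["completion_tokens"],
        }
    if provider == "anthropic":
        return {
            "input_tokens": usage["input_tokens"],
            "output_tokens": usage["output_tokens"],
        }
    # Every additional provider needs its own branch, in every service.
    raise ValueError(f"unsupported provider: {provider!r}")
```

Each service that wraps calls in-process needs its own copy of this mapping, which is part of why the SDK approach gets harder to maintain as providers and services multiply.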
Proxy-Layer Tracking: How It Works
Every LLM call routes through an HTTP proxy that intercepts the request, logs metadata, checks budgets, then forwards to the provider:
App → Proxy → LLM Provider
        ↓
     Log cost
     Check budget
     Apply routing
Your app code changes exactly once: update the base URL and add auth headers:
// Before: direct to OpenAI
$url = 'https://api.openai.com/v1/chat/completions';
// After: route through proxy
$url = 'https://api.tokonomics.ca/proxy/openai/chat/completions';
// + Add: Authorization: Bearer mk_your_key
// + Add: X-Feature-Name: support-bot
// + Add: X-Tenant-ID: tenant_abc123
This one change covers every LLM call from that service, existing and future.
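For a Python service, the same change might look like the sketch below. It only builds the request and does not send it. The proxy URL and header names are the ones shown above; the `build_proxy_request` helper and the assumption that the proxy accepts an OpenAI-style JSON body are illustrative, so check your proxy's documentation.

```python
import json
import urllib.request

# Proxy endpoint from the example above.
PROXY_URL = "https://api.tokonomics.ca/proxy/openai/chat/completions"


def build_proxy_request(messages, feature, tenant_id, proxy_key):
    """Build (but do not send) a chat request routed through the proxy."""
    # Assumption: the proxy forwards an OpenAI-style request body unchanged.
    body = json.dumps({"model": "gpt-4o-mini", "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        PROXY_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {proxy_key}",  # proxy key, not the provider key
            "X-Feature-Name": feature,               # attribution: which feature
            "X-Tenant-ID": tenant_id,                # attribution: which customer
        },
    )
```

Passing the result to `urllib.request.urlopen` would send it. Keeping the URL and headers in one helper means the rest of the service never needs to know whether calls go direct or through the proxy.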
When proxy tracking is right:
- Multiple services, languages, or teams making LLM calls
- You need hard budget enforcement (proxy can block at the HTTP layer)
- You're building multi-tenant SaaS with per-customer budgets
- You want model routing without changing feature code
- You need coverage guarantees — new services can't accidentally bypass it
The Multi-Service Problem
The SDK approach scales linearly with services: N services → N wrapper implementations. The proxy approach scales to N services for free.
For a SaaS with 5 microservices each making LLM calls, SDK tracking requires:
- 5 separate wrapper implementations
- 5 separate cost databases or API endpoints
- Manual aggregation to get total cost
- Any new service must implement the wrapper to be visible
The proxy approach:
- One proxy configuration covers all 5 services
- All costs aggregate automatically into one dashboard
- Any new service is automatically covered when you add the proxy URL
For most teams above 2 microservices, proxy wins on maintenance alone.
The Budget Enforcement Gap
This is where the approaches fundamentally differ. SDK-based tracking can check a budget before making a call, but it can't enforce it across concurrent callers:
# SDK approach: race condition on concurrent requests
def check_and_call(tenant_id, messages):
    current = db.get_monthly_spend(tenant_id)
    if current >= budget:  # Two threads can both read "under budget" here
        raise BudgetExceeded()
    # Both threads proceed past the check before either one records its spend
    return call_llm(messages)
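A proxy layer can run the check and the increment as a single Redis Lua script. Redis executes a script atomically, so no other caller can read the counter between the check and the write: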
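-- Redis Lua: atomic check-and-increment
-- KEYS[1] = spend counter key, ARGV[1] = cost of this call, ARGV[2] = budget cap
local current = tonumber(redis.call('GET', KEYS[1]) or "0")
local cost = tonumber(ARGV[1])
local cap = tonumber(ARGV[2])
if current + cost > cap then return "DENY" end
redis.call('INCRBYFLOAT', KEYS[1], cost)
return "ALLOW"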
For hard budget enforcement in multi-tenant SaaS or multi-service systems, proxy is the only reliable approach.
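To see why atomicity matters, here is a small in-process simulation. It is not the proxy implementation: a `threading.Lock` stands in for the guarantee Redis gives a Lua script, and the `BudgetCounter` and `simulate` names are hypothetical. Fifty concurrent callers each try to spend 1.0 against a cap of 10.0.

```python
import threading


class BudgetCounter:
    """In-memory stand-in for the shared Redis counter (illustration only)."""

    def __init__(self, cap):
        self.cap = cap
        self.spent = 0.0
        self._lock = threading.Lock()

    def try_spend(self, cost):
        # The lock makes check + increment one indivisible step, which is
        # the property the Lua script gets from Redis.
        with self._lock:
            if self.spent + cost > self.cap:
                return False  # DENY
            self.spent += cost
            return True       # ALLOW


def simulate(callers, cost, cap):
    """Run `callers` concurrent spend attempts; return (allowed_count, total_spent)."""
    counter = BudgetCounter(cap)
    results = []
    results_lock = threading.Lock()

    def worker():
        allowed = counter.try_spend(cost)
        with results_lock:
            results.append(allowed)

    threads = [threading.Thread(target=worker) for _ in range(callers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results), counter.spent
```

With the check and the increment fused, `simulate(50, 1.0, 10.0)` allows exactly 10 callers and never overshoots the cap, no matter how the threads interleave. Split the two steps apart, as in the SDK example, and that guarantee is gone.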
Frequently Asked Questions
Can I start with SDK tracking and migrate to a proxy later?
Yes. Many teams do this. SDK tracking is faster to start; proxy is the right long-term architecture. When you add a proxy later, remove the SDK tracking logic — you don't want double-counting.
Does the proxy approach work for streaming responses?
Yes. A well-implemented proxy streams chunks back to the caller transparently, intercepts the final usage object in the stream (where token counts are reported), and logs the cost after the stream completes. Your users see no difference in response behavior.
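As a rough sketch of that interception step, the generator below forwards each server-sent-event line unchanged and records the usage object when it appears. It assumes OpenAI-style `data: {...}` lines ending with `data: [DONE]`, where a late chunk carries `usage` (OpenAI sends this when the request sets `stream_options` to include usage). The `relay_and_capture_usage` name is hypothetical, and a real proxy would work on bytes from the upstream socket rather than a list of strings.

```python
import json


def relay_and_capture_usage(sse_lines, usage_out):
    """Yield every SSE line unchanged; copy any usage object into usage_out."""
    prefix = "data: "
    for line in sse_lines:
        if line.startswith(prefix):
            payload = line[len(prefix):].strip()
            if payload != "[DONE]":
                try:
                    chunk = json.loads(payload)
                except json.JSONDecodeError:
                    chunk = None  # not JSON: forward it untouched
                # Earlier chunks carry "usage": null; only a late chunk has counts.
                if isinstance(chunk, dict) and chunk.get("usage"):
                    usage_out.update(chunk["usage"])
        yield line  # the caller sees exactly what the provider sent
```

Once the generator is exhausted, `usage_out` holds the token counts and the proxy can log the cost, which is why the logging happens after the stream completes rather than before.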
What about latency? Does a proxy add meaningful overhead?
A few milliseconds at most. A round-trip to a co-located proxy (same region as your app) adds roughly 0.5–2ms. For LLM calls that take 300ms–5s, that is under 1% of total latency and imperceptible to users. If you're hyper-latency-sensitive, use SDK tracking, but that threshold is rarely crossed in practice.