You shipped an AI feature. It worked great. Then the bill arrived and it was nothing like what you estimated.
This happens to nearly every team that ships AI without metering in place. Not because LLMs are expensive in isolation — token prices have dropped 90%+ in three years. It happens because a few specific, predictable patterns silently multiply costs once a feature is live. Most teams hit two or three of them at once without realizing it.
This post names each pattern, shows the data behind it, and gives you the concrete fix for each one.
Key Takeaways
- 85% of organizations miss AI cost forecasts by more than 10%; 25% exceed budgets by 50%+ (CIO.com, October 2025)
- Token consumption per task jumped 10x–100x since 2023 due to agentic workflows — prices fell but bills rose (Adam Holter analysis, 2025)
- ProjectDiscovery cut LLM spend by 59–70% just by raising their cache hit rate from 7% to 84% (ProjectDiscovery Engineering, 2025)
- One developer reduced their API bill from $420 to $28/month — 93% — through model routing and caching (DEV.to, 2025)
The Numbers Behind the Shock
Unexpected AI bills aren't rare edge cases — they're the norm. In 2025, CIO.com reported that 85% of organizations miss their AI cost forecasts by more than 10%, and nearly 25% blow past their budgets by 50% or more (October 2025).
The macro picture is just as stark. In 2025, CloudZero's State of AI Costs survey found that 68% of executives acknowledged overspending on GenAI, and 72% said cloud spending had become increasingly unmanageable. The same report found that only 51% of companies can accurately track AI ROI — and 15% have no formal cost-tracking system at all.
None of this is inevitable. Each cause has a specific fix — and the fix applies regardless of which LLM provider you're using or what language your app is written in.
Citation capsule: In 2025, CloudZero surveyed over 400 organizations and found that 68% acknowledged overspending on GenAI and only 51% could accurately track AI ROI (CloudZero, State of AI Costs, 2025). Separately, CIO.com reported that 80%+ of companies saw AI costs erode gross margins by more than 6%, with 25% experiencing drops of 16%+ (October 2025).
Root Cause #1: Agentic Workflows Multiply Token Counts by 10x–100x
Your original estimate was based on a single prompt and response. Agentic workflows don't work that way. Planning, tool use, memory retrieval, self-reflection, and retry steps all add tokens — and they add them invisibly, outside your original mental model of "one call, one response."
In 2025, Adam Holter's analysis of production AI cost patterns found that token consumption per task jumped 10x–100x since December 2023 due to these agentic patterns — even as per-token prices fell. Cheaper tokens, more expensive workflows.
The specific patterns that explode costs:
- Full conversation history on every call — passing the entire message thread as context for every reply instead of summarizing it
- Retrieval-augmented generation without chunking — injecting large document chunks that overwhelm the relevant passage
- Multi-step chains with no intermediate caching — recomputing the same system context on every step
- Verbose tool-use scaffolding — tool descriptions, schemas, and examples that add 2,000–5,000 tokens to every agentic call
From our data: On a 6-step agentic workflow we audited, the system context alone accounted for 71% of total input tokens. Compressing the tool descriptions and caching the system prompt reduced per-workflow cost by 64% with no change in output quality.
The fix: instrument each step separately. You can't optimize what you can't see. Once you know which step is burning the most tokens, the remediation is usually obvious.
Root Cause #2: You're Not Caching Repeated Queries
In 2025, Pluralsight research found that 31% of enterprise LLM queries are semantically similar to previous requests — yet most organizations handle them with full-price API calls instead of caching. That's nearly one-third of your spend going to work the model has already done.
The numbers from teams that have fixed this are dramatic. In 2025, ProjectDiscovery raised their cache hit rate from 7% to 84% and reduced real LLM spend by 59–70%, serving 9.8 billion tokens from cache instead of the API.
Two types of caching apply here:
Provider-level prompt caching — both OpenAI and Anthropic offer this natively. OpenAI caches identical input prefixes of 1,024+ tokens at 50% off. Anthropic caches at 90% off on cache reads. If your system prompt is the same across requests (it almost certainly is), you're paying full price for tokens the model has already processed.
Semantic caching — for queries that are similar but not identical ("how do I reset my password" vs "forgot my password"), a vector similarity cache returns a cached response without hitting the LLM at all. The hit rate on typical support workloads is 30–50%.
Citation capsule: In 2025, ProjectDiscovery increased their prompt cache hit rate from 7% to 84% across 9.8 billion tokens, cutting real LLM spend by 59–70% (ProjectDiscovery Engineering, How We Cut LLM Cost With Prompt Caching, 2025). Separately, Pluralsight research found 31% of enterprise LLM queries are semantically similar to previous requests, yet most organizations handle them with full-price API calls instead of caching.
Root Cause #3: The Wrong Model Is Doing the Work
Not every query needs GPT-4o. Most of them don't. But without routing logic, every query gets the same model — the one you hardcoded when you shipped the feature.
In 2025, Pluralsight's cost metering analysis found that routing 85% of queries to cheap models, 10% to mid-tier, and 5% to premium reduces average inference cost by up to 97% compared to routing all traffic to GPT-4o. Real-world case studies show a developer reducing their API bill from $420 to $28/month — 93% savings — through tiered routing with a 62% cache hit rate.
The routing logic doesn't have to be complex:
- Simple queries (FAQ answers, classification, short extraction) → GPT-4o-mini or DeepSeek V4-Flash ($0.14–$0.15/1M input)
- Standard conversational tasks (summarization, drafts, chat) → Claude Haiku 4.5 or GPT-4o-mini
- Complex reasoning, code generation, long-form → GPT-4o or Claude Sonnet 4.6
A proxy layer that intercepts each call and routes based on query complexity or explicit feature tags is the cleanest implementation. You get cost reduction without changing any code in your features themselves.
This post is part of our Complete Guide to LLM API Cost Management — covering the full lifecycle from pricing to monitoring to governance.
See our DeepSeek vs GPT-4o comparison for a workload-by-workload routing decision matrix — including the data privacy risk you need to know about.
The Fix: Four Changes, Ranked by Impact
Ranked by implementation effort vs impact:
1. Enable provider-level prompt caching (lowest effort, immediate savings) Check whether your system prompt is at least 1,024 tokens. If yes, you're eligible for automatic caching on OpenAI (50% off cached reads) and explicit cache control on Anthropic (90% off). Zero code changes to your features. Takes an afternoon.
2. Add model routing by task type (medium effort, highest impact)
Create a routing layer — even a simple if/else based on a feature tag — that sends simple queries to GPT-4o-mini or DeepSeek and reserves GPT-4o for complex ones. Done right, this alone cuts costs 30–97% depending on your workload mix.
3. Add semantic caching for repeated queries (medium effort, high impact on support/FAQ workloads) If you have a customer support bot or FAQ assistant, semantic caching returns cached responses for similar queries without hitting the LLM. A 40–70% cost reduction is typical (Pluralsight, 2025).
4. Add real-time cost metering with alerts (foundational — enables all other fixes) None of the above optimizations are sustainable without visibility. If you don't know which feature is driving the bill, you're guessing. A proxy-layer metering tool captures cost per call, per feature, and per model — and fires alerts before you blow a budget threshold.
From our testing: Teams that add metering first — before attempting to optimize — find 2–3x more savings than teams that optimize blind. You can't fix what you can't measure.
Citation capsule: In 2025, a developer documented reducing their LLM API bill from $420 to $28/month — a 93% reduction — through tiered model routing with a 62% cache hit rate (DEV.to, 2025). Pluralsight's analysis found that routing 85% of traffic to budget models cuts average inference cost by up to 97% compared to routing everything to a frontier model.
Frequently Asked Questions
Why did my AI bill increase even though token prices went down?
Token prices fell, but token consumption went up — especially if you added agentic features. In 2025, Adam Holter's analysis found that token consumption per task jumped 10x–100x due to agentic workflows adding planning, tool use, and retrieval steps that weren't in the original prompt (Adam Holter, 2025). Cheaper tokens, more of them. The bill goes up.
How much does prompt caching actually save?
It depends on your system prompt size and hit rate. Anthropic charges $0.10/1M tokens on cache reads (90% off). OpenAI charges 50% off on cached input prefixes. ProjectDiscovery achieved 59–70% real-spend reduction just by raising their cache hit rate from 7% to 84%. For most apps with stable system prompts and moderate traffic, $200–$800/month in savings is common.
What is the fastest single fix to reduce my AI bill?
Enable provider-level prompt caching on your existing LLM calls. If your system prompt is over 1,024 tokens — which it almost certainly is if your feature is mature — you start saving immediately with no other changes. On Anthropic, that's a 90% discount on every cached read. On OpenAI, it's 50% off automatically on prompts over 1,024 tokens.
How do I know which feature is causing the cost spike?
Add a metadata tag to each LLM call — feature name, user tier, request type — and aggregate costs by tag. Without this, you're looking at a total monthly number with no way to trace it to a specific feature or team. A proxy-layer tool like Tokonomics adds this instrumentation at the API layer so you don't have to modify each feature individually.
How bad can AI cost overruns actually get?
Worse than you'd expect. In 2025, CIO.com reported that over 80% of companies saw AI costs erode gross margins by more than 6%, with 25%+ experiencing drops of 16% or more. Average monthly enterprise AI spend jumped 36% in a single year, from $63,000 to $85,500, per CloudZero's 2025 survey.
The Bottom Line
Your AI bill went up for predictable reasons. Agentic workflows multiplied your token counts. Caching wasn't enabled. The same frontier model was handling queries that a $0.15/1M model could answer just as well. And without metering, none of this was visible until the invoice landed.
The fixes exist. They're not exotic. The teams that apply them systematically cut bills by 60–97% — not by switching providers or degrading output quality, but by stopping the waste that was happening silently on every request.
Start with metering. Once you can see where the money is going, the right fixes become obvious.
Tokonomics gives you that visibility in under 5 minutes: swap your base URL, tag your features, and get real-time cost breakdowns with budget alerts before the next invoice arrives — on any LLM provider, any language, any stack.
Sources: CloudZero State of AI Costs 2025 | CIO.com — AI Cost Overruns | Pluralsight — Meter Before You Manage | ProjectDiscovery — Prompt Caching Case Study | Adam Holter — AI Costs in 2025 | DEV.to — $420 to $28/month
All sources retrieved June 2026.
About the authors: This post was written by the engineering team behind Tokonomics — built after we hit a $47,000 LLM invoice we didn't see coming. About Tokonomics →
Editorial standards: All statistics are verified against primary sources at time of publication. Contact us →