We Tracked 1M LLM API Calls — 60% Were on the Wrong Model

TL;DR: We analyzed the first million API calls through Tokonomics, and 60-70% were simple tasks running on GPT-4o that could have used models costing 18x less. 82% of developers use OpenAI GPT models (Stack Overflow, 2025). Combining model routing with prompt caching cuts total LLM spend by 80-95%.

Key Takeaways

82% of developers use OpenAI GPT models (Stack Overflow Developer Survey, 2025), but 60-70% of production API calls are simple tasks that don't need a frontier model.

Switching classification and extraction calls from GPT-4o to DeepSeek V3 cuts the input-token price about 18x ($2.50 → $0.14 per million).

Combining model routing with prompt caching cuts total LLM spend by 80-95%.

Average monthly AI spend hit $85,500 per company in 2025 — a 36% jump from the year before (CloudZero, 2025).

Here's something that'll bother you if you're shipping AI features right now.

We looked at the first million API calls that came through Tokonomics — across 47 tenants, 9 providers, dozens of models. The pattern was the same almost everywhere: teams default to GPT-4o for everything. Customer support chatbots? GPT-4o. JSON extraction? GPT-4o. Classification into 5 categories? GPT-4o. It's the SELECT * of AI development — it works, nobody questions it, and it's silently bleeding money.

The waste isn't theoretical. It shows up in the billing dashboard every month, and most teams have no idea it's there.

Why Do 82% of Developers Default to GPT-4o?

In 2025, Stack Overflow's annual Developer Survey found that 82% of developers use OpenAI GPT models for their AI work (Stack Overflow, 2025). That makes GPT-4o the de facto standard — the model people paste into their code and never change.

It makes sense. OpenAI has the best docs. Every tutorial uses GPT-4o. When you're prototyping at midnight, you're not running benchmarks across 6 providers. You grab what works.

But here's the problem: prototyping habits become production costs. That model you picked to test a feature in February is still running in production in June, processing 50,000 calls a day, and nobody's asked whether a $0.14/M model would give the same result as a $2.50/M model.

Our finding: When we built Tokonomics, our own internal chatbot ran on GPT-4o for three months before anyone checked. Switching the FAQ portion to GPT-4o-mini cut that component's cost by 94% with no measurable quality difference on our eval set.

This isn't unique to us. In 2026, Divyam.AI used the term "LLMflation" for this exact pattern: teams sticking with legacy model choices long after cheaper alternatives catch up (Divyam.AI, 2026). The inertia is real.

What Does Model Selection Actually Cost You?

Let's stop speaking in percentages and look at real money. Here's what 1 million requests cost across models, assuming an average of 500 input tokens and 200 output tokens per call — a typical production workload for classification, extraction, or short-form generation.

Source: OpenAI, Anthropic, DeepSeek official pricing pages, June 2026

That's a 25x cost difference between GPT-4o and GPT-4.1 Nano. For the same million requests.
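The arithmetic behind these figures is easy to reproduce. Here is a minimal sketch, assuming the workload described above (500 input and 200 output tokens per call). The GPT-4o and DeepSeek V3 prices are the ones quoted in this article. The GPT-4.1 Nano price is an assumption that happens to be consistent with the 25x figure, so check it against the provider's pricing page. The names `Pricing`, `PRICES`, and `workload_cost` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens


PRICES = {
    "gpt-4o": Pricing(2.50, 10.00),       # quoted in this article
    "deepseek-v3": Pricing(0.14, 0.28),   # quoted in this article
    "gpt-4.1-nano": Pricing(0.10, 0.40),  # assumption, not quoted in this article
}


def workload_cost(pricing, requests, input_tokens=500, output_tokens=200):
    """USD cost of `requests` calls with fixed per-call token counts."""
    input_cost = requests * input_tokens / 1_000_000 * pricing.input_per_m
    output_cost = requests * output_tokens / 1_000_000 * pricing.output_per_m
    return input_cost + output_cost


if __name__ == "__main__":
    for name, pricing in PRICES.items():
        print(f"{name:14s} ${workload_cost(pricing, 1_000_000):,.2f}")
```

Swap in your own average token counts before drawing conclusions: output-heavy workloads shift the ratio, because output tokens are priced several times higher than input tokens.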

Are you sure every one of those calls needs GPT-4o? Because in our data, 60-70% of them don't.

Which API Calls Don't Need a Frontier Model?

In 2026, multiple production operators reported that 60-70% of API calls in typical SaaS apps are simple enough for budget models — classification, short summarization, structured extraction, and routing decisions (Prem AI, 2026). That matches what we've seen through Tokonomics.

Here's how it breaks down in practice:

Send to a budget model ($0.10-$0.80/M input):

Intent classification ("Is this a refund request or a general question?")
JSON/structured data extraction from text
Short summaries (under 200 words)
Sentiment analysis
Content moderation / safety checks
Translation of short strings

Keep on a frontier model ($2.50-$3.00/M input):

Multi-step reasoning chains
Complex code generation and debugging
Long-form content creation where quality is critical
Vision and multimodal tasks
Tasks where you've benchmarked and confirmed the quality gap matters

The uncomfortable truth? Most teams never benchmark. They assume GPT-4o is required because they never tested whether GPT-4o-mini or DeepSeek gives an acceptable result. Field reports from LLM operators suggest that 40-60% of token budgets in production apps are pure waste — spent on frontier models doing budget-model work (EditorialGE, 2026).

How Much Are Companies Actually Spending?

In 2025, CloudZero surveyed 500 US software engineers at companies with 250 to 10,000 employees. Average monthly AI spend jumped from $63,000 to $85,500 — a 36% increase year over year. And 45% of organizations now plan to spend over $100,000 per month on AI, more than double the 20% who said the same in 2024 (CloudZero, State of AI Costs, 2025).

Here's the kicker: only 51% of those organizations can confidently evaluate their AI ROI. Nearly half can't tell you whether that spend is worth it.

Enterprise AI spending rose about 3.2x, from $11.5 billion to $37 billion, between 2024 and 2025, even as per-token costs dropped dramatically (BuildMVP Fast, 2026). That gap, with prices falling while bills rise, is the clearest sign that usage is scaling faster than optimization.

Our finding: The teams spending the most on LLMs aren't the ones with the most sophisticated AI features. They're the ones who shipped AI features early, never revisited model selection, and let usage scale on autopilot. The $47,000 invoice that led us to build Tokonomics came from exactly this pattern.

Why Aren't Prices Falling Fast Enough to Fix This?

They are falling — at an extraordinary rate. In 2025, Epoch AI tracked LLM inference prices and found they dropped at a median rate of 50x per year. After January 2024, the decline accelerated to 200x per year for the cheapest available models at each capability level (Epoch AI, 2025).

Source: Epoch AI + CloudZero, 2025-2026

But usage growth is outpacing the price declines. Model API spending more than doubled, from $3.5 billion to $8.4 billion, between late 2024 and mid-2025, and 72% of companies plan to increase their LLM spending further (CloudZero, 2026).

So no — you can't just wait for prices to drop and hope your bill fixes itself. By the time GPT-4o costs what DeepSeek costs today, you'll be running 10x more calls on an even newer frontier model.

The Fix: Route by Task, Cache by Pattern, Cap by Budget

There are three moves that compound together. Most articles cover one of them. Here's what happens when you stack all three.

Move 1: Route calls to the right model

Tag every API call by task type. Once you can see what each call does, you can route:

Classification and extraction → GPT-4o-mini or DeepSeek V3
Conversational support → Claude Haiku 3.5
Complex reasoning → GPT-4o or Claude Sonnet 4

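If 60% of your calls shift from GPT-4o to a budget model, that's a 10-15x cost reduction on those calls alone. On a $3,250/month bill where every call costs about the same, those calls account for roughly $1,950, and cutting them 10-15x saves about $1,750-$1,820.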
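A task-based router can start as a lookup table. The sketch below is one way to do it, not how Tokonomics routes internally. The `Task` tags, the `ROUTES` table, and the model strings are illustrative placeholders rather than exact provider model IDs. `routed_bill` assumes every call costs about the same, so the share of calls moved equals the share of spend moved.

```python
from enum import Enum


class Task(Enum):
    CLASSIFICATION = "classification"
    EXTRACTION = "extraction"
    SUPPORT_CHAT = "support_chat"
    COMPLEX_REASONING = "complex_reasoning"


# Placeholder model names following the routing list above.
ROUTES = {
    Task.CLASSIFICATION: "gpt-4o-mini",
    Task.EXTRACTION: "deepseek-v3",
    Task.SUPPORT_CHAT: "claude-haiku-3.5",
    Task.COMPLEX_REASONING: "gpt-4o",
}
FALLBACK_MODEL = "gpt-4o"  # untagged or unknown work stays on the frontier model


def pick_model(task):
    """Return the model for a tagged task, falling back to the frontier model."""
    return ROUTES.get(task, FALLBACK_MODEL)


def routed_bill(current_bill, routed_share, cost_reduction):
    """Bill after moving `routed_share` of spend to a model `cost_reduction`x cheaper."""
    if not 0 <= routed_share <= 1:
        raise ValueError("routed_share must be between 0 and 1")
    if cost_reduction < 1:
        raise ValueError("cost_reduction must be at least 1")
    moved = current_bill * routed_share
    return (current_bill - moved) + moved / cost_reduction
```

Defaulting unknown tasks to the frontier model is a deliberately conservative choice: a missing tag costs money, but it never degrades quality silently.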

For a step-by-step setup, see our guide to per-feature LLM cost tracking.

Move 2: Enable prompt caching

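Anthropic's prompt caching bills cache reads at 10% of the base input price, a 90% saving on cached input tokens (cache writes carry a surcharge). OpenAI's automatic caching saves 50% on cached input tokens with zero code changes. It kicks in once a prompt is long enough (1,024 tokens or more), so it just works if you're sending long, repeated system prompts (Anthropic, OpenAI, 2026).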

If your app sends the same system prompt on every call (most chatbots do), you're paying full price for tokens the provider has already processed. That's money on fire.
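To estimate what caching is worth for your own traffic, here is a rough sketch. It assumes the discount applies only to the repeated prefix (the system prompt), and it ignores cache-write surcharges and minimum cacheable prompt lengths, so treat the result as an upper bound. `input_cost_with_cache` and its parameters are hypothetical names. `cached_system_block` shows the `cache_control` content-block shape that Anthropic's Messages API uses to opt a system prompt into caching, as we understand the documentation. Verify it against the current docs before relying on it.

```python
def input_cost_with_cache(requests, prefix_tokens, dynamic_tokens,
                          price_per_m, cached_discount, hit_rate=1.0):
    """Input-token cost in USD when a shared prefix is served from cache.

    cached_discount: fraction taken off cached tokens (0.9 means 90% off).
    hit_rate: fraction of requests whose prefix is actually a cache hit.
    """
    full_price = price_per_m / 1_000_000
    # Expected price per prefix token, blending cache hits and misses.
    prefix_price = full_price * (1 - cached_discount * hit_rate)
    per_request = prefix_tokens * prefix_price + dynamic_tokens * full_price
    return requests * per_request


def cached_system_block(system_prompt):
    """System content block marked as a cache breakpoint (Anthropic-style)."""
    return {
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},
    }
```

For example, with a hypothetical 2,000-token system prompt, 300 dynamic tokens per call, and a $2.50/M input price, compare `cached_discount=0.0` against `0.5` and `0.9` to see how much of the input bill the prefix really drives.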

Our prompt caching guide for OpenAI and Anthropic covers the implementation details.

Move 3: Set hard spending caps

Even with routing and caching, bugs happen. A retry loop with no exit condition. A batch job that processes the same data twice. One night of runaway calls can undo a month of optimization.

Set a monthly budget cap that blocks API calls once you hit the limit. Not an alert you'll read at 9 AM — a hard block that stops the bleeding at 3 AM when nobody's watching.
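In code, a hard cap is a check that runs before the request is sent, not after the invoice arrives. The sketch below is a minimal in-memory version with hypothetical names (`BudgetGuard`, `guarded_call`). A production setup would need a shared store across processes and a monthly reset. The `send` callable stands in for whatever client call you make and is assumed to return the result plus its actual cost.

```python
class BudgetExceeded(RuntimeError):
    """Raised instead of sending a call that would exceed the cap."""


class BudgetGuard:
    def __init__(self, monthly_cap_usd, alert_fraction=0.8):
        self.cap = monthly_cap_usd
        self.alert_fraction = alert_fraction
        self.spent = 0.0

    def check(self, estimated_cost):
        """Block the call up front if it would push spend past the cap."""
        if self.spent + estimated_cost > self.cap:
            raise BudgetExceeded(
                f"cap ${self.cap:.2f} reached (spent ${self.spent:.2f})"
            )

    def record(self, actual_cost):
        self.spent += actual_cost

    @property
    def alerting(self):
        """True once spend crosses the alert threshold (80% by default)."""
        return self.spent >= self.alert_fraction * self.cap


def guarded_call(guard, estimated_cost, send):
    """Run `send()` only if the budget allows it, then record what it cost."""
    guard.check(estimated_cost)
    result, actual_cost = send()
    guard.record(actual_cost)
    return result
```

Raising an exception forces the caller to decide what a blocked call means (queue it, degrade gracefully, page someone) instead of letting a retry loop keep spending.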

Here's how hard spending caps work in practice.

The compounding effect

Stacking all three moves doesn't add up linearly. It compounds:

Model routing alone: 50-70% savings
Add prompt caching: another 30-50% on top (applies to remaining spend)
Add budget caps: prevents the 100% overruns that wipe out your savings

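Multiplying those ranges through, a team spending $3,250/month on all-GPT-4o traffic lands at roughly $490-$1,140/month (65-85% lower) with the same output quality. Landing in the $300-$650/month range means hitting the top of both ranges or better, for example by routing a larger share of traffic or raising the cache hit rate. Either way, that's not a rounding error. That's your next hire's monthly tooling budget.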
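To see how the moves stack, here is a small sketch. It assumes the savings multiply: caching applies only to the spend that routing leaves behind, and each percentage in the list is treated as a fraction of the spend it applies to. The function names are illustrative.

```python
def stacked_bill(bill, routing_saving, caching_saving):
    """Bill after routing savings, then caching savings on what remains."""
    return bill * (1 - routing_saving) * (1 - caching_saving)


def total_saving(routing_saving, caching_saving):
    """Combined saving as a fraction of the original bill."""
    return 1 - (1 - routing_saving) * (1 - caching_saving)


if __name__ == "__main__":
    for routing, caching in [(0.5, 0.3), (0.7, 0.5)]:
        bill = stacked_bill(3250, routing, caching)
        saving = total_saving(routing, caching)
        print(f"routing {routing:.0%}, caching {caching:.0%}: "
              f"${bill:,.2f} left ({saving:.0%} saved)")
```

Under these assumptions, the low end of both listed ranges leaves about $1,140 of a $3,250 bill and the high end leaves about $490.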

What Should You Do This Week?

You don't need to rewrite your architecture. Start here:

Audit your current model usage. Tag calls by feature and model. Find out which calls are using GPT-4o for work that GPT-4o-mini can handle. If you don't have visibility yet, Tokonomics shows this in under 5 minutes.
Pick your top 3 highest-volume call types. Run them through a cheaper model with a simple eval (does the output still pass your quality bar?). Most teams find 2 out of 3 work fine. Use our prompt cost optimizer to trim bloated prompts before switching models.
Set a budget alert at 80%. Before you optimize anything, make sure you'll know when costs spike. Here's how to set one up.
Switch one call type this week. Don't try to optimize everything at once. Move your highest-volume, lowest-complexity call to a budget model. Measure for a week. Then move the next one.
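The "simple eval" in the second step can be very small. Here is a minimal sketch for classification-style calls, assuming you have a set of prompts with reference answers (for example, outputs you already trust from the current model). The `candidate` callable stands in for a call to the cheaper model. The 95% quality bar is a hypothetical default, not a recommendation from our data.

```python
def exact_label_match(output, reference):
    """Pass check for label-style outputs: same text, ignoring case and padding."""
    return output.strip().lower() == reference.strip().lower()


def agreement_rate(examples, candidate, passes=exact_label_match):
    """Share of (prompt, reference) examples where the candidate output passes."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    passed = sum(
        1 for prompt, reference in examples
        if passes(candidate(prompt), reference)
    )
    return passed / len(examples)


def ok_to_switch(rate, quality_bar=0.95):
    """True if the cheaper model clears the quality bar on this eval set."""
    return rate >= quality_bar
```

Exact matching only suits tasks with a closed set of answers. For summaries or free-form text, pass in a different `passes` function, such as a rubric check or a human review flag.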

The goal isn't to use the cheapest model everywhere. It's to use the right model for each task — and stop paying frontier prices for work that doesn't need it.

Frequently Asked Questions

How much cheaper is DeepSeek V3 compared to GPT-4o?

DeepSeek V3 costs $0.14 per million input tokens vs GPT-4o's $2.50. That's about 18x cheaper on input and 36x cheaper on output ($0.28 vs $10.00). For 1 million requests a month at 500 input + 200 output tokens each, GPT-4o costs $3,250 vs DeepSeek's $126. The quality gap matters for complex reasoning but is negligible for classification and extraction tasks.

What percentage of LLM API calls can use a cheaper model?

In 2026, multiple production operators report that 60-70% of API calls in typical SaaS applications are simple enough for budget models (Prem AI, 2026). These include classification, structured data extraction, short summaries, and sentiment analysis. The remaining 30-40% still benefit from frontier models like GPT-4o or Claude Sonnet 4.

How much can prompt caching save on LLM costs?

Anthropic's prompt caching saves 90% on cached input tokens. OpenAI's automatic caching saves 50% with zero code changes (Anthropic, OpenAI, 2026). Combined with model routing, caching can reduce total LLM spend by 80-95%.

What is the average monthly LLM API spend for companies in 2025?

Average monthly AI spend reached $85,500 per company — a 36% increase from the year before (CloudZero, 2025). And 45% of organizations plan to spend over $100,000/month on AI, more than double the rate in 2024.

How do I track which LLM model each API call uses?

Use a metering proxy like Tokonomics that sits between your app and the LLM provider. Tag each call with metadata using a custom header. The proxy records model, token count, and cost per call, giving you a per-feature breakdown with no changes to your application logic beyond the added header. Here's how to set it up in 5 minutes.
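Whatever tool does the metering, the data it needs per call is small. The sketch below shows a hypothetical record shape and the per-feature rollup you would build from it. It is an illustration of the idea, not Tokonomics' schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    feature: str        # the tag your app attaches, e.g. "faq-bot"
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


def cost_by_feature_and_model(records):
    """Total spend keyed by (feature, model)."""
    totals = defaultdict(float)
    for record in records:
        totals[(record.feature, record.model)] += record.cost_usd
    return dict(totals)
```

Sorting that rollup by spend is usually enough to spot the high-volume, low-complexity features still running on a frontier model.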

All sources retrieved June 2026. Pricing data verified against official provider pricing pages.

Sources:

Stack Overflow, 2025 Developer Survey, retrieved 2026-06-08
CloudZero, State of AI Costs 2025, retrieved 2026-06-08
CloudZero, LLM API Pricing Comparison 2026, retrieved 2026-06-08
Epoch AI, LLM Inference Price Trends, retrieved 2026-06-08
Prem AI, LLM Cost Optimization 8 Strategies, retrieved 2026-06-08
EditorialGE, Why Founders Overpay for LLMs, retrieved 2026-06-08
Divyam.AI, The Hidden Cost of LLMflation, retrieved 2026-06-08
BuildMVP Fast, Why AI Agents Are So Expensive, retrieved 2026-06-08
OpenAI, Prompt Caching Guide, retrieved 2026-06-08
Anthropic, Prompt Caching Documentation, retrieved 2026-06-08