June 6, 2026 · 8 min read

How to Audit Your LLM API Spending Monthly

TL;DR — Run a 30-minute monthly audit: pull raw numbers, compare month-over-month, break down by model and feature, flag zombie endpoints, check for model mismatches. Tools: OpenAI usage dashboard, Anthropic console, or Tokonomics. Most teams find 20–40% of spend is wasted on the first audit.

Teams that don't audit their LLM spending monthly have no idea where their money goes. They know the total — OpenAI charged $2,800 last month — but they can't tell you which feature consumed 60% of that, which model was used for tasks a cheaper model handles fine, or why spending jumped 40% from April to May.

A monthly audit takes 30 minutes. It catches three things: cost leaks (requests you didn't know were happening), model inefficiencies (expensive models used for cheap tasks), and trend shifts (gradual cost increases that compound into budget problems). Skip it, and you're managing AI costs by invoice surprise.

Here's the exact process.

Step 1: Pull the raw numbers (5 minutes)

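Start with the basics. You need four numbers for the current month and the previous month: total spend, total requests, input tokens, and output tokens. These are the raw metrics compared in Step 2.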

Where to find them: the OpenAI usage dashboard and the Anthropic console, each covering its own provider.

If you're routing through a proxy like Tokonomics, all of this is in one dashboard regardless of how many providers you use. One view, all providers, already broken down by model, API key, and custom tags.

Write these numbers down. You'll compare them against last month.

Step 2: Month-over-month comparison (5 minutes)

Calculate the change for each metric:

| Metric | Last month | This month | Change |
| --- | --- | --- | --- |
| Total spend | $2,100 | $2,800 | +33% |
| Total requests | 185,000 | 210,000 | +14% |
| Avg cost/request | $0.0114 | $0.0133 | +17% |
| Input tokens | 92M | 118M | +28% |
| Output tokens | 31M | 38M | +23% |

The numbers that matter most are the ratios, not the totals.

Requests grew 14% but spend grew 33%. That means each request got more expensive. Why? Either you're using more expensive models, your prompts got longer, or output is longer. Dig into which.

Input tokens grew faster than requests. This usually means system prompts grew, conversation history is accumulating more turns, or RAG is retrieving more chunks per query. Each of these is fixable — see our cost optimization strategies guide.
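One way to dig in is to normalize the token totals by request count. The sketch below uses only the figures from the table above; the function and variable names are illustrative and not part of any provider's API.

```python
# Totals from the month-over-month table: (last month, this month).
requests = (185_000, 210_000)
input_tokens = (92_000_000, 118_000_000)
output_tokens = (31_000_000, 38_000_000)


def per_request(totals, request_counts):
    """Average tokens per request for each month."""
    return tuple(t / r for t, r in zip(totals, request_counts))


def pct_change(previous, current):
    """Percentage change from previous to current."""
    return (current - previous) / previous * 100


input_per_req = per_request(input_tokens, requests)
output_per_req = per_request(output_tokens, requests)

input_growth = pct_change(*input_per_req)
output_growth = pct_change(*output_per_req)

print(f"input tokens/request:  {input_per_req[0]:.0f} -> {input_per_req[1]:.0f} ({input_growth:+.1f}%)")
print(f"output tokens/request: {output_per_req[0]:.0f} -> {output_per_req[1]:.0f} ({output_growth:+.1f}%)")
```

With these figures, input tokens per request rise about 13% and output tokens per request about 8%, both less than the 17% rise in cost per request. Under the simplifying assumption that growth was spread evenly across models, longer prompts and outputs alone would not explain the full increase, which points toward a shift in model mix as well.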

A healthy baseline. If your spend growth is within 5 percentage points of your request growth, your cost per request is stable. That's the goal. If spend is growing faster than requests, something changed, so find it.
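The comparison is simple enough to script. The sketch below recomputes the Change column from the table's raw figures and applies the baseline check, treating the 5% tolerance as a difference in percentage points (an interpretation, not something the dashboards define). The metric names are made up for illustration.

```python
# Figures from the comparison table: (last month, this month).
metrics = {
    "total_spend_usd": (2100.0, 2800.0),
    "total_requests": (185_000, 210_000),
    "input_tokens": (92_000_000, 118_000_000),
    "output_tokens": (31_000_000, 38_000_000),
}


def pct_change(previous, current):
    """Percentage change from previous to current."""
    return (current - previous) / previous * 100


changes = {name: pct_change(prev, cur) for name, (prev, cur) in metrics.items()}

# Average cost per request is derived from spend and requests, not pulled directly.
cost_per_request = tuple(
    spend / reqs
    for spend, reqs in zip(metrics["total_spend_usd"], metrics["total_requests"])
)
changes["avg_cost_per_request"] = pct_change(*cost_per_request)

# Baseline check: is spend growth within 5 percentage points of request growth?
gap = changes["total_spend_usd"] - changes["total_requests"]
cost_is_stable = abs(gap) <= 5.0

for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
print(f"spend growth minus request growth: {gap:+.1f} points (stable: {cost_is_stable})")
```

For the example month the gap is far outside the tolerance, so the check fails and the rest of the audit is about finding out why.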

Step 3: Model mix analysis (5 minutes)

Break down spend by model. This is where the most money hides.

| Model | Requests | Spend | Avg cost/req |
| --- | --- | --- | --- |
| GPT-4o | 45,000 | $1,890 | $0.042 |
| GPT-4o-mini | 140,000 | $546 | $0.0039 |
| Claude Sonnet 4 | 25,000 | $364 | $0.0146 |

The audit question: For each model, does the task require that model's capability?

Look at your GPT-4o usage. Those 45,000 calls cost $1,890. If even 50% of them could be handled by GPT-4o-mini (classification, simple summarization, data extraction), switching saves roughly $860/month: 22,500 calls at an average of $0.0039 instead of $0.042 each.

This is the single highest-ROI finding in most audits. Teams default to their best model for everything and never revisit. A model-mix audit done once saves money every month going forward.
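To size a potential switch before committing to it, a rough estimate like the one below is usually enough. It assumes the migrated calls would cost the cheaper model's current average per request, which may not hold if those calls are longer or shorter than its existing workload. The function name is illustrative.

```python
def switch_savings(requests, avg_cost_from, avg_cost_to, share):
    """Estimated monthly savings from moving a share of requests to a cheaper model."""
    moved_requests = requests * share
    return moved_requests * (avg_cost_from - avg_cost_to)


# Per-request averages derived from the model-mix table (spend / requests).
gpt4o_avg = 1890 / 45_000
gpt4o_mini_avg = 546 / 140_000

savings = switch_savings(
    requests=45_000,
    avg_cost_from=gpt4o_avg,
    avg_cost_to=gpt4o_mini_avg,
    share=0.5,
)
print(f"estimated savings: ${savings:,.2f}/month")
```

The result lands in the same range as the savings figure quoted above. Rerunning it with a more cautious share, such as 0.25, gives a floor for the estimate.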

For a detailed guide on which model fits which task, see our cheapest LLM for each use case breakdown.

Step 4: Find the zombie endpoints (5 minutes)

Zombie endpoints are API calls that still run but no longer serve a purpose. They're surprisingly common:

How to find them: sort your API calls by endpoint or by API key. Look for keys with high volume but no corresponding product feature. Look for consistent, robotic patterns — exactly 1 request per minute is a cron job, not a user.

If you're tagging requests by feature (X-Metering-Tags: {"feature":"chatbot"}), zombies show up immediately as features with spend but no product value.
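Both checks can be approximated in a few lines if you can export request timestamps per API key and spend per feature tag. This is a sketch under those assumptions: the sample data, the 5% regularity threshold, and the list of active features are all hypothetical and would come from your own logs and product roadmap.

```python
from statistics import mean, pstdev


def looks_robotic(timestamps, max_cv=0.05, min_requests=10):
    """True if gaps between requests are nearly constant, as with a cron job.

    timestamps are request times in seconds. max_cv is the largest allowed
    ratio of standard deviation to mean gap (coefficient of variation).
    """
    if len(timestamps) < min_requests:
        return False
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        return False  # identical timestamps: the ratio is undefined
    return pstdev(gaps) / average_gap <= max_cv


# Hypothetical samples: one key fires every 60 seconds, the other is bursty.
cron_timestamps = [i * 60 for i in range(60)]
user_timestamps = [0, 5, 7, 300, 310, 900, 905, 2000, 2600, 2610, 4000]

print("cron-like key robotic:", looks_robotic(cron_timestamps))
print("user-driven key robotic:", looks_robotic(user_timestamps))

# Hypothetical spend per feature tag, compared against features still in the product.
spend_by_feature = {"chatbot": 1200.0, "summarizer": 640.0, "beta-search": 310.0}
active_features = {"chatbot", "summarizer"}

zombies = {
    feature: spend
    for feature, spend in spend_by_feature.items()
    if feature not in active_features
}
print("zombie candidates:", zombies)
```

A regular pattern is only a candidate, not a verdict: some scheduled jobs are still useful, so confirm with the owning team before turning anything off.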

Step 5: Check your prompt efficiency (5 minutes)

Pull your average input token count per call. If it's climbing month over month, one of these is happening:

System prompt creep. Someone added "be more detailed" instructions, few-shot examples, or reference data to the system prompt. System prompts get longer over time because adding feels free — but every token is billed on every call. A 4,000-token system prompt costs $0.01 per GPT-4o call. Audit it. Cut what doesn't measurably improve output quality.
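To see what a system prompt costs over a month, multiply it out. The sketch below uses an input price of $2.50 per million tokens, which is the rate implied by the $0.01 per 4,000 tokens figure above; check your provider's current pricing before relying on it. The call volume is the GPT-4o count from Step 3, and applying the same prompt to every call is an assumption made for illustration.

```python
# Implied by the article's figure of $0.01 per 4,000 input tokens; verify against current pricing.
INPUT_USD_PER_MILLION_TOKENS = 2.50


def system_prompt_cost(prompt_tokens, calls, usd_per_million=INPUT_USD_PER_MILLION_TOKENS):
    """Input cost attributable to the system prompt across a number of calls."""
    return prompt_tokens * calls * usd_per_million / 1_000_000


per_call = system_prompt_cost(prompt_tokens=4_000, calls=1)
monthly = system_prompt_cost(prompt_tokens=4_000, calls=45_000)
monthly_after_trim = system_prompt_cost(prompt_tokens=2_500, calls=45_000)

print(f"per call: ${per_call:.4f}")
print(f"per month at 45,000 calls: ${monthly:,.2f}")
print(f"saved by trimming to 2,500 tokens: ${monthly - monthly_after_trim:,.2f}")
```

The calculation ignores prompt caching discounts, so treat it as an upper bound when caching is enabled.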

Conversation history bloat. Multi-turn conversations resend all prior messages. If your average conversation is growing from 3 turns to 5 turns, your input tokens per conversation roughly double. Solutions: summarize old turns, limit history window to the last N messages, or use prompt caching to reduce rebilling of repeated context.
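The effect is easy to model. The sketch below is a simplification: it assumes every turn is the same size, counts input tokens only, and resends the full history on each call unless a window is set. The function name and token counts are illustrative.

```python
def cumulative_input_tokens(turns, tokens_per_turn, system_tokens=0, window=None):
    """Total input tokens billed over a conversation that resends prior turns.

    On call number k the request carries the system prompt plus k turns of
    history (or at most `window` turns when a history window is set).
    """
    total = 0
    for call in range(1, turns + 1):
        resent_turns = call if window is None else min(call, window)
        total += system_tokens + resent_turns * tokens_per_turn
    return total


three_turns = cumulative_input_tokens(turns=3, tokens_per_turn=200, system_tokens=1_000)
five_turns = cumulative_input_tokens(turns=5, tokens_per_turn=200, system_tokens=1_000)
five_turns_windowed = cumulative_input_tokens(
    turns=5, tokens_per_turn=200, system_tokens=1_000, window=3
)

print(f"3 turns: {three_turns} input tokens")
print(f"5 turns: {five_turns} input tokens ({five_turns / three_turns:.2f}x)")
print(f"5 turns, last 3 kept: {five_turns_windowed} input tokens")
```

In this model the multiplier for going from 3 to 5 turns falls between about 1.7x (when the system prompt dominates) and 2.5x (with no system prompt), which is where "roughly double" comes from. A history window flattens the growth once conversations exceed the window length.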

RAG chunk inflation. Your retrieval pipeline is returning more chunks per query — either because the index grew or because similarity thresholds were lowered. More retrieved context means more input tokens. Check if output quality actually improves with the extra chunks.

Step 6: Set targets for next month (5 minutes)

Based on what you found, set 1-3 concrete targets:

Write them down. Review them in next month's audit. This creates accountability. Without specific targets, audits become "interesting but not actionable."

The audit checklist

Here's the complete checklist in one view. Bookmark this for your monthly review:

Numbers pull (5 min)

Trend analysis (5 min)

Model audit (5 min)

Zombie hunt (5 min)

Prompt check (5 min)

Action items (5 min)

Automating the boring parts

The audit steps above are manual. They work, but they rely on someone actually doing them every month. The more you can automate, the more consistently it happens.

Budget alerts handle the biggest risk: spending more than you planned. Set alerts at 50%, 80%, and 100% of your monthly budget. If your budget is $3,000/month, an 80% alert fires at $2,400 and, at a steady burn rate, gives you about a week to investigate before you hit the cap.
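If your provider or proxy does not offer alerts at these levels, the logic is small enough to run yourself against a daily spend export. The sketch below is a minimal version with illustrative function names; the days-remaining estimate assumes spend accrues at a constant daily rate, which real traffic rarely does.

```python
ALERT_PERCENTAGES = (50, 80, 100)


def alert_amounts(monthly_budget_usd, percentages=ALERT_PERCENTAGES):
    """Dollar amount at which each alert level fires."""
    return {pct: monthly_budget_usd * pct / 100 for pct in percentages}


def fired_alerts(spend_to_date_usd, monthly_budget_usd, percentages=ALERT_PERCENTAGES):
    """Alert levels already crossed by month-to-date spend."""
    amounts = alert_amounts(monthly_budget_usd, percentages)
    return [pct for pct, amount in amounts.items() if spend_to_date_usd >= amount]


def days_until_cap(spend_to_date_usd, monthly_budget_usd, day_of_month):
    """Days left before the budget runs out, assuming a constant daily burn rate."""
    if spend_to_date_usd <= 0:
        return float("inf")
    daily_rate = spend_to_date_usd / day_of_month
    return (monthly_budget_usd - spend_to_date_usd) / daily_rate


print(alert_amounts(3000))
print(fired_alerts(spend_to_date_usd=2400, monthly_budget_usd=3000))
print(days_until_cap(spend_to_date_usd=2400, monthly_budget_usd=3000, day_of_month=24))
```

For a $3,000 budget that reaches $2,400 on day 24, the estimate is six days of headroom, roughly the week mentioned above. If the 80% alert fires on day 12 instead, the same function shows the cap arriving in three days.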

Hard spending caps handle the catastrophic case — a bug that sends 100x normal volume. Caps automatically block requests when your budget is exceeded. No human intervention needed.
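Conceptually, a cap is a check that runs before every request. The class below is a hypothetical in-memory illustration of that check, not how any particular provider or proxy implements it. A production cap needs shared state across processes and should be enforced outside the application that might contain the bug.

```python
class BudgetExceeded(Exception):
    """Raised when a request would push spend past the monthly cap."""


class SpendingCap:
    """In-memory monthly cap; a hypothetical stand-in for a proxy-level control."""

    def __init__(self, monthly_budget_usd: float) -> None:
        self.monthly_budget_usd = monthly_budget_usd
        self.spent_usd = 0.0

    def authorize(self, estimated_cost_usd: float) -> None:
        """Block the request if its estimated cost would exceed the cap."""
        if self.spent_usd + estimated_cost_usd > self.monthly_budget_usd:
            raise BudgetExceeded(
                f"cap of ${self.monthly_budget_usd:,.2f} reached "
                f"(${self.spent_usd:,.2f} already spent)"
            )

    def record(self, actual_cost_usd: float) -> None:
        """Add the billed cost of a completed request to the running total."""
        self.spent_usd += actual_cost_usd


cap = SpendingCap(monthly_budget_usd=3000.0)
cap.record(2999.99)   # spend so far this month
cap.authorize(0.005)  # fits under the cap, so no exception is raised

try:
    cap.authorize(0.05)  # would push the total past $3,000
    blocked = False
except BudgetExceeded:
    blocked = True
print("blocked:", blocked)
```

Calling `authorize` before each API call and `record` after it keeps the running total honest even when estimates are off.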

For the model-mix and per-feature analysis, Tokonomics does this automatically. Every API call is logged with the model, cost, tokens, and any custom tags you attach. The dashboard shows spend by model, by API key, and by feature — the same breakdowns you'd compute manually, updated in real time. The analytics endpoints also support programmatic access, so you can build your own alerting on top.

What a healthy audit looks like

After three months of auditing, you should see:

  1. Cost per request is stable or declining. You're optimizing model selection and prompt efficiency.
  2. No zombie endpoints. Every API key and feature tag maps to an active product feature.
  3. Model mix is intentional. You know why each model is used and what it would cost to switch.
  4. Budget alerts are set. You hear about cost problems before the invoice, not after.
  5. Targets are tracked. Last month's savings targets were implemented and measured.

The teams that do this well treat LLM spending like any other infrastructure cost — measured, budgeted, and optimized continuously. The teams that don't are the ones who post on Reddit asking why their AI bill surprised them.

Thirty minutes a month. That's the difference.

Last updated June 2026. All sources retrieved June 2026.

About the author
Zouhair is the founder of Tokonomics. He built the platform after receiving a $47,000 LLM invoice that his team didn't see coming. He tracks LLM pricing changes weekly across all major providers.