How Many Tokens Does Your AI Workflow Actually Use?

Q: Can I reduce token usage without changing my workflow logic?

Yes. Three quick wins: set explicit max_tokens on every LLM call (prevents runaway outputs), use prompt caching for repeated system prompts (50-90% input discount), and switch intermediate steps to cheaper models like GPT-4o-mini. These changes alone can cut costs by 40-70% without touching your workflow structure.

Q: How do embedding tokens compare to completion tokens in cost?

Embedding tokens are dramatically cheaper. OpenAI's text-embedding-3-small costs $0.02/M tokens versus GPT-4o's $2.50/M input tokens, a 125x difference. In a RAG pipeline, the embedding cost is typically less than 1% of the total. The completion step dominates.

TL;DR: Most AI workflows use 2x-10x more tokens than builders expect. A simple summarization step consumes 2,000-4,000 input tokens per call. A RAG pipeline with retrieval and generation can hit 8,000-15,000 tokens per run. Multiply by daily volume and you've got a bill nobody budgeted for.

You built an automation that works. It classifies emails, summarizes documents, generates responses. It runs in n8n, Make, or Zapier without a hitch. But here's the question nobody asks during the build phase: how many tokens is this thing actually burning?

The answer matters more than you think. According to a 2025 arvelopersLM developer survey by Retool, 62% of teams using LLMs in production exceeded their projected API costs within the first three months. The gap between "it works" and "it works affordably" is almost always a token estimation problem.

This guide breaks down token usage for five common workflow patterns, gives you real numbers to estimate with, and shows you where the hidden multipliers live. If you want to run the math before deploying, our cost estimation formulas give you a framework for projecting monthly spend from token counts.

Key Takeaways

Summarization workflows are input-heavy: 80%+ of tokens come from the source document

Classification tasks are the cheapest pattern, often under 500 tokens per call

RAG pipelines have a hidden cost: embedding tokens plus completion tokens per query

Multi-step chains multiply token usage by the number of steps, not linearly but compoundingly

According to OpenAI's tokenizer documentation, 1 token averages about 4 characters in English

Why Do Most Builders Underestimate Token Usage?

Builders underestimate tokens because they think about prompts, not context. A 2025 survey by Weights & Biases found that 71% of teams had no token monitoring in place before hitting an unexpected bill. The system prompt alone, often invisible in no-code interfaces, can account for 30-50% of total input tokens per call.

There are three reasons this happens consistently.

Hidden system prompts. n8n's OpenAI node, Make's GPT module, and Zapier's AI actions all inject system prompts you don't see. These range from 100 to 800 tokens depending on the platform and configuration. Every single run pays for those tokens.

Context accumulation. If your workflow passes conversation history or previous step outputs, the input grows with each step. A 3-step chain doesn't use 3x the tokens of a single step. It uses more like 5-6x, because each step inherits the context of previous steps.

Output variability. You can set max_tokens, but most builders don't. Without a cap, GPT-4o might return 50 tokens for one input and 800 for another. Over thousands of runs, that variability adds up fast.

We've seen workflows where the builder expected $15/month and the actual bill was $190. The culprit was always the same: no visibility into per-step token counts.

Token accumulation across a multi-step AI workflow

How Many Tokens Does a Summarization Workflow Use?

Summarization is the most input-heavy pattern. According to OpenAI's cookbook, a typical 1,500-word document converts to roughly 2,000 tokens, and the summary output averages 150-300 tokens. That makes summarization about 85-90% input cost.

Here's what a standard document summarization step looks like in practice:

Token breakdown for summarization

Component	Tokens (typical)
System prompt	150-400
Source document (1,500 words)	~2,000
User instruction ("Summarize this...")	20-50
Total input	2,170-2,450
Output (summary)	150-300
Total per call	2,320-2,750

At GPT-4o pricing ($2.50/M input, $10.00/M output), one summarization call costs about $0.008. Run it 500 times per day and you're at $120/month. Run it on GPT-4o-mini instead? That drops to roughly $6/month.

The model choice matters enormously for summarization because the input-to-output ratio is so skewed. You're paying mostly for reading, not writing.

Citation capsule: A 1,500-word document summarization on GPT-4o consumes approximately 2,400 input tokens and 200 output tokens per call, costing $0.008 per run according to OpenAI's published pricing as of June 2026.

When summarization gets expensive

The real danger is when you're summarizing long documents. A 10-page PDF might be 5,000-8,000 words, which is 7,000-11,000 tokens. Now each call costs $0.02-0.03 on GPT-4o. At scale, that's the difference between a $50 bill and a $500 bill. Check our model pricing comparison for up-to-date rates across all major providers.

What's the Token Cost of Content Generation Workflows?

Content generation flips the ratio. According to Anthropic's token estimation guide, output tokens cost 3-5x more than input tokens across most providers, making generation workflows the most expensive per-token pattern.

A typical content generation step, say writing a product description or email draft, looks like this:

Token breakdown for generation

Component	Tokens (typical)
System prompt + brand guidelines	300-800
User instruction + context	200-500
Total input	500-1,300
Generated content	500-2,000
Total per call	1,000-3,300

The input is light. The output is where you pay. A 1,000-token output on GPT-4o costs $0.01 by itself. That's 4x what the input costs on the same call.

How to control generation costs

Three things actually work here. First, set max_tokens explicitly. If you need a 200-word product description, cap the output at 400 tokens. Second, use a cheaper model for drafts and a better model for final versions. Third, consider prompt caching if your system prompt is identical across calls, because OpenAI and Anthropic both offer 50-90% discounts on cached input tokens.

Content generation is the one pattern where output pricing dominates. Most cost optimization advice focuses on trimming input. For generation workflows, the bigger win is constraining output length and choosing the right model tier.

Input vs output token costs per call across three workflow patterns at GPT-4o pricing. Summarization is input-heavy, generation is output-heavy, and classification costs almost nothing.

How Cheap Is Classification Compared to Other Patterns?

Classification is the budget-friendly workhorse. A 2025 benchmark by Humanloop showed that simple classification tasks (sentiment, category, intent) average just 200-500 total tokens per call, making them 5-10x cheaper than summarization or generation.

Here's the typical breakdown:

Token breakdown for classification

Component	Tokens (typical)
System prompt + categories	100-300
Input text to classify	50-200
Total input	150-500
Output (label + confidence)	5-30
Total per call	155-530

At GPT-4o-mini pricing ($0.15/M input, $0.60/M output), a classification call costs about $0.00005. Yes, five-thousandths of a cent. You could run 200,000 classification calls for $10.

This is why classification is the one workflow pattern where model costs rarely matter. Even at high volume, it's almost free. The real cost question for classification is latency, not tokens.

But there's a catch. If you're stuffing 20 category descriptions with examples into the system prompt, your "simple" classifier might be using 2,000 input tokens. Always check.

Citation capsule: Simple classification tasks consume 200-500 total tokens per call and cost approximately $0.00005 on GPT-4o-mini, according to OpenAI's pricing as of June 2026, making them 10-50x cheaper than summarization workflows.

How Many Tokens Does a RAG Pipeline Actually Consume?

RAG (retrieval-augmented generation) is the most deceptive pattern for token estimation. According to LlamaIndex documentation, a typical RAG query retrieves 3-5 chunks of 500 tokens each, adding 1,500-2,500 tokens to every completion call on top of the query itself. And that's before you count the embedding cost.

RAG has two token costs that most builders track separately, if they track them at all.

Cost 1: Embedding tokens

Every document you index gets converted to embeddings. Every user query also gets embedded for similarity search. With OpenAI's text-embedding-3-small at $0.02/M tokens, embedding is cheap per call. But indexing a 100-page knowledge base (roughly 150,000 tokens) costs $3.00 upfront, and every query adds $0.000002 for query embedding.

Cost 2: Completion tokens

This is where RAG gets expensive. The retrieved chunks get stuffed into the context window along with the user's question and the system prompt.

Component	Tokens (typical)
System prompt	200-500
Retrieved chunks (3-5 x 500)	1,500-2,500
User query	30-100
Total input	1,730-3,100
Generated answer	200-600
Total per call	1,930-3,700

On GPT-4o, that's $0.006-0.010 per query. At 1,000 queries per day, you're looking at $180-300/month, and that's a single RAG endpoint.

In our own testing, we measured a support chatbot RAG pipeline that used an average of 3,200 tokens per query. The builder estimated 800 tokens because they only counted the user's question. One way to cut these costs dramatically is caching retrieved chunks, which gives you 50-90% discounts on repeated context across queries.

What Happens to Token Usage in Multi-Step Chains?

Multi-step chains are the biggest budget risk. A 2025 analysis by LangSmith showed that chains with 4+ steps averaged 12,000-18,000 total tokens per execution, with the final step consuming 40-60% of the total due to accumulated context.

Here's why: each step typically receives the output of the previous step as part of its input. Token usage doesn't just add up. It compounds.

Example: 4-step content pipeline

Step	Input tokens	Output tokens	Cumulative
1. Research summary	2,500	400	2,900
2. Outline from summary	800	300	4,000
3. Draft from outline	1,200	1,500	6,700
4. Edit + polish	2,000	1,200	9,900
Total	6,500	3,400	9,900

That's nearly 10,000 tokens for one execution. On GPT-4o, about $0.05 per run. Run it 200 times per day and you're at $300/month from a single workflow.

How to reduce chain token usage

The most effective technique is trimming context between steps. Don't pass the full output of step 1 into step 2 if step 2 only needs a subset. Extract what's needed, discard the rest. Some builders call this "context windowing."

Another approach: use cheaper models for intermediate steps and reserve GPT-4o or Claude Sonnet for the final step where quality matters most. Step 1 (summarization) and step 2 (outlining) often work fine on GPT-4o-mini at 1/30th the cost. To see exactly where your tokens go, set up tracking per-step costs in n8n or tracking per-step costs in Make with per-workflow tagging.

Citation capsule: Multi-step AI chains with 4+ steps average 12,000-18,000 total tokens per execution, with the final step consuming 40-60% of the total due to accumulated context, according to LangSmith's 2025 chain analysis.

A Practical Token Estimation Worksheet

Before you deploy any AI workflow, run it through this estimation process. According to Google Cloud's AI cost management guide, teams that estimate token costs before deployment reduce overruns by 40% compared to those who monitor reactively.

Step 1: Map every LLM call

List every step in your workflow that touches an LLM. Include hidden calls like embedding lookups and moderation checks. Most no-code platforms make it easy to miss these.

Step 2: Estimate per-call tokens

Use the pattern breakdowns above. For each step, note:

System prompt length (in tokens)
Input data size (paste into our free LLM token counter or OpenAI's tokenizer to count)
Expected output length
Whether context from previous steps is included

Step 3: Multiply by volume

Your daily or monthly execution count is the biggest multiplier. A workflow that costs $0.01 per run at 50 runs/day is $15/month. At 5,000 runs/day, it's $1,500/month. Same workflow, same tokens, very different bill.

Step 4: Pick the right model per step

Not every step needs your most capable model. Here's a rough guide:

Task	Recommended model	Cost per 1M input tokens
Classification	GPT-4o-mini	$0.15
Summarization	GPT-4o-mini or Claude Haiku	$0.15-0.80
Generation (quality)	GPT-4o or Claude Sonnet	$2.50-3.00
Embedding	text-embedding-3-small	$0.02

Step 5: Add a 30% buffer

Real-world usage always exceeds estimates. Output variability, retries, edge cases with longer inputs. Add 30% to your estimate and you'll be closer to the actual number. For a deeper walkthrough, our full cost estimation guide covers budgeting formulas for every workflow pattern.

Calculator interface for estimating token costs

How Can You Track Actual Token Usage Across Workflows?

Estimation gets you in the ballpark. Measurement tells you what's really happening. According to Datadog's 2025 State of AI Observability report, only 28% of teams using LLMs in production have per-request token visibility, meaning 72% are flying blind.

The simplest approach for no-code builders is a metering proxy. Instead of pointing your AI nodes directly at OpenAI or Anthropic, you route them through a proxy that records every call's token count, cost, and latency. No code changes. One URL swap.

This is what Tokonomics was built for. You change the base URL in your n8n, Make, or Zapier configuration, and every LLM call gets metered automatically. You can see cost per workflow, per model, per day, and set budget alerts before the bill surprises you. Here's how to set it up: getting started with metering.

Frequently Asked Questions

How many tokens is 1,000 words?

Roughly 1,300-1,500 tokens in English, according to OpenAI's tokenizer documentation. The exact count depends on vocabulary complexity. Technical writing with specialized terms tends to tokenize less efficiently, sometimes reaching 1,600-1,800 tokens per 1,000 words.

Do system prompts count toward token usage?

Yes, and they count on every single call. A 500-token system prompt running 1,000 times per day costs the same as processing 500,000 input tokens daily. On GPT-4o, that's $1.25/day just for the system prompt. This is why prompt caching matters.

Which workflow pattern uses the most tokens?

Multi-step chains with context passing consistently use the most tokens, averaging 12,000-18,000 per execution according to LangSmith's analysis. RAG pipelines are second, typically 2,000-4,000 per query. Classification is the cheapest at 200-500 tokens per call.

Can I reduce token usage without changing my workflow logic?

Yes. Three quick wins: set explicit max_tokens on every LLM call (prevents runaway outputs), use prompt caching for repeated system prompts (50-90% input discount), and switch intermediate steps to cheaper models like GPT-4o-mini. These changes alone can cut costs by 40-70% without touching your workflow structure.

How do embedding tokens compare to completion tokens in cost?

Embedding tokens are dramatically cheaper. OpenAI's text-embedding-3-small costs $0.02/M tokens versus GPT-4o's $2.50/M input tokens, a 125x difference. In a RAG pipeline, the embedding cost is typically less than 1% of the total. The completion step dominates.

Conclusion

Token usage varies wildly across workflow patterns. Classification barely registers on the bill. Summarization and RAG are input-heavy. Content generation is output-heavy. Multi-step chains compound costs at each step. The gap between expected and actual token usage is almost always a surprise, and rarely a pleasant one.

The fix isn't complicated. Map your workflow steps, estimate tokens using the patterns above, multiply by your daily volume, and add a 30% buffer. Then measure what actually happens in production. The teams that track per-call token usage don't get surprised by their AI bills.

If you're running AI workflows in n8n, Make, or Zapier, start by checking your actual token usage against your estimates. The gap will tell you exactly where to optimize.

All sources retrieved June 2026.