TL;DR: Fine-tuning GPT-4o-mini costs $3/1M training tokens (one-time), then 2x the base inference price on every call. Fine-tune only if (1) you need a consistent format or style that prompts can't achieve, or (2) you can replace a large model with a fine-tuned small one. The break-even is roughly 500K inference calls/month, though the exact figure depends on tokens per call and engineering cost.
Fine-tuning has three cost components that most developers don't fully account for: the one-time training cost, the permanently higher inference cost, and the ongoing experimentation cost. Understanding all three before you start determines whether fine-tuning saves you money or becomes an expensive detour.
The short version: fine-tuning rarely saves money by itself. Its value is in getting a smaller, cheaper model to perform like a bigger, more expensive one. When that works, you save money on inference. When it doesn't, you've spent $30-$500 on training and weeks of engineering time with nothing to show for it.
Fine-tuning pricing by provider
OpenAI
| Cost type | GPT-4o-mini | GPT-4o |
|---|---|---|
| Training (per 1M tokens) | $3.00 | $25.00 |
| Inference input (per 1M) | $0.30 (2x base) | $3.75 (1.5x base) |
| Inference output (per 1M) | $1.20 (2x base) | $15.00 (1.5x base) |
| Storage | Free (first 2 models) | Free (first 2 models) |
Key detail: fine-tuned model inference costs more than the base model. GPT-4o-mini goes from $0.15 to $0.30 per million input tokens — a 2x increase.
Anthropic
Anthropic doesn't offer public fine-tuning as of June 2026. Enterprise customers can access it through custom agreements.
Open-source (self-hosted)
Fine-tuning Llama 3.3 or Mistral is free in terms of training API cost — you're running on your own hardware. But the GPU time isn't free:
| Method | GPU required | Training time (1,000 examples) | Approximate cost |
|---|---|---|---|
| Full fine-tune (70B) | 4x A100 80GB | 4-8 hours | $40-$80 (cloud GPU) |
| LoRA (70B) | 1x A100 80GB | 2-4 hours | $8-$16 (cloud GPU) |
| QLoRA (70B) | 1x A100 40GB | 2-3 hours | $5-$12 (cloud GPU) |
| LoRA (7B-8B) | 1x A10G 24GB | 1-2 hours | $2-$5 (cloud GPU) |
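LoRA and QLoRA are parameter-efficient methods: they keep the base weights frozen and train small low-rank adapter matrices instead, and QLoRA additionally quantizes the frozen base model to 4-bit to cut memory. In the table above they cost roughly 5-10x less than full fine-tuning, and they produce comparable results for most tasks.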
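To see why adapter methods are cheaper, it can help to count trainable parameters. The sketch below assumes the standard LoRA setup, where a frozen weight matrix of shape d_out x d_in gets two small trainable factors of rank r. The function names and the 4096 x 4096 layer size are illustrative assumptions, not taken from any specific model.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA adapter of the given rank.

    The original matrix stays frozen; only two factors are trained:
    B (d_out x rank) and A (rank x d_in).
    """
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d_out, d_in, rank = 4096, 4096, 8  # illustrative sizes (assumption)
    full = full_params(d_out, d_in)
    lora = lora_params(d_out, d_in, rank)
    print(f"full: {full:,}  lora: {lora:,}  share: {lora / full:.4%}")
```

Trainable-parameter count is not the same as cost: the frozen weights still have to sit in GPU memory for every forward pass, which is one reason the cost gap in the table is far smaller than the parameter gap.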
The three cost buckets
1. Training cost (one-time, per experiment)
Training cost depends on your dataset size and how many epochs (passes through the data) you run.
Formula:
Training cost = tokens_in_dataset × num_epochs × cost_per_training_token
Example — fine-tuning GPT-4o-mini:
- Dataset: 1,000 examples, averaging 500 tokens each = 500K tokens
- Epochs: 3 (OpenAI default)
- Training tokens: 500K × 3 = 1.5M tokens
- Cost: 1.5M × $3.00/1M = $4.50
That's cheap. But you rarely get it right on the first try. Most fine-tuning projects require 3-5 experiments with different datasets, hyperparameters, or data cleaning strategies:
- Experiment 1: baseline dataset → $4.50
- Experiment 2: cleaned data, more examples → $8.00
- Experiment 3: different prompt format → $6.00
- Experiment 4: more epochs → $12.00
- Total training cost: $30.50
For GPT-4o fine-tuning at $25/1M training tokens, the same four experiments cost roughly $250 in training alone ($30.50 × 25/3).
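The formula and the experiment tally above can be sketched in a few lines. The function names are illustrative, and the GPT-4o figure rests on an assumption that the article does not state explicitly: that each GPT-4o run would train on the same token volume as its GPT-4o-mini counterpart.

```python
def training_cost(dataset_tokens: int, epochs: int, price_per_million: float) -> float:
    """USD cost of one training run: tokens x epochs x price per token."""
    return dataset_tokens * epochs / 1_000_000 * price_per_million


def rescale(total_usd: float, old_price: float, new_price: float) -> float:
    """Re-price a training bill at a different per-token rate, same token volume."""
    return total_usd * new_price / old_price


if __name__ == "__main__":
    # Baseline run from the example: 1,000 examples x 500 tokens, 3 epochs.
    baseline = training_cost(1_000 * 500, epochs=3, price_per_million=3.00)
    print(f"baseline run: ${baseline:.2f}")

    # Per-experiment costs as listed above (GPT-4o-mini, USD).
    experiments = [4.50, 8.00, 6.00, 12.00]
    total_mini = sum(experiments)
    print(f"four experiments on GPT-4o-mini: ${total_mini:.2f}")

    # Assumption: the GPT-4o runs use the same token volumes.
    total_4o = rescale(total_mini, old_price=3.00, new_price=25.00)
    print(f"same experiments on GPT-4o: ${total_4o:.2f}")
```

Under the same-volume assumption the GPT-4o total comes to about $254.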
2. Inference cost (ongoing, every call)
This is where the math gets important. Fine-tuned models cost more per token than base models on OpenAI:
| Model | Base inference (input) | Fine-tuned inference (input) | Increase |
|---|---|---|---|
| GPT-4o-mini | $0.15/1M | $0.30/1M | +100% |
| GPT-4o | $2.50/1M | $3.75/1M | +50% |
The paradox: You fine-tune to improve quality, but inference gets more expensive. The cost justification only works if fine-tuning lets you use a cheaper model that performs like an expensive one.
The winning scenario:
- You're currently using GPT-4o for a classification task: $2.50/1M input
- You fine-tune GPT-4o-mini to match GPT-4o's accuracy on your task: $0.30/1M input
- Savings: $2.20/1M input tokens (88% reduction)
The losing scenario:
- You're already using GPT-4o-mini with good prompting: $0.15/1M input
- You fine-tune GPT-4o-mini for marginal quality improvement: $0.30/1M input
- Result: you doubled your cost for a small quality gain
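The two scenarios differ only in which model you start from. A minimal sketch of the comparison, using the input prices from the table above (the helper name is illustrative):

```python
def percent_change(price_before: float, price_after: float) -> float:
    """Relative change in per-token price, in percent (negative means cheaper)."""
    return (price_after - price_before) / price_before * 100


if __name__ == "__main__":
    # Prices are USD per 1M input tokens.
    scenarios = {
        "GPT-4o -> fine-tuned GPT-4o-mini": (2.50, 0.30),
        "GPT-4o-mini -> fine-tuned GPT-4o-mini": (0.15, 0.30),
    }
    for name, (before, after) in scenarios.items():
        print(f"{name}: {percent_change(before, after):+.0f}%")
```

The first line comes out at -88% and the second at +100%, matching the winning and losing scenarios.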
3. Experimentation cost (hidden, significant)
The biggest cost isn't tokens — it's engineering time:
- Data preparation: Cleaning, formatting, and curating training examples. Budget 8-20 hours for a quality dataset.
- Evaluation pipeline: You need a systematic way to measure whether the fine-tuned model is actually better. Building eval sets takes 4-8 hours.
- Iteration cycles: Each experiment requires training, evaluation, comparison against the baseline, and deciding next steps. Budget 2-4 hours per iteration.
- Total engineering time: 20-50 hours for a well-executed fine-tuning project.
At $100/hour engineering cost, that's $2,000-$5,000 in time — dwarfing the $30-$250 in training API costs.
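As a rough check on the hour estimate, the line items can be summed directly. This sketch assumes the iteration count equals the 3-5 experiments mentioned earlier; the function name is illustrative.

```python
def engineering_hours(data_prep, eval_pipeline, per_iteration, iterations):
    """Sum (low, high) hour ranges; iteration time scales with the iteration count."""
    low = data_prep[0] + eval_pipeline[0] + per_iteration[0] * iterations[0]
    high = data_prep[1] + eval_pipeline[1] + per_iteration[1] * iterations[1]
    return low, high


if __name__ == "__main__":
    hours = engineering_hours(
        data_prep=(8, 20),
        eval_pipeline=(4, 8),
        per_iteration=(2, 4),
        iterations=(3, 5),  # assumption: one iteration per experiment
    )
    hourly_rate = 100  # USD per hour, as in the text above
    cost = tuple(h * hourly_rate for h in hours)
    print(hours, cost)  # (18, 48) (1800, 4800)
```

The itemized total of 18-48 hours is consistent with the rounded 20-50 hour estimate.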
When fine-tuning saves money
Fine-tuning is cost-effective in specific scenarios:
Scenario 1: Replacing an expensive model with a cheap fine-tuned one
Before: Using GPT-4o for customer intent classification
- 50,000 calls/day × 200 input tokens = 10M tokens/day
- Cost: $25/day → $750/month
After: Fine-tuned GPT-4o-mini (matches GPT-4o accuracy on your specific task)
- Same volume: 10M tokens/day
- Cost: $3/day → $90/month
- Training cost: ~$30 (one-time)
- Monthly savings: $660
- Payback: about 1.4 days on training cost alone ($30 ÷ $22/day in savings); longer once engineering time is counted
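The Scenario 1 arithmetic can be reproduced with a short sketch. It counts input tokens only, as the scenario does, and the $3,000 engineering figure in the last line is an assumption borrowed from the engineering-time estimate, not part of this scenario.

```python
def monthly_cost(calls_per_day, tokens_per_call, price_per_million, days=30):
    """Monthly input-token spend in USD."""
    return calls_per_day * tokens_per_call / 1_000_000 * price_per_million * days


def payback_days(upfront_usd, monthly_before, monthly_after, days=30):
    """Days of savings needed to recover an upfront cost."""
    daily_savings = (monthly_before - monthly_after) / days
    if daily_savings <= 0:
        raise ValueError("no savings, so the upfront cost is never recovered")
    return upfront_usd / daily_savings


if __name__ == "__main__":
    before = monthly_cost(50_000, 200, 2.50)  # GPT-4o
    after = monthly_cost(50_000, 200, 0.30)   # fine-tuned GPT-4o-mini
    print(f"before ${before:.0f}/mo, after ${after:.0f}/mo, saves ${before - after:.0f}/mo")
    print(f"payback on $30 training: {payback_days(30, before, after):.1f} days")
    # Assumption: $3,000 of engineering time on top of the training cost.
    print(f"payback on $3,030 total: {payback_days(3_030, before, after):.0f} days")
```

With daily savings of $22, training alone pays back in roughly a day and a half, while the engineering-inclusive figure stretches to several months.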
Scenario 2: Eliminating a large system prompt
Before: GPT-4o-mini with a 2,000-token system prompt containing formatting rules, examples, and constraints
- 20,000 calls/day × 2,500 total input tokens = 50M tokens/day
- Cost: $7.50/day → $225/month
After: Fine-tuned GPT-4o-mini that has learned the formatting rules — system prompt drops to 200 tokens
- 20,000 calls/day × 700 total input tokens = 14M tokens/day
- Cost: $4.20/day → $126/month (fine-tuned rate at $0.30/1M)
- Monthly savings: $99
The savings come from eliminating the system prompt tokens, not from the model being cheaper. The fine-tuned model's higher per-token rate is offset by sending far fewer tokens per call.
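Because the fine-tuned rate is 2x the base rate, prompt shrinking only pays off once the per-call input drops below half its original size. A minimal sketch of that break-even, with illustrative function names and input tokens only (as in the scenario):

```python
def break_even_tokens(tokens_before: float, base_price: float, tuned_price: float) -> float:
    """Largest per-call input size at which the fine-tuned model costs no more."""
    return tokens_before * base_price / tuned_price


def monthly_savings(calls_per_day, tokens_before, tokens_after,
                    base_price, tuned_price, days=30):
    """Monthly USD saved by sending fewer tokens at the higher fine-tuned rate."""
    per_call_before = tokens_before * base_price / 1_000_000
    per_call_after = tokens_after * tuned_price / 1_000_000
    return (per_call_before - per_call_after) * calls_per_day * days


if __name__ == "__main__":
    limit = break_even_tokens(2_500, base_price=0.15, tuned_price=0.30)
    saved = monthly_savings(20_000, 2_500, 700, base_price=0.15, tuned_price=0.30)
    print(f"break-even at {limit:.0f} input tokens per call")
    print(f"at 700 tokens per call: ${saved:.0f}/month saved")
```

Output tokens are also billed at the fine-tuned rate, which this sketch ignores, so real savings would be somewhat smaller than the input-only figure.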
When fine-tuning wastes money
Scenario 3: Fine-tuning when prompting works fine
Current: GPT-4o-mini with a well-crafted 400-token prompt handles your task at 92% accuracy
- Cost: $0.15/1M → you're already on the cheapest viable option
After fine-tuning: GPT-4o-mini fine-tuned to 95% accuracy
- Cost: $0.30/1M → doubled your inference cost
- Quality improvement: 92% → 95% (marginal)
- Training + engineering cost: $3,000+
Unless that 3-percentage-point accuracy gain has direct revenue impact, this fine-tuning project destroyed value.
Scenario 4: Fine-tuning with insufficient data
Fine-tuning with fewer than 100 high-quality examples rarely produces meaningful improvement. With 50 examples, you've spent engineering time curating data and paid for training, but the model barely shifts from its base behavior.
Minimum viable dataset: 200-500 examples for classification tasks, 500-1,000 for generation tasks. Below these thresholds, invest in better prompting instead.
Fine-tuning vs prompt engineering: cost comparison
| Approach | Upfront cost | Ongoing cost | Engineering time | Quality ceiling |
|---|---|---|---|---|
| Zero-shot prompting | $0 | Base rate | 2-4 hours | Good |
| Few-shot prompting | $0 | Base rate + example tokens | 4-8 hours | Better |
| System prompt optimization | $0 | Base rate | 8-16 hours | Better |
| Prompt caching | $0 | 50-90% off cached tokens | 2-4 hours | Same as prompting |
| Fine-tuning | $30-$500 in training | 1.5-2x base rate | 20-50 hours | Best (for your task) |
The decision rule: Try prompting approaches first. They're cheaper, faster, and reversible. Fine-tune only when you've exhausted prompting and the gap between "prompting quality" and "needed quality" is clear and measurable.
Tracking fine-tuned model costs
Fine-tuned models have different model names in OpenAI's API (e.g., ft:gpt-4o-mini:my-org:custom-name:abc123). If you're tracking costs per model, make sure your cost calculator handles fine-tuned model names:
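
    def get_rate(model):
        """Return USD rates per 1M tokens for an OpenAI model name."""
        # Check the more specific prefix first:
        # "ft:gpt-4o-mini..." also starts with "ft:gpt-4o".
        if model.startswith("ft:gpt-4o-mini"):
            return {"input": 0.30, "output": 1.20}  # per 1M
        elif model.startswith("ft:gpt-4o"):
            return {"input": 3.75, "output": 15.00}  # per 1M
        # ... base model rates
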
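A rate lookup becomes useful once it is applied to per-request token counts. The following is a minimal, self-contained sketch; the `RATES` table and `request_cost` name are illustrative, and the base output rates ($0.60 and $10.00) are derived from the "2x base" and "1.5x base" notes in the pricing table rather than stated directly.

```python
# (prefix, input USD per 1M, output USD per 1M), most specific prefix first.
RATES = [
    ("ft:gpt-4o-mini", 0.30, 1.20),
    ("ft:gpt-4o", 3.75, 15.00),
    ("gpt-4o-mini", 0.15, 0.60),   # output rate derived from "2x base"
    ("gpt-4o", 2.50, 10.00),       # output rate derived from "1.5x base"
]


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request, matched by model-name prefix."""
    for prefix, input_rate, output_rate in RATES:
        if model.startswith(prefix):
            return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    raise ValueError(f"no rate configured for model {model!r}")


if __name__ == "__main__":
    model = "ft:gpt-4o-mini:my-org:custom-name:abc123"
    print(f"${request_cost(model, input_tokens=700, output_tokens=50):.6f}")
```

Raising on an unknown model name is a deliberate choice here: a silent fallback to a base rate would hide exactly the under-counting this section warns about.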
With Tokonomics, fine-tuned model costs are calculated automatically — the proxy reads the model name from the response and applies the correct fine-tuned rate. Your analytics dashboard shows fine-tuned and base model costs separately so you can verify the savings.
Budget alerts are especially important during fine-tuning experimentation. You might accidentally route production traffic to a fine-tuned model that bills at 2x the per-token price you budgeted for. An alert at 80% of your monthly budget catches this before it becomes a problem.
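The 80% rule is simple enough to sketch on its own. This is a standalone illustration with hypothetical function names, not the API of any particular monitoring tool, and the linear projection is a deliberately naive assumption.

```python
def should_alert(spend_to_date: float, monthly_budget: float, threshold: float = 0.80) -> bool:
    """True once spend reaches the given fraction of the monthly budget."""
    return spend_to_date >= threshold * monthly_budget


def projected_month_end(spend_to_date: float, day_of_month: int, days_in_month: int = 30) -> float:
    """Naive linear projection of month-end spend from the daily average so far."""
    return spend_to_date / day_of_month * days_in_month


if __name__ == "__main__":
    budget = 100.0  # USD, planned at the base-model rate (hypothetical)
    spend, day = 80.0, 12  # hypothetical: traffic is being billed at 2x
    if should_alert(spend, budget):
        projection = projected_month_end(spend, day)
        print(f"alert on day {day}: ${spend:.0f} spent, ~${projection:.0f} projected")
```

In this hypothetical, the alert fires on day 12 instead of the overrun surfacing on the invoice.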
The bottom line
Fine-tuning is a tool, not a default. The cost-saving path is narrow and specific: fine-tune a cheap model to replace an expensive one, or fine-tune to eliminate large system prompts. Outside these scenarios, fine-tuning typically increases costs.
Before fine-tuning, ask:
- Have I tried optimizing my prompts first?
- Have I tried a cheaper base model?
- Is the quality gap between prompting and my target measurable?
- Do I have 200+ high-quality training examples?
- Will the inference savings exceed the training + engineering cost within 3 months?
If the answer to all five is yes, fine-tuning is likely to save you money. If any answer is no, your time is better spent elsewhere.
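The five questions can be encoded as a small checklist so the payback test is computed, not guessed. This is a sketch under stated assumptions: the class and field names are hypothetical, and the thresholds (200 examples, 3 months) are the ones given in the list above.

```python
from dataclasses import dataclass


@dataclass
class Checklist:
    prompts_optimized: bool
    cheaper_base_model_tried: bool
    quality_gap_measured: bool
    training_examples: int
    upfront_cost_usd: float      # training plus engineering time
    monthly_savings_usd: float   # expected inference savings

    def payback_months(self) -> float:
        """Months of savings needed to recover the upfront cost."""
        if self.monthly_savings_usd <= 0:
            return float("inf")
        return self.upfront_cost_usd / self.monthly_savings_usd

    def passes(self, min_examples: int = 200, max_payback_months: float = 3.0) -> bool:
        """True only if all five checks pass."""
        return (
            self.prompts_optimized
            and self.cheaper_base_model_tried
            and self.quality_gap_measured
            and self.training_examples >= min_examples
            and self.payback_months() <= max_payback_months
        )


if __name__ == "__main__":
    # Hypothetical project: $3,030 upfront, $660/month saved.
    project = Checklist(True, True, True, 500, 3_030.0, 660.0)
    print(f"payback: {project.payback_months():.1f} months, passes: {project.passes()}")
```

Note that the upfront cost here includes engineering time; a project that pays back in days on training cost alone can still miss the 3-month bar once those hours are counted.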
Last updated June 2026. All sources retrieved June 2026.