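TL;DR: Use the API until you hit ~50M tokens/day. A single cloud A100 serving a 4-bit 70B model costs ~$3,000/month, but with a standby server, engineering time, and monitoring the realistic total is closer to $7,000/month. At GPT-4o's blended rate of $6.25 per 1M tokens, that breaks even at roughly 37M tokens/day; against GPT-4o-mini the break-even is above 600M tokens/day. Factor in GPU ops, model updates, and downtime risk, and self-hosting only makes sense for most startups at sustained high volume on expensive models.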
This is a cost question disguised as a technology question. The models available through self-hosting (Llama 3.3, Mistral, Gemma 2) are capable enough for most tasks. The real question is: at your usage volume, is paying per token cheaper than paying for GPU infrastructure?
The short answer: if you process fewer than 50 million tokens per day, the API is almost certainly cheaper. Above that, self-hosting starts to make financial sense — but it comes with engineering costs that most teams underestimate.
The true cost of API usage
API pricing is simple: you pay per token, no infrastructure, no ops.
| Model | Input (per 1M) | Output (per 1M) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude Sonnet 4 | $3.00 | $15.00 |
| DeepSeek V3 | $0.27 | $1.10 |
For a full breakdown, see our LLM API pricing guide.
What you get with API pricing:
- Zero infrastructure management
- Automatic scaling — 10 requests or 10,000, same experience
- Latest model versions automatically
- No GPU procurement or maintenance
- Pay only for what you use
- Uptime managed by the provider (formal SLA terms vary by provider and plan)
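Monthly cost examples at different volumes, assuming a 30-day month and an even input/output split (blended $6.25 per 1M tokens for GPT-4o, $0.375 per 1M for GPT-4o-mini):
| Volume (tokens/day) | GPT-4o monthly | GPT-4o-mini monthly |
|---|---|---|
| 1M | $188 | $11 |
| 10M | $1,875 | $113 |
| 50M | $9,375 | $563 |
| 100M | $18,750 | $1,125 |
| 500M | $93,750 | $5,625 |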
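Monthly figures like these depend heavily on the input/output split and on how many days you count per month, so it's worth recomputing them for your own traffic. Here is a minimal sketch using the list prices from the pricing table; the function names, the `input_share` parameter, and the 30-day default are illustrative assumptions, not part of any provider's SDK.

```python
def blended_rate(input_price: float, output_price: float, input_share: float = 0.5) -> float:
    """Price per 1M tokens for a given mix of input and output tokens.

    input_share is the fraction of all tokens that are input (prompt) tokens.
    """
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    return input_share * input_price + (1.0 - input_share) * output_price


def monthly_api_cost(
    tokens_per_day: float,
    input_price: float,
    output_price: float,
    input_share: float = 0.5,  # assumption: even input/output split
    days: int = 30,            # assumption: 30-day month
) -> float:
    """Estimated monthly API bill in dollars."""
    millions_per_month = tokens_per_day * days / 1_000_000
    return millions_per_month * blended_rate(input_price, output_price, input_share)


if __name__ == "__main__":
    for volume in (1_000_000, 10_000_000, 50_000_000, 100_000_000, 500_000_000):
        gpt4o = monthly_api_cost(volume, 2.50, 10.00)
        mini = monthly_api_cost(volume, 0.15, 0.60)
        print(f"{volume:>12,} tokens/day  GPT-4o ${gpt4o:>10,.2f}  GPT-4o-mini ${mini:>9,.2f}")
```

Most production workloads are input-heavy (long prompts, short answers), so try something like `input_share=0.8` as well. The blended rate drops, and so does the bill.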
The true cost of self-hosting
Self-hosting means running an open-source model (Llama 3.3, Mistral, Gemma 2) on your own GPU infrastructure. The marginal per-token cost is effectively zero: you pay for the server whether it sits idle or runs flat out.
GPU server costs
| Setup | GPU | Monthly cost | Model capacity |
|---|---|---|---|
| Cloud A100 (80GB) | 1x A100 | $2,500-$3,500 | Llama 3.3 70B (4-bit quantized) |
| Cloud H100 | 1x H100 | $3,500-$5,000 | Llama 3.3 70B (8-bit quantized; 16-bit weights are ~140 GB and exceed 80 GB) |
| Cloud 2x A100 | 2x A100 | $5,000-$7,000 | Llama 3.3 70B (full precision, higher throughput) |
| On-prem A100 | 1x A100 | ~$800-$1,200 (amortized over 3 years) | Same as cloud but you manage hardware |
Cloud GPU pricing varies by provider. AWS, GCP, Lambda Labs, RunPod, and Vast.ai all offer different price points. The cheapest option changes monthly — shop around.
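The "Model capacity" column in the table above comes down to simple arithmetic on weight size. Here is a rough sketch, assuming memory is dominated by the weights and that about 10% of GPU memory is held back for the KV cache and runtime overhead. That reserve is an illustrative guess; real headroom depends on context length and batch size. Both function names are hypothetical.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8


def weights_fit(
    params_billions: float,
    bits_per_param: int,
    gpu_memory_gb: float,
    num_gpus: int = 1,
    reserve_fraction: float = 0.10,  # assumption: headroom for KV cache and runtime
) -> bool:
    """True if the weights fit in the memory left after the reserve."""
    usable_gb = gpu_memory_gb * num_gpus * (1.0 - reserve_fraction)
    return weight_memory_gb(params_billions, bits_per_param) <= usable_gb


if __name__ == "__main__":
    for bits in (4, 8, 16):
        size = weight_memory_gb(70, bits)
        one = weights_fit(70, bits, gpu_memory_gb=80)
        two = weights_fit(70, bits, gpu_memory_gb=80, num_gpus=2)
        print(f"70B @ {bits:>2}-bit: {size:>5.0f} GB  1x80GB: {one}  2x80GB: {two}")
```

Under these assumptions a 70B model needs about 35 GB at 4-bit, 70 GB at 8-bit, and 140 GB at 16-bit. An 8-bit model on a single 80 GB card fits on paper but leaves very little room for long prompts or large batches.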
Hidden infrastructure costs
The GPU server is not the only cost. Self-hosting requires:
Engineering time. Someone needs to set up the inference server (vLLM, TGI, Ollama), configure batching, manage memory, tune performance, and handle failures. Budget 2-4 weeks of engineer time for initial setup, then 4-8 hours/month for maintenance.
Monitoring and observability. GPU utilization, inference latency, memory usage, queue depth. You need dashboards for all of this. Without monitoring, you won't know when your model server is degrading.
Redundancy. A single GPU server is a single point of failure. For production, you need at least two — one active, one standby. That doubles your GPU cost.
Scaling. API providers scale instantly. Self-hosted infrastructure doesn't. If traffic spikes 3x during a launch, you either pre-provision (paying for idle capacity) or queue requests (degrading user experience).
Model updates. When a new open-weight model ships (the next Llama, Mistral, or Gemma release), someone needs to download it, benchmark it, test it against your use case, update the deployment, and roll it out. API providers do this for you.
Realistic total cost of self-hosting
| Component | Monthly cost |
|---|---|
| 1x A100 GPU (cloud) | $3,000 |
| Redundant backup server | $3,000 |
| Engineer time (8 hrs/month @ $100/hr) | $800 |
| Monitoring/infra tooling | $200 |
| Total | $7,000/month |
That $7,000 buys you unlimited tokens — but only up to your throughput ceiling. A single A100 running Llama 3.3 70B (4-bit quantized) delivers roughly 30-50 tokens/second per request, handling 5-15 concurrent requests depending on prompt length.
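To see what that throughput ceiling means in tokens per day, multiply it out. The sketch below uses only the figures quoted above (30-50 tokens/second per request, 5-15 concurrent requests). It assumes the per-request speed holds at the stated concurrency and that the server is busy around the clock, both of which are optimistic. The function name and the `utilization` parameter are illustrative.

```python
SECONDS_PER_DAY = 86_400


def daily_token_ceiling(
    tokens_per_sec_per_request: float,
    concurrent_requests: int,
    utilization: float = 1.0,  # assumption: fraction of the day the server is saturated
) -> float:
    """Upper bound on generated tokens per day for one server."""
    return tokens_per_sec_per_request * concurrent_requests * SECONDS_PER_DAY * utilization


if __name__ == "__main__":
    low = daily_token_ceiling(30, 5)
    high = daily_token_ceiling(50, 15)
    print(f"Ceiling: {low / 1e6:.1f}M to {high / 1e6:.1f}M generated tokens/day")
    # Real traffic is rarely flat, so also look at a partially loaded server.
    print(f"At 50% utilization: {daily_token_ceiling(50, 15, 0.5) / 1e6:.1f}M tokens/day")
```

With these inputs a single A100 tops out somewhere between about 13M and 65M generated tokens per day. The GPT-4o break-even of roughly 37M tokens/day computed in the next section falls inside that range, so check whether one server can actually carry your break-even volume before assuming the $7,000 figure holds.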
The break-even calculation
Break-even vs GPT-4o ($2.50/1M input, $10/1M output; blended $6.25/1M assuming an even input/output split):
$7,000/month ÷ $6.25 per 1M tokens (blended rate) = 1,120M tokens/month ≈ 1.12 billion tokens/month
You need to process about 1.12 billion tokens per month to break even with GPT-4o pricing. Over a 30-day month, that's roughly 37 million tokens per day. Input-heavy workloads have a lower blended rate, which pushes the break-even higher.
Break-even vs GPT-4o-mini ($0.15/1M input, $0.60/1M output; blended $0.375/1M on the same even split):
$7,000/month ÷ $0.375 per 1M tokens (blended rate) ≈ 18,700M tokens/month = 18.7 billion tokens/month
You need about 18.7 billion tokens per month, or roughly 620 million per day. Almost nobody hits this. GPT-4o-mini is so cheap that self-hosting rarely makes financial sense compared to it.
The key insight: Self-hosting competes with expensive models (GPT-4o, Claude Sonnet), not cheap ones (GPT-4o-mini, DeepSeek V3). If a cheap API model handles your task, self-hosting loses on cost for all but the highest volumes.
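The two break-even figures above can be reproduced, and rerun with your own numbers, in a few lines. This sketch takes the cost components from the "Realistic total cost" table and the blended rates used in the calculation. The dictionary keys, the function name, and the 30-day month are assumptions made for illustration.

```python
# Monthly self-hosting cost components, taken from the table above (USD).
SELF_HOST_MONTHLY = {
    "gpu_server": 3_000,
    "standby_server": 3_000,
    "engineer_time": 800,
    "monitoring_tooling": 200,
}


def break_even_tokens_per_day(
    monthly_self_host_cost: float,
    blended_price_per_million: float,
    days: int = 30,  # assumption: 30-day month
) -> float:
    """Daily token volume at which the API bill equals the self-hosting bill."""
    tokens_per_month = monthly_self_host_cost / blended_price_per_million * 1_000_000
    return tokens_per_month / days


if __name__ == "__main__":
    total = sum(SELF_HOST_MONTHLY.values())
    for name, rate in (("GPT-4o", 6.25), ("GPT-4o-mini", 0.375)):
        per_day = break_even_tokens_per_day(total, rate)
        print(f"{name}: break-even at about {per_day / 1e6:,.0f}M tokens/day")
```

Dropping the standby server or changing the input/output mix moves the break-even a lot, so treat the output as a starting point rather than a threshold.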
When to use the API
Your volume is under 50M tokens/day. The GPT-4o break-even above is roughly 37M tokens/day, and just past that line the savings are too thin to cover the operational risk, so the API is still the safer choice.
You need the best models. GPT-4o and Claude Sonnet 4 outperform open-source models on complex reasoning, nuanced writing, and tool use. If quality directly impacts revenue, the premium model is worth the premium price.
You're a small team. If you have 2-5 engineers, spending one of them on GPU infrastructure management is a 20-50% productivity hit. Use the API and spend that time on your product.
Your usage is unpredictable. Seasonal products, early-stage startups, or features still finding product-market fit. API lets you scale to zero when you don't need it.
You use multiple models. If you route different tasks to GPT-4o, Claude, and DeepSeek based on the use case — which is the smartest cost optimization — replicating that with self-hosted models means running multiple GPU servers.
When to self-host
Your volume exceeds 50M tokens/day on expensive models. At this scale, GPU infrastructure pays for itself. The higher the volume, the bigger the savings.
Data privacy is non-negotiable. Some industries (healthcare, finance, government) require that data never leaves your infrastructure. Self-hosting with on-prem GPUs satisfies this completely. API providers process your data on their servers — even with data processing agreements.
You need fine-tuned models. If you've fine-tuned Llama 3.3 on your proprietary data and the quality difference is significant, you need to host that model yourself. API fine-tuning (OpenAI, Anthropic) works but limits you to that provider.
Latency must be low and predictable. Self-hosted models give you dedicated capacity: no provider-side queue, no rate limits, and consistent latency as long as you stay under your own throughput ceiling. API latency varies with provider load.
You have the team for it. You need at least one engineer who understands GPU infrastructure, model serving, and inference optimization. If you don't have this person, the maintenance burden will exceed the cost savings.
The hybrid approach
Most teams that scale past $5,000/month in API costs don't go all-in on self-hosting. They use a hybrid:
- Self-host for high-volume, simple tasks. Classification, extraction, embedding: tasks where Llama 3.3 performs as well as GPT-4o. These are your highest-volume calls and benefit most from unlimited tokens.
- Use APIs for complex, low-volume tasks. Reasoning, creative writing, code generation: tasks where GPT-4o or Claude Sonnet's quality advantage matters. These are lower volume, so the per-token cost is manageable.
- Use APIs as fallback. When your self-hosted server is overloaded or down, route overflow to the API. This gives you the cost savings of self-hosting with the reliability of API providers.
This is similar to how teams use a multi-model strategy — but instead of routing between API models, you're routing between self-hosted and API-hosted models.
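The routing logic behind a hybrid setup can be very small. The sketch below is one possible shape, not a reference implementation. The task categories come from the list above, while `Backend`, `ServerState`, `choose_backend`, and the queue-depth threshold are all hypothetical names and values you would replace with your own health checks.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    SELF_HOSTED = "self_hosted"
    API = "api"


# Task types the article treats as simple enough for a self-hosted model.
SIMPLE_TASKS = {"classification", "extraction", "embedding"}


@dataclass
class ServerState:
    healthy: bool
    queue_depth: int
    max_queue_depth: int = 32  # assumption: overflow threshold, tune per deployment


def choose_backend(task_type: str, server: ServerState) -> Backend:
    """Pick a backend: API for complex tasks, self-hosted for simple ones, API as fallback."""
    if task_type not in SIMPLE_TASKS:
        return Backend.API  # complex, low-volume work goes to the premium model
    if not server.healthy or server.queue_depth >= server.max_queue_depth:
        return Backend.API  # overflow or outage: fall back to the API
    return Backend.SELF_HOSTED


if __name__ == "__main__":
    busy = ServerState(healthy=True, queue_depth=40)
    idle = ServerState(healthy=True, queue_depth=2)
    print(choose_backend("classification", idle).value)
    print(choose_backend("classification", busy).value)
    print(choose_backend("code_generation", idle).value)
```

Keeping the decision in one pure function makes it easy to log which backend served each request, which is the data you need later to check whether the split is actually saving money.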
Decision framework
| Factor | API wins | Self-hosted wins |
|---|---|---|
| Volume < 50M tokens/day | ✅ | |
| Volume > 100M tokens/day | | ✅ |
| Team size < 10 engineers | ✅ | |
| Data must stay on-prem | | ✅ |
| Need GPT-4o/Claude quality | ✅ | |
| Need fine-tuned models | | ✅ |
| Unpredictable usage | ✅ | |
| Stable, predictable load | | ✅ |
| Budget < $5K/month | ✅ | |
| Budget > $10K/month | Evaluate both | Evaluate both |
Managing costs regardless of approach
Whether you use APIs, self-host, or go hybrid, you need cost visibility:
- API costs: Track every call with exact token counts and costs. Tokonomics does this automatically across all providers — one dashboard for OpenAI, Anthropic, DeepSeek, and any OpenAI-compatible API.
- Self-hosted costs: Track GPU utilization, throughput, and cost per request to compare against API alternatives.
- Both: Set budget alerts so you know when spending drifts from your plan. Audit monthly to catch model-mix inefficiencies.
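Two of the checks above are easy to automate once you have the raw numbers. This is a minimal sketch, assuming you already collect monthly tokens served and month-to-date spend from your own metrics. The function names, the straight-line budget model, and the 10% tolerance are illustrative choices, not features of any particular tool.

```python
def self_hosted_cost_per_million(monthly_cost: float, tokens_served_per_month: float) -> float:
    """Effective price per 1M tokens for a fixed-cost server at the observed volume."""
    if tokens_served_per_month <= 0:
        raise ValueError("tokens_served_per_month must be positive")
    return monthly_cost / (tokens_served_per_month / 1_000_000)


def budget_alert(
    month_to_date_spend: float,
    monthly_budget: float,
    day_of_month: int,
    days_in_month: int = 30,
    tolerance: float = 0.10,  # assumption: alert when more than 10% ahead of plan
) -> bool:
    """True when spend is running ahead of a straight-line budget by more than the tolerance."""
    expected_so_far = monthly_budget * day_of_month / days_in_month
    return month_to_date_spend > expected_so_far * (1.0 + tolerance)


if __name__ == "__main__":
    # A $7,000/month setup serving 500M tokens costs more per token than GPT-4o's blended rate.
    print(f"${self_hosted_cost_per_million(7_000, 500_000_000):.2f} per 1M tokens")
    print("Alert:", budget_alert(month_to_date_spend=3_000, monthly_budget=5_000, day_of_month=15))
```

The first function is the one to watch on a self-hosted setup: an underused server can easily cost more per token than the API it replaced.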
The build vs buy decision isn't permanent. Start with APIs (zero infrastructure, instant scale), monitor your costs as you grow, and evaluate self-hosting when your monthly bill consistently exceeds $5,000-$10,000 on a single model. The data you collect now — volume, token patterns, model distribution — is what makes that evaluation possible later.
Last updated June 2026. All sources retrieved June 2026.