Real AI Costs in Production 2026¶
“AI is cheap,” say the vendor slides. Reality: an enterprise company with 50,000 queries per day on a GPT-4 class model pays $15,000–$45,000 per month for inference alone. And that does not include embeddings, fine-tuning or infrastructure. This is a guide to the real costs — and strategies that reduce them by 50–80%.
Pricing Landscape at the Start of 2026¶
The LLM API market has gone through a massive price war over the past year. Prices have dropped 60–90% compared to early 2024. But beware — the price per token is only part of the story. Real costs depend on how many tokens you generate, and output tokens are 3–5× more expensive than input.
| Model (Q1 2026) | Input / 1M tokens | Output / 1M tokens | Typical use case |
|---|---|---|---|
| GPT-4.1 | $2.00 | $8.00 | General purpose, coding |
| GPT-4.1 mini | $0.40 | $1.60 | Cost-efficient tasks |
| Claude Sonnet 4 | $3.00 | $15.00 | Complex reasoning, coding |
| Claude Haiku 3.5 | $0.80 | $4.00 | Fast responses, classification |
| Claude Opus 4 | $15.00 | $75.00 | Frontier reasoning |
| Gemini 2.5 Pro | $1.25 | $10.00 | Multimodal, long context |
| Gemini 2.5 Flash | $0.15 | $0.60 | High-volume, low-cost |
| DeepSeek V3 | $0.28 | $0.42 | Budget reasoning |
| Llama 3.3 70B (self-hosted) | ~$0.20* | ~$0.20* | On-premise, data sovereignty |
* Self-hosted price is approximate — depends on GPU hardware, utilisation and amortisation. Includes A100/H100 hosting + electricity.
What One Query Costs: Cost per Query Breakdown¶
A typical enterprise query (RAG pipeline with context) averages 2,000 input tokens (prompt + retrieved context) and 500 output tokens (response). Based on this:
| Model | Cost per query | 50K queries/day | Monthly |
|---|---|---|---|
| GPT-4.1 | $0.008 | $400 | $12,000 |
| GPT-4.1 mini | $0.0016 | $80 | $2,400 |
| Claude Sonnet 4 | $0.0135 | $675 | $20,250 |
| Claude Haiku 3.5 | $0.0036 | $180 | $5,400 |
| Gemini 2.5 Flash | $0.0006 | $30 | $900 |
| DeepSeek V3 | $0.00077 | $38.50 | $1,155 |
The difference between the most expensive and cheapest option is 22×. And we are talking about a simple RAG query. For agentic systems where a single user request generates 5–15 LLM calls, costs multiply accordingly.
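The arithmetic behind these figures is easy to reproduce. A minimal sketch, using the list prices from the table above (the PRICES dictionary and helper names are illustrative, not any provider's SDK):

```python
# Cost-per-query sketch using the list prices quoted above (USD per 1M tokens).
PRICES = {
    "gpt-4.1":          {"input": 2.00, "output": 8.00},
    "gpt-4.1-mini":     {"input": 0.40, "output": 1.60},
    "claude-sonnet-4":  {"input": 3.00, "output": 15.00},
    "gemini-2.5-flash": {"input": 0.15, "output": 0.60},
}

def cost_per_query(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request in USD."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def monthly_cost(model: str, queries_per_day: int,
                 input_tokens: int = 2_000, output_tokens: int = 500,
                 days: int = 30) -> float:
    """Monthly bill for a steady daily volume of similar queries."""
    return cost_per_query(model, input_tokens, output_tokens) * queries_per_day * days

print(monthly_cost("claude-sonnet-4", 50_000))    # ~20,250
print(monthly_cost("gemini-2.5-flash", 50_000))   # ~900
```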
Hidden Costs Vendors Do Not Mention¶
API pricing is the tip of the iceberg. The full TCO includes:
- Embedding generation — every document in the knowledge base must go through an embedding model. For 100K documents that is a one-off $50–200, but re-indexing after every knowledge-base update is an ongoing cost
- Vector database hosting — Pinecone $70+/month, managed Qdrant $100+/month, self-hosted requires RAM (1M vectors ≈ 4–8 GB RAM)
- Prompt engineering and evals — 20–40% of engineering time goes into prompts, testing and iteration. This is often the single largest cost item
- Observability — LangSmith, Langfuse, custom — $200–2,000/month for production monitoring
- Guardrails and safety — content filtering, PII detection, compliance checks — additional latency and costs
- Retry and error handling — rate limits, 5xx errors, timeout retries = 10–20% extra calls
Real-World Example: Enterprise Chatbot¶
A company with 2,000 employees runs an internal knowledge-base chatbot: 50,000 queries/day through a RAG pipeline with Claude Sonnet.
API inference: $20,250/month · Embeddings + vector DB: $500/month · Observability: $500/month · Engineering (0.5 FTE): $5,000/month
Total: ~$26,250/month = $315,000/year
Strategy #1: Semantic Caching¶
The simplest and most effective optimisation. 30–60% of queries in enterprise chatbots are repeated (or semantically similar). Instead of a new LLM call, you return a cached response.
- How it works: Query → embedding → similarity search in cache → if similarity > 0.95, return the cached response (see the sketch after this list)
- Tools: GPTCache, Redis + vector search, custom implementation with pgvector
- Typical savings: 30–50% of API calls, latency from 2–5s to <100ms for cache hits
- Watch out for: Cache invalidation on knowledge base changes, TTL policy, cache poisoning
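A minimal in-memory sketch of the idea; embed() is a placeholder for whatever embedding model you use, and a production setup would back this with Redis, pgvector or GPTCache rather than a Python list:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: call your embedding model of choice here."""
    raise NotImplementedError

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []  # (normalised embedding, cached response)

    def lookup(self, query: str) -> str | None:
        q = embed(query)
        q = q / np.linalg.norm(q)
        for emb, response in self.entries:
            if float(np.dot(q, emb)) >= self.threshold:  # cosine similarity of unit vectors
                return response                           # cache hit: skip the LLM call
        return None

    def store(self, query: str, response: str) -> None:
        q = embed(query)
        self.entries.append((q / np.linalg.norm(q), response))

# Usage: consult the cache first, only call the LLM on a miss.
# answer = cache.lookup(user_query)
# if answer is None:
#     answer = call_llm(user_query)
#     cache.store(user_query, answer)
```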
Strategy #2: Model Routing (Smart Cascading)¶
Not every query needs a frontier model. “How many employees do we have?” can be handled by a model at $0.0006/query. “Analyse this contract and identify risks” needs a model at $0.013/query.
- Principle: A classifier (small model or rule-based) evaluates query complexity and routes it to the appropriate model (see the sketch after this list)
- Architecture: Input → Complexity classifier → Router → [Small model | Medium model | Large model]
- Typical split: 60% small model, 30% medium, 10% large = average cost drops by 60–70%
- Tools: Martian, Portkey, Unify.ai, or a custom router with embeddings-based classification
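A sketch of the routing idea; the keyword-and-length heuristic below is deliberately naive and purely illustrative (real routers use a small classifier model or embedding similarity), and the tier-to-model mapping just reuses names from the pricing table:

```python
# Naive complexity heuristic: long or analysis-heavy queries go to bigger models.
COMPLEX_HINTS = ("analyse", "analyze", "compare", "contract", "risk", "explain why")

def classify(query: str) -> str:
    q = query.lower()
    if len(q) > 400 or sum(hint in q for hint in COMPLEX_HINTS) >= 2:
        return "large"
    if len(q) > 120 or any(hint in q for hint in COMPLEX_HINTS):
        return "medium"
    return "small"

# Tier -> model mapping; names follow the pricing table above.
ROUTES = {
    "small":  "gemini-2.5-flash",
    "medium": "gpt-4.1-mini",
    "large":  "claude-sonnet-4",
}

def route(query: str) -> str:
    return ROUTES[classify(query)]

print(route("How many employees do we have?"))                       # small tier
print(route("Analyse this contract and identify the key risks..."))  # large tier
```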
Routing in Practice: 84% Savings¶
Without routing: 50,000 queries/day × Claude Sonnet = $20,250/month
With routing: 30,000 × Gemini Flash ($540) + 15,000 × GPT-4.1 mini ($720) + 5,000 × Claude Sonnet ($2,025) = $3,285/month
Savings: $16,965/month (≈84%)
Strategy #3: Prompt Optimisation¶
Every unnecessary token costs money. And most prompts are 2–3× longer than they need to be.
- System prompt audit: Shorten system prompts. 500 tokens of instructions → 150 tokens with the same result = 70% savings on system prompt overhead
- Context window management: Do not send the entire conversation history. Summarise, trim or use a sliding window (see the sketch after this list)
- Retrieved context pruning: RAG often returns 5–10 chunks. A reranker (Cohere Rerank, BGE Reranker) selects the top 2–3 and discards the rest
- Output length control: Set max_tokens. Without a limit, the model generates until it decides to stop — and output tokens are 3–5× more expensive
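A sketch of the sliding-window and max_tokens points; estimate_tokens() uses a crude 4-characters-per-token rule of thumb and call_llm() is a stand-in for your actual client, both assumptions rather than a specific SDK:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def sliding_window(history: list[dict], budget_tokens: int = 3_000) -> list[dict]:
    """Keep only the most recent turns that fit within the token budget."""
    kept, used = [], 0
    for turn in reversed(history):            # walk from newest to oldest
        cost = estimate_tokens(turn["content"])
        if used + cost > budget_tokens:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))

# Trimmed history plus an explicit output cap keeps both ends of the bill in check:
# response = call_llm(
#     messages=sliding_window(conversation_history),
#     max_tokens=400,   # cap the output -- output tokens are the expensive ones
# )
```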
Strategy #4: Knowledge Distillation¶
Have a frontier model that handles your use case perfectly? Distil its knowledge into a smaller model. Result: 90% of the quality at 10% of the cost.
- Process: Large model generates training data → Fine-tune a small model on that data → Deploy the small model (see the sketch after this list)
- Example: GPT-4 generates 10,000 examples for ticket classification → Fine-tune Llama 3.1 8B → Deploy on your own GPU at $0.0002/query
- When it works: Tasks with clearly defined scope (classification, extraction, summarisation). Does not work for open-ended reasoning
- Tools: OpenAI fine-tuning API, Anyscale, Modal, custom training pipeline with PEFT/LoRA
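A sketch of the data-preparation step, assuming the teacher model's outputs have already been collected and spot-checked; the chat-style JSONL layout below is a common fine-tuning format, but check your provider's or trainer's docs for the exact schema it expects:

```python
import json

# teacher_outputs: (ticket_text, label) pairs generated by the large model,
# ideally reviewed or spot-checked by a human before training.
teacher_outputs = [
    ("Cannot log in to VPN since this morning", "it_access"),
    ("Invoice 4432 has the wrong billing address", "finance"),
]

SYSTEM = "Classify the support ticket into one of: it_access, finance, hr, facilities."

with open("distillation_train.jsonl", "w", encoding="utf-8") as f:
    for ticket, label in teacher_outputs:
        record = {
            "messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": ticket},
                {"role": "assistant", "content": label},
            ]
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
# The resulting JSONL then feeds a LoRA/PEFT fine-tune of the small model.
```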
Strategy #5: Self-Hosting for High Volume¶
Above a certain volume, self-hosting is cheaper than API. The break-even point depends on the model and utilisation:
| Setup | Monthly cost | Break-even vs API |
|---|---|---|
| Llama 3.3 70B on 2× A100 (cloud) | ~$4,500 | ~150K queries/day vs GPT-4.1 |
| Llama 3.1 8B on 1× L40S (cloud) | ~$800 | ~25K queries/day vs GPT-4.1 mini |
| Mistral 7B on-premise (1× A100) | ~$200 (electricity) | Immediately, but CapEx $15K–25K |
Self-hosting makes sense when: (a) volume exceeds break-even, (b) data must not leave your infrastructure (regulation, compliance), or (c) you need a custom model and fine-tuning is simpler locally.
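A rough way to sanity-check the break-even point for your own numbers, under a simplifying assumption: self-hosting is treated as a fixed monthly cost and the ops/engineering overhead of running GPUs is ignored, so the result is a lower bound on the volume you need:

```python
def breakeven_queries_per_day(monthly_selfhost_usd: float,
                              api_cost_per_query_usd: float,
                              days: float = 30) -> float:
    """Daily volume at which the API bill matches the fixed self-hosting cost."""
    return monthly_selfhost_usd / (api_cost_per_query_usd * days)

# 1x L40S setup vs GPT-4.1 mini at the $0.0016/query figure from the tables above;
# add your own ops overhead to monthly_selfhost_usd for a more realistic threshold.
print(breakeven_queries_per_day(800, 0.0016))  # ~16,700 queries/day before overhead
```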
Bonus: Prompt Caching from Providers¶
Both Anthropic and OpenAI offer prompt caching at the API level — repeated prefixes (system prompt, conversation context) are cached and charged at a discount:
- Anthropic: Cached input at 10% of the standard price (90% discount). Cache write at 125% of the standard price. TTL 5 minutes
- OpenAI: Automatic caching for repeated prefixes. Cached input at 50% of the standard price
- Impact: For a RAG pipeline with a 1,500-token system prompt and 500 tokens of per-query context, a cache hit cuts the cost of the cached prefix by 50–90%, i.e. roughly 40–70% of total input spend
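Enabling it on the Anthropic side is a small change to the request. A sketch using the SDK's cache_control marker (the model id is illustrative; OpenAI's prefix caching is automatic and needs no flag):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # ~1,500 tokens of instructions, identical on every call

response = client.messages.create(
    model="claude-sonnet-4-0",              # illustrative model id
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # mark the prefix as cacheable
        }
    ],
    messages=[{"role": "user", "content": "What is our parental leave policy?"}],
)
print(response.usage)  # cache_read_input_tokens shows how much the cache saved
```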
Optimisation Roadmap: From Day 1 to Month 6¶
- Week 1–2: Instrumentation — Add metrics: cost per request, tokens in/out, latency, model (see the sketch after this roadmap). You cannot optimise what you do not measure
- Week 3–4: Prompt optimisation — Shorten prompts, add a reranker, set max_tokens. Savings: 20–30%
- Month 2: Semantic caching — Implement caching for repeated queries. Savings: another 20–40%
- Month 3: Model routing — Classifier + multi-model setup. Savings: another 30–50%
- Month 4–6: Distillation/self-hosting — For high-volume, well-defined tasks. Savings: another 50–80% on those tasks
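A minimal sketch of the week-1 instrumentation, emitting one structured record per request; the field names are illustrative, and in practice you would ship these to whatever metrics or observability stack you already run:

```python
import json, time, uuid

def log_llm_call(model: str, input_tokens: int, output_tokens: int,
                 latency_s: float, cost_usd: float) -> None:
    """Emit one structured record per LLM request."""
    print(json.dumps({
        "request_id": str(uuid.uuid4()),
        "ts": time.time(),
        "model": model,
        "tokens_in": input_tokens,
        "tokens_out": output_tokens,
        "latency_s": round(latency_s, 3),
        "cost_usd": round(cost_usd, 6),
    }))

# Wrap every call site: measure latency, read token counts from the provider's
# usage object, compute cost from your price table, then log the record.
```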
Conclusion¶
AI in production does not have to cost hundreds of thousands. But without optimisation, it will. Key takeaways:
- Price per token is only part of TCO — engineering time, observability and infrastructure are often more expensive than the API
- Model routing is the single biggest win — 60–80% savings with minimal quality loss
- Semantic caching is a quick win with ROI within 2 weeks
- Self-hosting makes sense from 100K+ queries/day or when compliance requires it
- Start with instrumentation — you cannot optimise what you do not measure
Need help with implementation?
Our experts can help with design, implementation, and operations. From architecture to production.
Contact us