
Real AI Costs in Production 2026: Optimisation from API to GPU

07. 02. 2026 · 7 min read

“AI is cheap,” say the vendor slides. Reality: an enterprise running 50,000 queries per day against a GPT-4-class model pays $15,000–$45,000 per month for inference alone. And that does not include embeddings, fine-tuning or infrastructure. This is a guide to the real costs — and to the strategies that reduce them by 50–80%.

Pricing Landscape at the Start of 2026

The LLM API market has gone through a massive price war over the past year. Prices have dropped 60–90% compared to early 2024. But beware — the price per token is only part of the story. Real costs depend on how many tokens you generate, and output tokens are 3–5× more expensive than input.

Model (Q1 2026) Input / 1M tokens Output / 1M tokens Typical use case
GPT-4.1 $2.00 $8.00 General purpose, coding
GPT-4.1 mini $0.40 $1.60 Cost-efficient tasks
Claude Sonnet 4 $3.00 $15.00 Complex reasoning, coding
Claude Haiku 3.5 $0.80 $4.00 Fast responses, classification
Claude Opus 4 $15.00 $75.00 Frontier reasoning
Gemini 2.5 Pro $1.25 $10.00 Multimodal, long context
Gemini 2.5 Flash $0.15 $0.60 High-volume, low-cost
DeepSeek V3 $0.28 $0.42 Budget reasoning
Llama 3.3 70B (self-hosted) ~$0.20* ~$0.20* On-premise, data sovereignty

* Self-hosted price is approximate — depends on GPU hardware, utilisation and amortisation. Includes A100/H100 hosting + electricity.

What One Query Costs: Cost per Query Breakdown

A typical enterprise query (RAG pipeline with context) averages 2,000 input tokens (prompt + retrieved context) and 500 output tokens (response). Based on this:

Model Cost per query 50K queries/day Monthly
GPT-4.1 $0.008 $400 $12,000
GPT-4.1 mini $0.0016 $80 $2,400
Claude Sonnet 4 $0.0135 $675 $20,250
Claude Haiku 3.5 $0.0036 $180 $5,400
Gemini 2.5 Flash $0.0006 $30 $900
DeepSeek V3 $0.00077 $38.50 $1,155

The difference between the most expensive and cheapest option is 22×. And we are talking about a simple RAG query. For agentic systems where a single user request generates 5–15 LLM calls, costs multiply accordingly.
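
The per-query figures above follow directly from the token counts and the price table. A minimal Python sketch of the arithmetic, using the GPT-4.1 prices as an example (token counts and prices are taken from the tables above):

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of a single LLM call, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Typical enterprise RAG query: 2,000 input tokens, 500 output tokens
# GPT-4.1: $2.00 / 1M input tokens, $8.00 / 1M output tokens
per_query = cost_per_query(2_000, 500, 2.00, 8.00)   # $0.008
monthly = per_query * 50_000 * 30                    # $12,000 at 50K queries/day
print(f"${per_query:.4f} per query, ${monthly:,.0f} per month")
```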

Hidden Costs Vendors Do Not Mention

API pricing is the tip of the iceberg. The full TCO includes:

  • Embedding generation — every document in the knowledge base must go through an embedding model. For 100K documents that is a one-off $50–200, but re-indexing after updates is a recurring cost
  • Vector database hosting — Pinecone $70+/month, managed Qdrant $100+/month, self-hosted requires RAM (1M vectors ≈ 4–8 GB RAM)
  • Prompt engineering and evals — 20–40% of engineering time goes into prompts, testing and iteration. This is often the single largest cost item
  • Observability — LangSmith, Langfuse, custom — $200–2,000/month for production monitoring
  • Guardrails and safety — content filtering, PII detection, compliance checks — additional latency and costs
  • Retry and error handling — rate limits, 5xx errors, timeout retries = 10–20% extra calls

Real-World Example: Enterprise Chatbot

A company with 2,000 employees runs an internal knowledge-base chatbot: 50,000 queries/day through a RAG pipeline with Claude Sonnet.

API inference: $20,250/month · Embeddings + vector DB: $500/month · Observability: $500/month · Engineering (0.5 FTE): $5,000/month

Total: ~$26,250/month = $315,000/year

Strategy #1: Semantic Caching

The simplest and most effective optimisation. 30–60% of queries in enterprise chatbots are repeated (or semantically similar). Instead of a new LLM call, you return a cached response.

  • How it works: Query → embedding → similarity search in cache → if similarity > 0.95, return cached response (see the sketch after this list)
  • Tools: GPTCache, Redis + vector search, custom implementation with pgvector
  • Typical savings: 30–50% of API calls, latency from 2–5s to <100ms for cache hits
  • Watch out for: Cache invalidation on knowledge base changes, TTL policy, cache poisoning
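
A minimal sketch of the lookup flow from the “How it works” bullet, using an in-memory cache and cosine similarity. The embedding model, the 0.95 threshold and the in-memory list are illustrative assumptions; production setups typically back this with Redis or pgvector and add TTL-based invalidation:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # any embedding model works

model = SentenceTransformer("all-MiniLM-L6-v2")
cache: list[tuple[np.ndarray, str]] = []   # (normalised query embedding, cached response)
SIMILARITY_THRESHOLD = 0.95                # from the bullet above; tune per use case

def embed(text: str) -> np.ndarray:
    v = model.encode(text)
    return v / np.linalg.norm(v)           # normalise so dot product = cosine similarity

def answer(query: str, call_llm) -> str:
    q = embed(query)
    # 1. Similarity search over cached queries
    for cached_q, cached_response in cache:
        if float(np.dot(q, cached_q)) >= SIMILARITY_THRESHOLD:
            return cached_response         # cache hit: no LLM call, sub-100 ms
    # 2. Cache miss: call the LLM and store the result
    response = call_llm(query)
    cache.append((q, response))
    return response
```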

Strategy #2: Model Routing (Smart Cascading)

Not every query needs a frontier model. “How many employees do we have?” can be handled by a model at $0.0006/query. “Analyse this contract and identify risks” needs a model at $0.013/query.

  • Principle: A classifier (small model or rule-based) evaluates query complexity and routes to the appropriate model
  • Architecture: Input → Complexity classifier → Router → [Small model | Medium model | Large model] (see the sketch after this list)
  • Typical split: 60% small model, 30% medium, 10% large = average cost drops by 60–70%
  • Tools: Martian, Portkey, Unify.ai, or a custom router with embeddings-based classification
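
A minimal sketch of the cascade from the “Architecture” bullet. The rule-based classifier and the model names are illustrative assumptions; in practice the classifier is usually a small trained model or an embeddings-based one:

```python
def classify_complexity(query: str) -> str:
    """Toy rule-based classifier; replace with a small model or
    embeddings-based classification in production."""
    if any(k in query.lower() for k in ("analyse", "contract", "risk", "step by step")):
        return "large"        # multi-step reasoning over documents
    if len(query.split()) > 40:
        return "medium"
    return "small"            # short factual lookups

# Illustrative tiers; per-query costs from the breakdown table above
MODELS = {
    "small":  {"model": "gemini-2.5-flash", "cost_per_query": 0.0006},
    "medium": {"model": "gpt-4.1-mini",     "cost_per_query": 0.0016},
    "large":  {"model": "claude-sonnet-4",  "cost_per_query": 0.0135},
}

def route(query: str) -> dict:
    """Input → complexity classifier → router → model tier."""
    return MODELS[classify_complexity(query)]

print(route("How many employees do we have?"))   # routed to the small tier
```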

Routing in Practice: 84% Savings

Without routing: 50,000 queries/day × Claude Sonnet = $20,250/month

With routing: 30,000/day × Gemini Flash ($540) + 15,000/day × GPT-4.1 mini ($720) + 5,000/day × Claude Sonnet ($2,025) = $3,285/month

Savings: $16,965/month (84%)

Strategy #3: Prompt Optimisation

Every unnecessary token costs money. And most prompts are 2–3× longer than they need to be.

  • System prompt audit: Shorten system prompts. 500 tokens of instructions → 150 tokens with the same result = 70% savings on system prompt overhead
  • Context window management: Do not send the entire conversation history. Summarise, trim or use a sliding window (see the sketch after this list)
  • Retrieved context pruning: RAG often returns 5–10 chunks. A reranker (Cohere Rerank, BGE Reranker) selects the top 2–3 and discards the rest
  • Output length control: Set max_tokens. Without a limit, the model generates until it decides to stop — and output tokens are 3–5× more expensive
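
A minimal sketch of two of the items above: a sliding window over conversation history (counted with tiktoken) and an explicit output cap. The 2,000-token budget and the 400-token cap are illustrative assumptions, and the exact name of the output-limit parameter varies by provider and SDK:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # approximate tokeniser; match your model

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def sliding_window(history: list[dict], budget: int = 2_000) -> list[dict]:
    """Keep only the most recent messages that fit the token budget,
    instead of sending the full conversation history every turn."""
    kept, used = [], 0
    for msg in reversed(history):
        tokens = count_tokens(msg["content"])
        if used + tokens > budget:
            break
        kept.append(msg)
        used += tokens
    return list(reversed(kept))

conversation_history: list[dict] = []        # [{"role": ..., "content": ...}, ...]
request = {
    "messages": sliding_window(conversation_history),
    "max_tokens": 400,                       # output cap: output tokens cost 3–5× more
}
```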

Strategy #4: Knowledge Distillation

Have a frontier model that handles your use case perfectly? Distil its knowledge into a smaller model. Result: 90% of the quality at 10% of the cost.

  • Process: Large model generates training data → Fine-tune a small model on that data → Deploy the small model (see the sketch after this list)
  • Example: GPT-4 generates 10,000 examples for ticket classification → Fine-tune Llama 3.1 8B → Deploy on your own GPU at $0.0002/query
  • When it works: Tasks with clearly defined scope (classification, extraction, summarisation). Does not work for open-ended reasoning
  • Tools: OpenAI fine-tuning API, Anyscale, Modal, custom training pipeline with PEFT/LoRA
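
A minimal sketch of the data-generation half of the process bullet: the large teacher model labels raw tickets into a chat-style JSONL file, which is then used to fine-tune the small student model (e.g. with PEFT/LoRA, as listed above). The teacher_call() helper and the category list are illustrative assumptions:

```python
import json

CATEGORIES = ["billing", "access", "hardware", "other"]   # illustrative label set

def build_distillation_set(tickets: list[str], teacher_call, path: str = "train.jsonl"):
    """Label examples with the teacher model and write them in the chat-style
    JSONL format that most fine-tuning pipelines accept."""
    with open(path, "w", encoding="utf-8") as f:
        for ticket in tickets:
            label = teacher_call(
                f"Classify this support ticket into one of {CATEGORIES}. "
                f"Reply with the category only.\n\n{ticket}"
            ).strip()
            record = {"messages": [
                {"role": "user", "content": ticket},
                {"role": "assistant", "content": label},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```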

Strategy #5: Self-Hosting for High Volume

Above a certain volume, self-hosting is cheaper than API. The break-even point depends on the model and utilisation:

Setup Monthly cost Break-even vs API
Llama 3.3 70B on 2× A100 (cloud) ~$4,500 ~150K queries/day vs GPT-4.1
Llama 3.1 8B on 1× L40S (cloud) ~$800 ~25K queries/day vs GPT-4.1 mini
Mistral 7B on-premise (1× A100) ~$200 (electricity) Immediately, but CapEx $15K–25K

Self-hosting makes sense when: (a) volume exceeds break-even, (b) data must not leave your infrastructure (regulation, compliance), or (c) you need a custom model and fine-tuning is simpler locally.
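
A simplified break-even calculation, assuming a fixed monthly infrastructure cost, a flat per-query API price from the breakdown table, and an ops overhead you estimate yourself. It ignores CapEx, redundancy, throughput ceilings and engineering time, which is why the thresholds in the table above are considerably more conservative:

```python
def break_even_queries_per_day(monthly_gpu_cost: float,
                               monthly_ops_cost: float,
                               api_cost_per_query: float,
                               days: int = 30) -> float:
    """Daily query volume at which the self-hosted bill matches the API bill."""
    return (monthly_gpu_cost + monthly_ops_cost) / (days * api_cost_per_query)

# Illustrative: 2× A100 at ~$4,500/month plus $5,000/month ops, vs GPT-4.1 at $0.008/query
print(break_even_queries_per_day(4_500, 5_000, 0.008))   # ≈ 39,600 queries/day
```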

Bonus: Prompt Caching from Providers

Both Anthropic and OpenAI offer prompt caching at the API level — repeated prefixes (system prompt, conversation context) are cached and charged at a discount:

  • Anthropic: Cached input at 10% of the standard price (90% discount). Cache write at 125% of the standard price. TTL 5 minutes
  • OpenAI: Automatic caching for repeated prefixes. Cached input at 50% of the standard price
  • Impact: For a RAG pipeline with 1,500 tokens of system prompt and 500 tokens of context — a cache hit saves 50–90% of input costs
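
A minimal sketch of marking a long system prompt as cacheable with the Anthropic Python SDK. The model ID is a placeholder, and the cache_control behaviour and discounts should be verified against the current documentation:

```python
import anthropic

client = anthropic.Anthropic()      # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."          # e.g. the 1,500-token system prompt mentioned above

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # placeholder model ID
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to and including this block is cached; later requests
            # with the same prefix are billed at the cached-input rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "How many vacation days do I have left?"}],
)
```

OpenAI's caching is automatic for repeated prefixes, so no code change is needed there beyond keeping the stable part of the prompt at the beginning.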

Optimisation Roadmap: From Day 1 to Month 6

  1. Week 1–2: Instrumentation — Add metrics: cost per request, tokens in/out, latency, model (see the sketch after this list). You cannot optimise what you do not measure
  2. Week 3–4: Prompt optimisation — Shorten prompts, add a reranker, set max_tokens. Savings: 20–30%
  3. Month 2: Semantic caching — Implement caching for repeated queries. Savings: another 20–40%
  4. Month 3: Model routing — Classifier + multi-model setup. Savings: another 30–50%
  5. Month 4–6: Distillation/self-hosting — For high-volume, well-defined tasks. Savings: another 50–80% on those tasks
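
A minimal sketch of the step-1 instrumentation: a wrapper that records cost per request, tokens in/out, latency and model for every call. The price table, the call_llm() wrapper and logging to stdout are illustrative assumptions; in practice the record goes to your metrics pipeline:

```python
import json
import time

# Per-million-token prices keyed by model (values from the pricing table above)
PRICES = {"gpt-4.1": (2.00, 8.00), "gpt-4.1-mini": (0.40, 1.60)}

def instrumented_call(model: str, prompt: str, call_llm) -> str:
    """Wrap any LLM call and emit one structured log line per request."""
    start = time.perf_counter()
    text, tokens_in, tokens_out = call_llm(model, prompt)   # your provider wrapper
    latency_ms = (time.perf_counter() - start) * 1000
    in_price, out_price = PRICES[model]
    cost = (tokens_in * in_price + tokens_out * out_price) / 1_000_000
    print(json.dumps({
        "model": model,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "latency_ms": round(latency_ms, 1),
        "cost_usd": round(cost, 6),
    }))
    return text
```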

Conclusion

AI in production does not have to cost hundreds of thousands. But without optimisation, it will. Key takeaways:

  • Price per token is only part of TCO — engineering time, observability and infrastructure are often more expensive than the API
  • Model routing is the single biggest win — 60–80% savings with minimal quality loss
  • Semantic caching is a quick win with ROI within 2 weeks
  • Self-hosting makes sense from 100K+ queries/day or when compliance requires it
  • Start with instrumentation — you cannot optimise what you do not measure
