
Real AI Costs in Production 2026: Optimisation from API to GPU

07. 02. 2026 · 7 min read

“AI is cheap,” say the vendor slides. Reality: an enterprise running 50,000 queries per day against a GPT-4-class model pays $15,000–$45,000 per month for inference alone. And that does not include embeddings, fine-tuning or infrastructure. This is a guide to the real costs — and to the strategies that reduce them by 50–80%.

Pricing Landscape at the Start of 2026

The LLM API market has gone through a massive price war over the past year. Prices have dropped 60–90% compared to early 2024. But beware — the price per token is only part of the story. Real costs depend on how many tokens you generate, and output tokens are 3–5× more expensive than input.

Model (Q1 2026) Input / 1M tokens Output / 1M tokens Typical use case
GPT-4.1 $2.00 $8.00 General purpose, coding
GPT-4.1 mini $0.40 $1.60 Cost-efficient tasks
Claude Sonnet 4 $3.00 $15.00 Complex reasoning, coding
Claude Haiku 3.5 $0.80 $4.00 Fast responses, classification
Claude Opus 4 $15.00 $75.00 Frontier reasoning
Gemini 2.5 Pro $1.25 $10.00 Multimodal, long context
Gemini 2.5 Flash $0.15 $0.60 High-volume, low-cost
DeepSeek V3 $0.28 $0.42 Budget reasoning
Llama 3.3 70B (self-hosted) ~$0.20* ~$0.20* On-premise, data sovereignty

* Self-hosted price is approximate — depends on GPU hardware, utilisation and amortisation. Includes A100/H100 hosting + electricity.

What One Query Costs: Cost per Query Breakdown

A typical enterprise query (RAG pipeline with context) averages 2,000 input tokens (prompt + retrieved context) and 500 output tokens (response). Based on this:

Model Cost per query 50K queries/day Monthly
GPT-4.1 $0.008 $400 $12,000
GPT-4.1 mini $0.0016 $80 $2,400
Claude Sonnet 4 $0.0135 $675 $20,250
Claude Haiku 3.5 $0.0036 $180 $5,400
Gemini 2.5 Flash $0.0006 $30 $900
DeepSeek V3 $0.00077 $38.50 $1,155

The difference between the most expensive and cheapest option is 22×. And we are talking about a simple RAG query. For agentic systems where a single user request generates 5–15 LLM calls, costs multiply accordingly.
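
The per-query figures above follow directly from the token counts and the price table. A minimal Python sketch of the arithmetic, using the GPT-4.1 prices as an example (token counts and prices are taken from the tables above):

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of a single LLM call, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Typical enterprise RAG query: 2,000 input tokens, 500 output tokens
# GPT-4.1: $2.00 / 1M input tokens, $8.00 / 1M output tokens
per_query = cost_per_query(2_000, 500, 2.00, 8.00)   # $0.008
monthly = per_query * 50_000 * 30                    # $12,000 at 50K queries/day
print(f"${per_query:.4f} per query, ${monthly:,.0f} per month")
```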

Hidden Costs Vendors Do Not Mention

API pricing is the tip of the iceberg. The full TCO includes:

  • Embedding generation — every document in the knowledge base must go through an embedding model. For 100K documents that is a one-off $50–200, but re-indexing after updates is a recurring cost
  • Vector database hosting — Pinecone $70+/month, managed Qdrant $100+/month, self-hosted requires RAM (1M vectors ≈ 4–8 GB RAM)
  • Prompt engineering and evals — 20–40% of engineering time goes into prompts, testing and iteration. This is often the single largest cost item
  • Observability — LangSmith, Langfuse, custom — $200–2,000/month for production monitoring
  • Guardrails and safety — content filtering, PII detection, compliance checks — additional latency and costs
  • Retry and error handling — rate limits, 5xx errors, timeout retries = 10–20% extra calls

Real-World Example: Enterprise Chatbot

A company with 2,000 employees runs an internal knowledge-base chatbot: 50,000 queries/day through a RAG pipeline with Claude Sonnet.

API inference: $20,250/month · Embeddings + vector DB: $500/month · Observability: $500/month · Engineering (0.5 FTE): $5,000/month

Total: ~$26,250/month = $315,000/year

Strategy #1: Semantic Caching

The simplest and most effective optimisation. 30–60% of queries in enterprise chatbots are repeated (or semantically similar). Instead of a new LLM call, you return a cached response.

  • How it works: Query → embedding → similarity search in cache → if similarity > 0.95, return cached response (see the sketch after this list)
  • Tools: GPTCache, Redis + vector search, custom implementation with pgvector
  • Typical savings: 30–50% of API calls, latency from 2–5s to <100ms for cache hits
  • Watch out for: Cache invalidation on knowledge base changes, TTL policy, cache poisoning
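
A minimal sketch of the lookup flow from the “How it works” bullet, using an in-memory cache and cosine similarity. The embedding model, the 0.95 threshold and the in-memory list are illustrative assumptions; production setups typically back this with Redis or pgvector and add TTL-based invalidation:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # any embedding model works

model = SentenceTransformer("all-MiniLM-L6-v2")
cache: list[tuple[np.ndarray, str]] = []   # (normalised query embedding, cached response)
SIMILARITY_THRESHOLD = 0.95                # from the bullet above; tune per use case

def embed(text: str) -> np.ndarray:
    v = model.encode(text)
    return v / np.linalg.norm(v)           # normalise so dot product = cosine similarity

def answer(query: str, call_llm) -> str:
    q = embed(query)
    # 1. Similarity search over cached queries
    for cached_q, cached_response in cache:
        if float(np.dot(q, cached_q)) >= SIMILARITY_THRESHOLD:
            return cached_response         # cache hit: no LLM call, sub-100 ms
    # 2. Cache miss: call the LLM and store the result
    response = call_llm(query)
    cache.append((q, response))
    return response
```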

Strategy #2: Model Routing (Smart Cascading)

Not every query needs a frontier model. “How many employees do we have?” can be handled by a model at $0.0006/query. “Analyse this contract and identify risks” needs a model at $0.013/query.

  • Principle: A classifier (small model or rule-based) evaluates query complexity and routes to the appropriate model
  • Architecture: Input → Complexity classifier → Router → [Small model | Medium model | Large model] (see the sketch after this list)
  • Typical split: 60% small model, 30% medium, 10% large = average cost drops by 60–70%
  • Tools: Martian, Portkey, Unify.ai, or a custom router with embeddings-based classification
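
A minimal sketch of the cascade from the “Architecture” bullet. The rule-based classifier and the model names are illustrative assumptions; in practice the classifier is usually a small trained model or an embeddings-based one:

```python
def classify_complexity(query: str) -> str:
    """Toy rule-based classifier; replace with a small model or
    embeddings-based classification in production."""
    if any(k in query.lower() for k in ("analyse", "contract", "risk", "step by step")):
        return "large"        # multi-step reasoning over documents
    if len(query.split()) > 40:
        return "medium"
    return "small"            # short factual lookups

# Illustrative tiers; per-query costs from the breakdown table above
MODELS = {
    "small":  {"model": "gemini-2.5-flash", "cost_per_query": 0.0006},
    "medium": {"model": "gpt-4.1-mini",     "cost_per_query": 0.0016},
    "large":  {"model": "claude-sonnet-4",  "cost_per_query": 0.0135},
}

def route(query: str) -> dict:
    """Input → complexity classifier → router → model tier."""
    return MODELS[classify_complexity(query)]

print(route("How many employees do we have?"))   # routed to the small tier
```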

Routing in Practice: 84% Savings

Without routing: 50,000 queries/day × Claude Sonnet = $20,250/month

With routing: 30,000/day × Gemini Flash ($540) + 15,000/day × GPT-4.1 mini ($720) + 5,000/day × Claude Sonnet ($2,025) = $3,285/month

Savings: $16,965/month (84%)

Strategy #3: Prompt Optimisation

Every unnecessary token costs money. And most prompts are 2–3× longer than they need to be.

  • System prompt audit: Shorten system prompts. 500 tokens of instructions → 150 tokens with the same result = 70% savings on system prompt overhead
  • Context window management: Do not send the entire conversation history. Summarise, trim or use a sliding window (see the sketch after this list)
  • Retrieved context pruning: RAG often returns 5–10 chunks. A reranker (Cohere Rerank, BGE Reranker) selects the top 2–3 and discards the rest
  • Output length control: Set max_tokens. Without a limit, the model generates until it decides to stop — and output tokens are 3–5× more expensive
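
A minimal sketch of two of the items above: a sliding window over conversation history (counted with tiktoken) and an explicit output cap. The 2,000-token budget and the 400-token cap are illustrative assumptions, and the exact name of the output-limit parameter varies by provider and SDK:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # approximate tokeniser; match your model

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def sliding_window(history: list[dict], budget: int = 2_000) -> list[dict]:
    """Keep only the most recent messages that fit the token budget,
    instead of sending the full conversation history every turn."""
    kept, used = [], 0
    for msg in reversed(history):
        tokens = count_tokens(msg["content"])
        if used + tokens > budget:
            break
        kept.append(msg)
        used += tokens
    return list(reversed(kept))

conversation_history: list[dict] = []        # [{"role": ..., "content": ...}, ...]
request = {
    "messages": sliding_window(conversation_history),
    "max_tokens": 400,                       # output cap: output tokens cost 3–5× more
}
```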

Strategy #4: Knowledge Distillation

Have a frontier model that handles your use case perfectly? Distil its knowledge into a smaller model. Result: 90% of the quality at 10% of the cost.

  • Process: Large model generates training data → Fine-tune a small model on that data → Deploy the small model (see the sketch after this list)
  • Example: GPT-4 generates 10,000 examples for ticket classification → Fine-tune Llama 3.1 8B → Deploy on your own GPU at $0.0002/query
  • When it works: Tasks with clearly defined scope (classification, extraction, summarisation). Does not work for open-ended reasoning
  • Tools: OpenAI fine-tuning API, Anyscale, Modal, custom training pipeline with PEFT/LoRA
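
A minimal sketch of the data-generation half of the process bullet: the large teacher model labels raw tickets into a chat-style JSONL file, which is then used to fine-tune the small student model (e.g. with PEFT/LoRA, as listed above). The teacher_call() helper and the category list are illustrative assumptions:

```python
import json

CATEGORIES = ["billing", "access", "hardware", "other"]   # illustrative label set

def build_distillation_set(tickets: list[str], teacher_call, path: str = "train.jsonl"):
    """Label examples with the teacher model and write them in the chat-style
    JSONL format that most fine-tuning pipelines accept."""
    with open(path, "w", encoding="utf-8") as f:
        for ticket in tickets:
            label = teacher_call(
                f"Classify this support ticket into one of {CATEGORIES}. "
                f"Reply with the category only.\n\n{ticket}"
            ).strip()
            record = {"messages": [
                {"role": "user", "content": ticket},
                {"role": "assistant", "content": label},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```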

Strategy #5: Self-Hosting for High Volume

Above a certain volume, self-hosting is cheaper than API. The break-even point depends on the model and utilisation:

Setup Monthly cost Break-even vs API
Llama 3.3 70B on 2× A100 (cloud) ~$4,500 ~150K queries/day vs GPT-4.1
Llama 3.1 8B on 1× L40S (cloud) ~$800 ~25K queries/day vs GPT-4.1 mini
Mistral 7B on-premise (1× A100) ~$200 (electricity) Immediately, but CapEx $15K–25K

Self-hosting makes sense when: (a) volume exceeds break-even, (b) data must not leave your infrastructure (regulation, compliance), or (c) you need a custom model and fine-tuning is simpler locally.
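
A simplified break-even calculation, assuming a fixed monthly infrastructure cost, a flat per-query API price from the breakdown table, and an ops overhead you estimate yourself. It ignores CapEx, redundancy, throughput ceilings and engineering time, which is why the thresholds in the table above are considerably more conservative:

```python
def break_even_queries_per_day(monthly_gpu_cost: float,
                               monthly_ops_cost: float,
                               api_cost_per_query: float,
                               days: int = 30) -> float:
    """Daily query volume at which the self-hosted bill matches the API bill."""
    return (monthly_gpu_cost + monthly_ops_cost) / (days * api_cost_per_query)

# Illustrative: 2× A100 at ~$4,500/month plus $5,000/month ops, vs GPT-4.1 at $0.008/query
print(break_even_queries_per_day(4_500, 5_000, 0.008))   # ≈ 39,600 queries/day
```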

Bonus: Prompt Caching from Providers

Both Anthropic and OpenAI offer prompt caching at the API level — repeated prefixes (system prompt, conversation context) are cached and charged at a discount:

  • Anthropic: Cached input at 10% of the standard price (90% discount). Cache write at 125% of the standard price. TTL 5 minutes
  • OpenAI: Automatic caching for repeated prefixes. Cached input at 50% of the standard price
  • Impact: For a RAG pipeline with 1,500 tokens of system prompt and 500 tokens of context — a cache hit saves 50–90% of input costs
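
A minimal sketch of marking a long system prompt as cacheable with the Anthropic Python SDK. The model ID is a placeholder, and the cache_control behaviour and discounts should be verified against the current documentation:

```python
import anthropic

client = anthropic.Anthropic()      # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."          # e.g. the 1,500-token system prompt mentioned above

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # placeholder model ID
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to and including this block is cached; later requests
            # with the same prefix are billed at the cached-input rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "How many vacation days do I have left?"}],
)
```

OpenAI's caching is automatic for repeated prefixes, so no code change is needed there beyond keeping the stable part of the prompt at the beginning.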

Optimisation Roadmap: From Day 1 to Month 6

  1. Week 1–2: Instrumentation — Add metrics: cost per request, tokens in/out, latency, model (see the sketch after this list). You cannot optimise what you do not measure
  2. Week 3–4: Prompt optimisation — Shorten prompts, add a reranker, set max_tokens. Savings: 20–30%
  3. Month 2: Semantic caching — Implement caching for repeated queries. Savings: another 20–40%
  4. Month 3: Model routing — Classifier + multi-model setup. Savings: another 30–50%
  5. Month 4–6: Distillation/self-hosting — For high-volume, well-defined tasks. Savings: another 50–80% on those tasks
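
A minimal sketch of the step-1 instrumentation: a wrapper that records cost per request, tokens in/out, latency and model for every call. The price table, the call_llm() wrapper and logging to stdout are illustrative assumptions; in practice the record goes to your metrics pipeline:

```python
import json
import time

# Per-million-token prices keyed by model (values from the pricing table above)
PRICES = {"gpt-4.1": (2.00, 8.00), "gpt-4.1-mini": (0.40, 1.60)}

def instrumented_call(model: str, prompt: str, call_llm) -> str:
    """Wrap any LLM call and emit one structured log line per request."""
    start = time.perf_counter()
    text, tokens_in, tokens_out = call_llm(model, prompt)   # your provider wrapper
    latency_ms = (time.perf_counter() - start) * 1000
    in_price, out_price = PRICES[model]
    cost = (tokens_in * in_price + tokens_out * out_price) / 1_000_000
    print(json.dumps({
        "model": model,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "latency_ms": round(latency_ms, 1),
        "cost_usd": round(cost, 6),
    }))
    return text
```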

Conclusion

AI in production does not have to cost hundreds of thousands. But without optimisation, it will. Key takeaways:

  • Price per token is only part of TCO — engineering time, observability and infrastructure are often more expensive than the API
  • Model routing is the single biggest win — 60–80% savings with minimal quality loss
  • Semantic caching is a quick win with ROI within 2 weeks
  • Self-hosting makes sense from 100K+ queries/day or when compliance requires it
  • Start with instrumentation — you cannot optimise what you do not measure
