Fine-tuning & optimization
The right model for the right task.
Fine-tuning, knowledge distillation, inference optimization. Because using GPT-4 for every query is like driving a truck to buy bread.
When to fine-tune
Fine-tuning isn’t always the right answer. Most problems can be solved with better prompts, better context (RAG), or better orchestration. Fine-tuning makes sense in specific situations:
Decision tree
AI quality problem?
│
├── Missing knowledge → RAG (add context)
│
├── Wrong output format → Prompt engineering
│
├── Inconsistent behavior → Few-shot examples in prompt
│
├── Still insufficient? → Fine-tuning
│
├── Need to reduce costs → Distillation (large → small model)
│
└── Regulations (on-premise) → Fine-tune open-source + deploy locally
Specific indications for fine-tuning
- Domain-specific language — the model doesn’t understand your terminology even with context (medicine, law, other specialized fields)
- Consistent format — you always need the same output structure (JSON schema, table, a specific template)
- Costs — GPT-4 for 10,000 queries daily is expensive; a fine-tuned Llama 8B does the job for a fraction of the cost
- Latency — large model = slow; small fine-tuned model = fast (<200ms)
- Data residency — data cannot leave your environment; on-premise = open-source + fine-tuning
Knowledge distillation
The most common form of fine-tuning in practice: a large model (GPT-4, Claude) teaches a small model (Llama 8B, Mistral 7B) to do a specific task.
How it works
┌──────────────────────────────────────────────────┐
│ TEACHER MODEL (GPT-4, Claude) │
│ - Large, expensive, slow │
│ - Excellent quality │
│ - Generates training data for student │
└──────────────────┬───────────────────────────────┘
│ Synthetic training data
│ (1000-5000 examples)
▼
┌──────────────────────────────────────────────────┐
│ FINE-TUNING PIPELINE │
│ - Data cleaning & dedup │
│ - LoRA/QLoRA fine-tuning │
│ - Evaluation against teacher │
│ - Iteration (3-5 cycles) │
└──────────────────┬───────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ STUDENT MODEL (Llama 8B, Mistral 7B) │
│ - Small, cheap, fast │
│ - 85-95% teacher quality on target domain │
│ - Production deployment (API or on-premise) │
└──────────────────────────────────────────────────┘
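The data-generation step above can be sketched in a few lines. `call_teacher` is a hypothetical stand-in for a real teacher API call (stubbed here so the pipeline logic runs end-to-end); the JSONL chat format is the one most fine-tuning tooling expects:

```python
import json

def call_teacher(prompt: str) -> str:
    # Hypothetical teacher call; in practice, an API client for GPT-4/Claude.
    return f"Answer for: {prompt}"

def build_distillation_set(prompts, out_path="train.jsonl"):
    """Query the teacher once per prompt and store (prompt, answer)
    pairs as JSONL chat records."""
    records = []
    for prompt in prompts:
        records.append({
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": call_teacher(prompt)},
            ]
        })
    with open(out_path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return records

data = build_distillation_set(["What is LoRA?", "Explain KV-cache."])
```

Real pipelines add retries, cost tracking, and sampling-temperature sweeps on top of this loop; the data then flows into the cleaning and fine-tuning stages shown above.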
Results in practice
| Metric | GPT-4 (teacher) | Fine-tuned Llama 8B | Difference |
|---|---|---|---|
| Quality (domain-specific) | 95% | 89% | -6% |
| Latency P95 | 2.5s | 180ms | 14x faster |
| Cost per query | $0.03 | $0.002 | 15x cheaper |
| Data residency | Cloud (US/EU) | On-premise | ✅ |
Typical break-even: with 1000+ queries/day, fine-tuning pays for itself in 2-3 months.
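The break-even claim can be sanity-checked with back-of-the-envelope arithmetic using the per-query costs from the table; the one-off fine-tuning cost below is an assumed illustrative figure, not a quote:

```python
QUERIES_PER_DAY = 1_000
COST_TEACHER = 0.03     # $ per query, GPT-4 class (from the table)
COST_STUDENT = 0.002    # $ per query, fine-tuned Llama 8B (from the table)
FINETUNE_COST = 2_500   # $ one-off: data prep + GPU time + evaluation (assumption)

daily_saving = QUERIES_PER_DAY * (COST_TEACHER - COST_STUDENT)  # ~$28/day
break_even_days = FINETUNE_COST / daily_saving                  # ~90 days
```

At 1,000 queries/day this lands at roughly three months; higher volume shortens the payback proportionally.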
LoRA & QLoRA fine-tuning
What is LoRA
Low-Rank Adaptation (LoRA) is an efficient fine-tuning method that doesn’t change the entire model. Instead, it adds small adaptation matrices (typically 0.1-1% of parameters) to existing layers. Benefits:
- Fast training — hours instead of days
- Low requirements — 1 GPU is enough (24GB VRAM for 7B model with QLoRA)
- Modularity — you can have multiple LoRA adapters for different use-cases
- Safety — base model remains unchanged, adapter is a small file
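The update rule behind LoRA fits in a few lines: instead of retraining W, a low-rank pair (A, B) is learned and the forward pass computes y = (W + (alpha/r)·B·A)·x. A dependency-free sketch on toy matrices (real code would use torch + peft):

```python
def lora_forward(W, A, B, x, alpha=16, r=2):
    # Base path: W @ x
    base = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    # Adapter path: (alpha / r) * B @ (A @ x)
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    BAx = [sum(b * h for b, h in zip(row, Ax)) for row in B]
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, BAx)]

# Toy shapes: W is 2x2, A is r x 2, B is 2 x r, with rank r = 2.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.1, 0.0], [0.0, 0.1]]
B = [[0.0, 0.0], [0.0, 0.0]]   # B starts at zero, so the adapter is a no-op
y = lora_forward(W, A, B, [1.0, 2.0])  # equals W @ x = [1.0, 2.0]
```

Initializing B to zero is the standard trick: training starts exactly from the base model's behavior, and only the small (A, B) pair (the adapter file) ever changes.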
Our training pipeline
- Data collection — real production data + synthetic from teacher model
- Data cleaning — deduplication, quality filtering, format standardization
- Hyperparameter search — rank, alpha, learning rate, epochs
- Training — LoRA/QLoRA, gradient checkpointing, mixed precision
- Evaluation — against baseline (teacher + base model), golden dataset
- Iteration — if quality doesn’t reach threshold, iterate (more data, different hyperparameters)
- Merge & deploy — merge LoRA into base model, quantize, deploy
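Step 5 (evaluation against the teacher) can start as simply as measuring agreement on a golden set. Exact match after normalization is the crudest possible metric; real pipelines use task-specific scoring or an LLM judge, and the names below are illustrative:

```python
def agreement(student_outputs, teacher_outputs):
    """Fraction of examples where student and teacher agree exactly,
    after whitespace and case normalization."""
    norm = lambda s: " ".join(s.lower().split())
    hits = sum(norm(s) == norm(t)
               for s, t in zip(student_outputs, teacher_outputs))
    return hits / len(teacher_outputs)

teacher = ["Paris", "42", "I don't know"]
student = ["paris", "41", "I don't know"]
score = agreement(student, teacher)  # 2 of 3 agree
```

A threshold on this score (or a richer metric) is what drives the iterate-or-ship decision in step 6.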
Training data quality
Quality > quantity. Our rules:
- Diversity — cover the full range of use-cases, not just the easy ones
- Edge cases — explicitly include difficult examples
- Negative examples — cases where the correct answer is “I don’t know” or escalation
- Consistency — same style, format, and level of detail throughout
- Validation — a domain expert validates 100% of the training data
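The deduplication and quality-filter rules might look like this in code; the normalization scheme and the minimum-length threshold are illustrative choices:

```python
import hashlib

def clean(examples, min_answer_len=10):
    """Drop near-verbatim duplicate prompts and trivially short answers."""
    seen, kept = set(), []
    for ex in examples:
        # Normalize whitespace/case before hashing so trivial rewordings collide.
        key = hashlib.sha256(
            " ".join(ex["prompt"].lower().split()).encode()
        ).hexdigest()
        if key in seen:
            continue                          # duplicate prompt
        if len(ex["answer"]) < min_answer_len:
            continue                          # too short to teach anything
        seen.add(key)
        kept.append(ex)
    return kept

data = [
    {"prompt": "What is LoRA?", "answer": "A low-rank fine-tuning method."},
    {"prompt": "what is  lora?", "answer": "Same question, different casing."},
    {"prompt": "Explain QLoRA", "answer": "short"},
]
cleaned = clean(data)  # only the first example survives
```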
Inference optimization
Fine-tuning is half the story. The other half is serving the model efficiently in production.
Quantization
Reducing computation precision from FP16 to INT8 or INT4:
| Method | Size (7B model) | Quality | Speed |
|---|---|---|---|
| FP16 | 14 GB | 100% (baseline) | 1x |
| INT8 (GPTQ) | 7 GB | 99.5% | 1.3x |
| INT4 (AWQ) | 3.5 GB | 98% | 1.8x |
| INT4 (GGUF) | 4 GB | 97% | 1.5x (CPU!) |
For most production use-cases, INT8 is the sweet spot — minimal quality loss, significant VRAM reduction and speedup.
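The core idea of symmetric INT8 quantization, shown on a toy weight vector (real methods such as GPTQ and AWQ add calibration data and per-group scales on top of this):

```python
def quantize_int8(weights):
    """Map floats to int8 via a single shared scale (symmetric quantization)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.5, -1.27, 0.031, 1.27]
q, scale = quantize_int8(w)    # each weight now fits in 1 byte instead of 2 (FP16)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))  # small rounding error
```

Halving (INT8) or quartering (INT4) the bytes per weight is where the VRAM savings in the table come from; the rounding error is the source of the small quality loss.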
Batching & KV-cache
Continuous batching: processing multiple requests simultaneously. Frameworks such as vLLM or TGI batch incoming requests automatically and share the KV-cache between them, giving 3-5x higher throughput than naive sequential processing.
KV-cache optimization: PagedAttention (vLLM) manages KV-cache memory efficiently, eliminating fragmentation and enabling larger batch sizes.
Speculative decoding
A small draft model (e.g. Llama 1B) generates candidate tokens and the large model verifies them in a single forward pass. 2-3x speedup on tasks with predictable output (structured text, code).
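A toy version of the accept/reject loop, with both models stubbed as deterministic string lookups; only the verification logic is the real mechanism:

```python
TARGET = "hello world"   # what the (stubbed) large model would generate

def draft_model(prefix, k=4):
    # Hypothetical cheap draft: often right, diverges partway through.
    guess = "hello planet"
    return list(guess[len(prefix):len(prefix) + k])

def target_model_next(prefix):
    # Hypothetical large model, one token (character) per "forward pass".
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else None

def speculative_decode(max_len=11):
    out = ""
    while len(out) < max_len:
        proposed = draft_model(out)
        if not proposed:
            break
        # One verification pass checks all k draft tokens against the target.
        for tok in proposed:
            nxt = target_model_next(out)
            if nxt is None:
                break
            if tok == nxt:
                out += tok       # draft token accepted for free
            else:
                out += nxt       # rejected: fall back to the target's token
                break
    return out

result = speculative_decode()
```

When the draft is right (the common case for predictable output), several tokens are accepted per verification pass, which is where the 2-3x speedup comes from.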
Inference stack for production
┌──────────────────────────────────────────────────┐
│ LOAD BALANCER (nginx, envoy) │
│ │ │
│ ▼ │
│ API GATEWAY (rate limiting, auth, routing) │
│ │ │
│ ▼ │
│ INFERENCE SERVER (vLLM / TGI / Triton) │
│ - Continuous batching │
│ - KV-cache management │
│ - Quantized model │
│ - GPU autoscaling │
│ │ │
│ ▼ │
│ MONITORING (latency, throughput, GPU util) │
└──────────────────────────────────────────────────┘
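A hypothetical single-node launch of the inference-server layer with vLLM; the model name and flag values are placeholders, and flag names can differ between vLLM versions (check `vllm serve --help` for your install):

```shell
# Quantized weights (AWQ), continuous-batching cap, KV-cache headroom:
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization awq \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.90
```

The load balancer, gateway, and monitoring layers from the diagram sit in front of and around this process.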
Model selection guide
| Use-case | Recommended model | Why |
|---|---|---|
| General assistant | GPT-4 / Claude | Highest quality, no fine-tuning needed |
| Domain QA (high volume) | Fine-tuned Llama 8B | Low costs, low latency |
| Code generation | Fine-tuned CodeLlama / DeepSeek | Code specialization |
| Document extraction | Fine-tuned Mistral 7B | Structured output, consistency |
| Embedding/retrieval | Nomic / BGE / domain-tuned | Retrieval quality on your domain |
| On-premise (regulations) | Llama / Mistral + LoRA | No data to cloud |
Optimization process
Phase 1: Analysis (1 week)
- Audit of the existing AI system (models, costs, quality, latency)
- Identification of optimization opportunities
- Cost-benefit analysis of fine-tuning vs. prompt engineering vs. a model swap
Phase 2: Experimentation (2-3 weeks)
- Training data collection & preparation
- Fine-tuning experiments (3-5 configurations)
- Evaluation against baselines
- Inference optimization (quantization, batching)
Phase 3: Production (1-2 weeks)
- Deploy optimized model (shadow mode → A/B test → full rollout)
- Monitoring setup
- Performance validation on production traffic
Phase 4: Iteration (ongoing)
- Continuous feedback data collection
- Periodic re-training (quarterly or on drift)
- Model upgrade evaluation (new base models)
Frequently asked questions
When does fine-tuning make sense instead of prompt engineering?
Prompt engineering is always the first step — it's faster and cheaper. Fine-tuning pays off when: a specific domain requires consistent behavior, you need to reduce latency/costs (smaller model), or you must run on-premise (regulations). We analyze your use-case and recommend the optimal approach.
How much training data do we need?
For LoRA fine-tuning, typically 500-5,000 quality examples. For knowledge distillation we generate synthetic data from a large model — you just define the domain and use-cases. Quality > quantity — 500 perfect examples beat 5,000 average ones.
Can you deploy on our own infrastructure?
Yes, that's one of the main use-cases. We fine-tune Llama, Mistral, or Qwen on your data, optimize it (quantization, KV-cache), and deploy on your infrastructure. No data leaves your environment.
How do you verify that the fine-tuned model is actually better?
Rigorous evaluation: a golden dataset (200+ pairs), an A/B test against the baseline, and regression tests on general capabilities. The fine-tuned model must be better on your domain and can't degrade on general tasks by more than 5%.