Fine-tuning & optimization
The right model for the right task.
Fine-tuning, knowledge distillation, inference optimization. Because using GPT-4 for every query is like driving a truck to buy bread.
When to fine-tune
Fine-tuning isn’t always the right answer. Most problems can be solved with better prompts, better context (RAG), or better orchestration. Fine-tuning makes sense in specific situations:
Decision tree
AI quality problem?
│
├── Missing knowledge → RAG (add context)
│
├── Wrong output format → Prompt engineering
│
├── Inconsistent behavior → Few-shot examples in prompt
│
├── Still insufficient? → Fine-tuning
│
├── Need to reduce costs → Distillation (large → small model)
│
└── Regulations (on-premise) → Fine-tune open-source + deploy locally
Specific indications for fine-tuning
- Domain-specific language — the model doesn’t understand your terminology even with context (medicine, law, other specialized fields)
- Consistent format — you always need the same output structure (JSON schema, table, a specific template)
- Costs — GPT-4 for 10,000 queries daily is expensive; a fine-tuned Llama 8B does the job for a fraction of the cost
- Latency — large model = slow; small fine-tuned model = fast (<200ms)
- Data residency — data cannot leave your environment; on-premise = open-source + fine-tuning
Knowledge distillation
The most common form of fine-tuning in practice: a large model (GPT-4, Claude) teaches a small model (Llama 8B, Mistral 7B) to do a specific task.
How it works
┌──────────────────────────────────────────────────┐
│ TEACHER MODEL (GPT-4, Claude) │
│ - Large, expensive, slow │
│ - Excellent quality │
│ - Generates training data for student │
└──────────────────┬───────────────────────────────┘
│ Synthetic training data
│ (1000-5000 examples)
▼
┌──────────────────────────────────────────────────┐
│ FINE-TUNING PIPELINE │
│ - Data cleaning & dedup │
│ - LoRA/QLoRA fine-tuning │
│ - Evaluation against teacher │
│ - Iteration (3-5 cycles) │
└──────────────────┬───────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ STUDENT MODEL (Llama 8B, Mistral 7B) │
│ - Small, cheap, fast │
│ - 85-95% teacher quality on target domain │
│ - Production deployment (API or on-premise) │
└──────────────────────────────────────────────────┘
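The data-generation step above can be sketched in a few lines. `call_teacher` is a hypothetical stand-in for a real teacher API call (stubbed here so the pipeline logic runs end-to-end); the JSONL chat format is the one most fine-tuning tooling expects:

```python
import json

def call_teacher(prompt: str) -> str:
    # Hypothetical teacher call; in practice, an API client for GPT-4/Claude.
    return f"Answer for: {prompt}"

def build_distillation_set(prompts, out_path="train.jsonl"):
    """Query the teacher once per prompt and store (prompt, answer)
    pairs as JSONL chat records."""
    records = []
    for prompt in prompts:
        records.append({
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": call_teacher(prompt)},
            ]
        })
    with open(out_path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return records

data = build_distillation_set(["What is LoRA?", "Explain KV-cache."])
```

Real pipelines add retries, cost tracking, and sampling-temperature sweeps on top of this loop; the data then flows into the cleaning and fine-tuning stages shown above.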
Results in practice
| Metric | GPT-4 (teacher) | Fine-tuned Llama 8B | Difference |
|---|---|---|---|
| Quality (domain-specific) | 95% | 89% | -6% |
| Latency P95 | 2.5s | 180ms | 14x faster |
| Cost per query | $0.03 | $0.002 | 15x cheaper |
| Data residency | Cloud (US/EU) | On-premise | ✅ |
Typical break-even: with 1000+ queries/day, fine-tuning pays for itself in 2-3 months.
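The break-even claim can be sanity-checked with back-of-the-envelope arithmetic using the per-query costs from the table; the one-off fine-tuning cost below is an assumed illustrative figure, not a quote:

```python
QUERIES_PER_DAY = 1_000
COST_TEACHER = 0.03     # $ per query, GPT-4 class (from the table)
COST_STUDENT = 0.002    # $ per query, fine-tuned Llama 8B (from the table)
FINETUNE_COST = 2_500   # $ one-off: data prep + GPU time + evaluation (assumption)

daily_saving = QUERIES_PER_DAY * (COST_TEACHER - COST_STUDENT)  # ~$28/day
break_even_days = FINETUNE_COST / daily_saving                  # ~90 days
```

At 1,000 queries/day this lands at roughly three months; higher volume shortens the payback proportionally.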
LoRA & QLoRA fine-tuning
What is LoRA
Low-Rank Adaptation (LoRA) is an efficient fine-tuning method that doesn’t change the entire model. Instead, it adds small adaptation matrices (typically 0.1-1% of parameters) to existing layers. Benefits:
- Fast training — hours instead of days
- Low requirements — 1 GPU is enough (24GB VRAM for 7B model with QLoRA)
- Modularity — you can have multiple LoRA adapters for different use-cases
- Safety — base model remains unchanged, adapter is a small file
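The update rule behind LoRA fits in a few lines: instead of retraining W, a low-rank pair (A, B) is learned and the forward pass computes y = (W + (alpha/r)·B·A)·x. A dependency-free sketch on toy matrices (real code would use torch + peft):

```python
def lora_forward(W, A, B, x, alpha=16, r=2):
    # Base path: W @ x
    base = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    # Adapter path: (alpha / r) * B @ (A @ x)
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    BAx = [sum(b * h for b, h in zip(row, Ax)) for row in B]
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, BAx)]

# Toy shapes: W is 2x2, A is r x 2, B is 2 x r, with rank r = 2.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.1, 0.0], [0.0, 0.1]]
B = [[0.0, 0.0], [0.0, 0.0]]   # B starts at zero, so the adapter is a no-op
y = lora_forward(W, A, B, [1.0, 2.0])  # equals W @ x = [1.0, 2.0]
```

Initializing B to zero is the standard trick: training starts exactly from the base model's behavior, and only the small (A, B) pair (the adapter file) ever changes.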
Our training pipeline
- Data collection — real production data + synthetic from teacher model
- Data cleaning — deduplication, quality filtering, format standardization
- Hyperparameter search — rank, alpha, learning rate, epochs
- Training — LoRA/QLoRA, gradient checkpointing, mixed precision
- Evaluation — against baseline (teacher + base model), golden dataset
- Iteration — if quality doesn’t reach threshold, iterate (more data, different hyperparameters)
- Merge & deploy — merge LoRA into base model, quantize, deploy
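Step 5 (evaluation against the teacher) can start as simply as measuring agreement on a golden set. Exact match after normalization is the crudest possible metric; real pipelines use task-specific scoring or an LLM judge, and the names below are illustrative:

```python
def agreement(student_outputs, teacher_outputs):
    """Fraction of examples where student and teacher agree exactly,
    after whitespace and case normalization."""
    norm = lambda s: " ".join(s.lower().split())
    hits = sum(norm(s) == norm(t)
               for s, t in zip(student_outputs, teacher_outputs))
    return hits / len(teacher_outputs)

teacher = ["Paris", "42", "I don't know"]
student = ["paris", "41", "I don't know"]
score = agreement(student, teacher)  # 2 of 3 agree
```

A threshold on this score (or a richer metric) is what drives the iterate-or-ship decision in step 6.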
Training data quality
Quality > quantity. Our rules:
- Diversity — cover the full range of use-cases, not just the easy ones
- Edge cases — explicitly include difficult examples
- Negative examples — cases where the correct answer is “I don’t know” or escalation
- Consistency — same style, format, and level of detail throughout
- Validation — a domain expert validates 100% of the training data
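The deduplication and quality-filter rules might look like this in code; the normalization scheme and the minimum-length threshold are illustrative choices:

```python
import hashlib

def clean(examples, min_answer_len=10):
    """Drop near-verbatim duplicate prompts and trivially short answers."""
    seen, kept = set(), []
    for ex in examples:
        # Normalize whitespace/case before hashing so trivial rewordings collide.
        key = hashlib.sha256(
            " ".join(ex["prompt"].lower().split()).encode()
        ).hexdigest()
        if key in seen:
            continue                          # duplicate prompt
        if len(ex["answer"]) < min_answer_len:
            continue                          # too short to teach anything
        seen.add(key)
        kept.append(ex)
    return kept

data = [
    {"prompt": "What is LoRA?", "answer": "A low-rank fine-tuning method."},
    {"prompt": "what is  lora?", "answer": "Same question, different casing."},
    {"prompt": "Explain QLoRA", "answer": "short"},
]
cleaned = clean(data)  # only the first example survives
```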
Inference optimization
Fine-tuning is half the story. The other half is serving the model efficiently in production.
Quantization
Reducing computation precision from FP16 to INT8 or INT4:
| Method | Size (7B model) | Quality | Speed |
|---|---|---|---|
| FP16 | 14 GB | 100% (baseline) | 1x |
| INT8 (GPTQ) | 7 GB | 99.5% | 1.3x |
| INT4 (AWQ) | 3.5 GB | 98% | 1.8x |
| INT4 (GGUF) | 4 GB | 97% | 1.5x (CPU!) |
For most production use-cases, INT8 is the sweet spot — minimal quality loss, significant VRAM reduction and speedup.
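The core idea of symmetric INT8 quantization, shown on a toy weight vector (real methods such as GPTQ and AWQ add calibration data and per-group scales on top of this):

```python
def quantize_int8(weights):
    """Map floats to int8 via a single shared scale (symmetric quantization)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.5, -1.27, 0.031, 1.27]
q, scale = quantize_int8(w)    # each weight now fits in 1 byte instead of 2 (FP16)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))  # small rounding error
```

Halving (INT8) or quartering (INT4) the bytes per weight is where the VRAM savings in the table come from; the rounding error is the source of the small quality loss.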
Batching & KV-cache
Continuous batching: processing multiple requests simultaneously. Frameworks such as vLLM or TGI batch incoming requests automatically and share the KV-cache between them, giving 3-5x higher throughput than naive sequential processing.
KV-cache optimization: PagedAttention (vLLM) manages KV-cache memory efficiently, eliminating fragmentation and enabling larger batch sizes.
Speculative decoding
A small draft model (e.g. Llama 1B) generates candidate tokens and the large model verifies them in a single forward pass. 2-3x speedup on tasks with predictable output (structured text, code).
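A toy version of the accept/reject loop, with both models stubbed as deterministic string lookups; only the verification logic is the real mechanism:

```python
TARGET = "hello world"   # what the (stubbed) large model would generate

def draft_model(prefix, k=4):
    # Hypothetical cheap draft: often right, diverges partway through.
    guess = "hello planet"
    return list(guess[len(prefix):len(prefix) + k])

def target_model_next(prefix):
    # Hypothetical large model, one token (character) per "forward pass".
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else None

def speculative_decode(max_len=11):
    out = ""
    while len(out) < max_len:
        proposed = draft_model(out)
        if not proposed:
            break
        # One verification pass checks all k draft tokens against the target.
        for tok in proposed:
            nxt = target_model_next(out)
            if nxt is None:
                break
            if tok == nxt:
                out += tok       # draft token accepted for free
            else:
                out += nxt       # rejected: fall back to the target's token
                break
    return out

result = speculative_decode()
```

When the draft is right (the common case for predictable output), several tokens are accepted per verification pass, which is where the 2-3x speedup comes from.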
Inference stack for production
┌──────────────────────────────────────────────────┐
│ LOAD BALANCER (nginx, envoy) │
│ │ │
│ ▼ │
│ API GATEWAY (rate limiting, auth, routing) │
│ │ │
│ ▼ │
│ INFERENCE SERVER (vLLM / TGI / Triton) │
│ - Continuous batching │
│ - KV-cache management │
│ - Quantized model │
│ - GPU autoscaling │
│ │ │
│ ▼ │
│ MONITORING (latency, throughput, GPU util) │
└──────────────────────────────────────────────────┘
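A hypothetical single-node launch of the inference-server layer with vLLM; the model name and flag values are placeholders, and flag names can differ between vLLM versions (check `vllm serve --help` for your install):

```shell
# Quantized weights (AWQ), continuous-batching cap, KV-cache headroom:
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization awq \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.90
```

The load balancer, gateway, and monitoring layers from the diagram sit in front of and around this process.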
Model selection guide
| Use-case | Recommended model | Why |
|---|---|---|
| General assistant | GPT-4 / Claude | Highest quality, no fine-tuning needed |
| Domain QA (high volume) | Fine-tuned Llama 8B | Low costs, low latency |
| Code generation | Fine-tuned CodeLlama / DeepSeek | Code specialization |
| Document extraction | Fine-tuned Mistral 7B | Structured output, consistency |
| Embedding/retrieval | Nomic / BGE / domain-tuned | Retrieval quality on your domain |
| On-premise (regulations) | Llama / Mistral + LoRA | No data to cloud |
Optimization process
Phase 1: Analysis (1 week)
- Audit of the existing AI system (models, costs, quality, latency)
- Identification of optimization opportunities
- Cost-benefit analysis of fine-tuning vs. prompt engineering vs. a model swap
Phase 2: Experimentation (2-3 weeks)
- Training data collection & preparation
- Fine-tuning experiments (3-5 configurations)
- Evaluation against baselines
- Inference optimization (quantization, batching)
Phase 3: Production (1-2 weeks)
- Deploy optimized model (shadow mode → A/B test → full rollout)
- Monitoring setup
- Performance validation on production traffic
Phase 4: Iteration (ongoing)
- Continuous feedback data collection
- Periodic re-training (quarterly or on drift)
- Model upgrade evaluation (new base models)
Frequently asked questions
When does fine-tuning make sense instead of prompt engineering?
Prompt engineering is always the first step — it's faster and cheaper. Fine-tuning pays off when: a specific domain requires consistent behavior, you need to reduce latency/costs (smaller model), or you must run on-premise (regulations). We analyze your use-case and recommend the optimal approach.
How much training data do we need?
For LoRA fine-tuning, typically 500-5,000 quality examples. For knowledge distillation we generate synthetic data from a large model — you just define the domain and use-cases. Quality > quantity — 500 perfect examples beat 5,000 average ones.
Can you deploy on our own infrastructure?
Yes, that's one of the main use-cases. We fine-tune Llama, Mistral, or Qwen on your data, optimize it (quantization, KV-cache), and deploy on your infrastructure. No data leaves your environment.
How do you verify that the fine-tuned model is actually better?
Rigorous evaluation: a golden dataset (200+ pairs), an A/B test against the baseline, and regression tests on general capabilities. The fine-tuned model must be better on your domain and can't degrade on general tasks by more than 5%.