
Fine-tuning & optimization

The right model for the right task.

Fine-tuning, knowledge distillation, inference optimization. Because using GPT-4 for every query is like driving a truck to buy bread.

-70%
Costs after optimization
5x
Lower latency after optimization
85-95%
Quality vs. baseline
<6 months
Typical ROI

When to fine-tune

Fine-tuning isn’t always the right answer. Most problems can be solved with better prompts, better context (RAG), or better orchestration. Fine-tuning makes sense in specific situations:

Decision tree

AI quality problem?
    │
    ├── Missing knowledge → RAG (add context)
    │
    ├── Wrong output format → Prompt engineering
    │
    ├── Inconsistent behavior → Few-shot examples in prompt
    │
    ├── Still insufficient? → Fine-tuning
    │
    ├── Need to reduce costs → Distillation (large → small model)
    │
    └── Regulations (on-premise) → Fine-tune open-source + deploy locally

Specific indications for fine-tuning

  1. Domain-specific language — the model doesn't understand your terminology even with context (medicine, law, niche industries)
  2. Consistent format — you always need the same output structure (JSON schema, table, specific template)
  3. Costs — GPT-4 at 10,000 queries per day is expensive; a fine-tuned Llama 8B runs at a fraction of the cost.
  4. Latency — a large model is slow; a small fine-tuned model is fast (<200 ms)
  5. Data residency — data cannot leave your environment; on-premise means open-source + fine-tuning.

Knowledge distillation

The most common form of fine-tuning in practice: a large model (GPT-4, Claude) teaches a small model (Llama 8B, Mistral 7B) to do a specific task.

How it works

┌──────────────────────────────────────────────────┐
│  TEACHER MODEL (GPT-4, Claude)                    │
│  - Large, expensive, slow                         │
│  - Excellent quality                              │
│  - Generates training data for student            │
└──────────────────┬───────────────────────────────┘
                   │ Synthetic training data
                   │ (1000-5000 examples)
                   ▼
┌──────────────────────────────────────────────────┐
│  FINE-TUNING PIPELINE                             │
│  - Data cleaning & dedup                          │
│  - LoRA/QLoRA fine-tuning                        │
│  - Evaluation against teacher                     │
│  - Iteration (3-5 cycles)                        │
└──────────────────┬───────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────┐
│  STUDENT MODEL (Llama 8B, Mistral 7B)            │
│  - Small, cheap, fast                            │
│  - 85-95% teacher quality on target domain       │
│  - Production deployment (API or on-premise)      │
└──────────────────────────────────────────────────┘
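The "synthetic training data" hand-off in the diagram above can be sketched as collecting teacher outputs into a chat-style JSONL file, a format most LoRA trainers accept. The teacher call is stubbed out here; in practice it would be an API call to GPT-4 or Claude, and the function and file names are illustrative.

```python
import json

def teacher_answer(prompt: str) -> str:
    # Stub: in practice, call the teacher model's API (GPT-4, Claude)
    # and return its completion for the given prompt.
    return f"Expert answer for: {prompt}"

def build_distillation_set(prompts, path="train.jsonl"):
    """Write teacher-generated (prompt, answer) pairs as chat-style JSONL."""
    records = []
    for prompt in prompts:
        records.append({
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": teacher_answer(prompt)},
            ]
        })
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return records

data = build_distillation_set(["What does clause 12 of the policy cover?"])
print(len(data))  # 1
```

The same loop scales to the 1000-5000 examples mentioned above; the expensive part is prompt design and expert validation, not the generation itself.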

Results in practice

Metric                      GPT-4 (teacher)   Fine-tuned Llama 8B   Difference
Quality (domain-specific)   95%               89%                   -6%
Latency P95                 2.5 s             180 ms                14x faster
Cost per query              $0.03             $0.002                15x cheaper
Data residency              Cloud (US/EU)     On-premise            n/a

Typical break-even: with 1000+ queries/day, fine-tuning pays for itself in 2-3 months.
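The break-even claim can be sanity-checked with simple arithmetic. The per-query prices come from the table above; the one-off project cost is an illustrative assumption, not a quote.

```python
teacher_cost = 0.03      # $ per query, GPT-4 (from the table above)
student_cost = 0.002     # $ per query, fine-tuned Llama 8B
queries_per_day = 1000   # the break-even volume quoted above

# Daily savings from routing traffic to the student model.
daily_savings = queries_per_day * (teacher_cost - student_cost)

# Assumed one-off cost of the fine-tuning project (illustrative figure).
project_cost = 2500.0
break_even_days = project_cost / daily_savings

print(f"${daily_savings:.0f}/day saved, break-even in {break_even_days:.0f} days")
# prints: $28/day saved, break-even in 89 days (roughly 3 months)
```

At higher volumes the picture improves linearly: 10,000 queries/day saves ~$280/day and breaks even in under two weeks on the same assumptions.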

LoRA & QLoRA fine-tuning

What is LoRA

Low-Rank Adaptation (LoRA) is an efficient fine-tuning method that doesn’t change the entire model. Instead, it adds small adaptation matrices (typically 0.1-1% of parameters) to existing layers. Benefits:

  • Fast training — hours instead of days
  • Low requirements — 1 GPU is enough (24GB VRAM for 7B model with QLoRA)
  • Modularity — you can have multiple LoRA adapters for different use-cases
  • Safety — base model remains unchanged, adapter is a small file

Our training pipeline

  1. Data collection — real production data + synthetic from teacher model
  2. Data cleaning — deduplication, quality filtering, format standardization
  3. Hyperparameter search — rank, alpha, learning rate, epochs
  4. Training — LoRA/QLoRA, gradient checkpointing, mixed precision
  5. Evaluation — against baseline (teacher + base model), golden dataset
  6. Iteration — if quality doesn’t reach threshold, iterate (more data, different hyperparameters)
  7. Merge & deploy — merge LoRA into base model, quantize, deploy
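Step 2 (data cleaning) can be sketched as exact plus near-duplicate filtering. The token-overlap metric and the 0.9 threshold are illustrative choices, not a prescription; production pipelines often use embedding or MinHash similarity instead.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def dedup(examples, threshold=0.9):
    """Drop exact and near-duplicate prompts (greedy, O(n^2) -
    fine for training sets under ~10k rows)."""
    kept = []
    for ex in examples:
        if all(jaccard(ex["prompt"], k["prompt"]) < threshold for k in kept):
            kept.append(ex)
    return kept

rows = [
    {"prompt": "What does clause 12 cover?"},
    {"prompt": "what does Clause 12 cover?"},   # near-duplicate, dropped
    {"prompt": "How is the premium calculated?"},
]
print(len(dedup(rows)))  # 2
```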

Training data quality

Quality > quantity. Our rules:

  • Diversity — cover full range of use-cases, not just easy cases
  • Edge cases — explicitly include difficult examples
  • Negative examples — examples where correct answer is “I don’t know” or escalation
  • Consistency — same style, format, level of detail
  • Validation — domain expert validates 100% of training data

Inference optimization

Fine-tuning is half the story. The other half is how to efficiently serve the model in production.

Quantization

Reducing computation precision from FP16 to INT8 or INT4:

Method        Size (7B model)   Quality           Speed
FP16          14 GB             100% (baseline)   1x
INT8 (GPTQ)   7 GB              99.5%             1.3x
INT4 (AWQ)    3.5 GB            98%               1.8x
INT4 (GGUF)   4 GB              97%               1.5x (runs on CPU)

For most production use-cases, INT8 is the sweet spot — minimal quality loss, significant VRAM reduction and speedup.
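The sizes in the table follow directly from parameter count times bytes per weight; this sketch ignores the small overhead of quantization metadata (scales, zero-points) and activation memory.

```python
def model_size_gb(params: float, bits: int) -> float:
    """Approximate weight memory: params * (bits / 8) bytes.
    Ignores quantization scales/zero-points and activations."""
    return params * bits / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_gb(7e9, bits):.1f} GB")
# prints: FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB - matching the table
```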

Batching & KV-cache

Continuous batching: processing multiple requests simultaneously. The serving framework (vLLM, TGI) automatically batches incoming requests and manages a shared KV-cache memory pool across them. Throughput is 3-5x higher than naive sequential processing.

KV-cache optimization: PagedAttention (vLLM) efficiently manages memory for KV-cache. Eliminates fragmentation, enables higher batch size.
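Why KV-cache management matters: its memory grows linearly with both sequence length and batch size. A rough estimate for a Llama-7B-like configuration (32 layers, 32 KV heads, head dim 128, FP16 cache; assumed shapes):

```python
def kv_cache_gb(batch, seq_len, layers=32, heads=32, head_dim=128, bytes_per=2):
    """KV-cache size: 2 (K and V) * layers * heads * head_dim * seq_len * batch."""
    return 2 * layers * heads * head_dim * seq_len * batch * bytes_per / 1e9

print(f"{kv_cache_gb(1, 4096):.2f} GB per request at 4k context")
print(f"{kv_cache_gb(16, 4096):.1f} GB for a batch of 16")
# prints: 2.15 GB per request at 4k context / 34.4 GB for a batch of 16
```

With naive contiguous allocation, each request reserves space for its maximum context and most of it sits empty; PagedAttention allocates the cache in small blocks on demand, which is what makes large batch sizes fit.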

Speculative decoding

Small draft model (e.g. Llama 1B) generates candidate tokens, large model verifies them in one forward pass. 2-3x speedup for tasks with predictable output (structured text, code).
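The mechanism can be illustrated with a toy accept/verify loop over characters. Real implementations compare token probabilities and verify all k proposals in a single batched forward pass; this sketch assumes greedy decoding on both models and calls the target once per token, so it only shows the control flow, not the speedup.

```python
def speculative_decode(draft_next, target_next, prompt, max_new=8, k=4):
    """Toy greedy speculative decoding: the draft proposes k tokens,
    the target verifies them in order; the first mismatch is replaced
    by the target's own token and the loop continues from there."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        proposals, ctx = [], list(out)
        for _ in range(k):                      # draft proposes k tokens
            tok = draft_next(ctx)
            proposals.append(tok)
            ctx.append(tok)
        for tok in proposals:                   # target verifies in order
            expected = target_next(out)
            if tok == expected:
                out.append(tok)                 # accepted "for free"
            else:
                out.append(expected)            # mismatch: take target token
                break
            if len(out) - len(prompt) >= max_new:
                break
    return out

# Draft agrees with the target everywhere except the fourth token.
target = lambda ctx: "abcdefgh"[len(ctx)]
draft  = lambda ctx: "abcXefgh"[len(ctx)]
print("".join(speculative_decode(draft, target, [], max_new=8)))  # abcdefgh
```

Output always matches what the target alone would produce; the draft only decides how many tokens get verified per target pass, which is where the 2-3x speedup comes from on predictable text.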

Inference stack for production

┌──────────────────────────────────────────────────┐
│  LOAD BALANCER (nginx, envoy)                     │
│       │                                           │
│       ▼                                           │
│  API GATEWAY (rate limiting, auth, routing)       │
│       │                                           │
│       ▼                                           │
│  INFERENCE SERVER (vLLM / TGI / Triton)          │
│  - Continuous batching                            │
│  - KV-cache management                           │
│  - Quantized model                               │
│  - GPU autoscaling                               │
│       │                                           │
│       ▼                                           │
│  MONITORING (latency, throughput, GPU util)       │
└──────────────────────────────────────────────────┘
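A minimal sketch of bringing up the inference-server layer of the stack above with vLLM's OpenAI-compatible server. Continuous batching and PagedAttention are enabled by default; the model path, quantization method, and context length are placeholders to adapt, and flag availability can vary between vLLM versions.

```shell
# Serve a merged, AWQ-quantized fine-tuned model (local checkpoint path
# is a placeholder). Clients then talk to it via the OpenAI API format.
python -m vllm.entrypoints.openai.api_server \
  --model ./llama-8b-domain-merged \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90
```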

Model selection guide

Use-case                   Recommended model                 Why
General assistant          GPT-4 / Claude                    Highest quality, no fine-tuning needed
Domain QA (high volume)    Fine-tuned Llama 8B               Low costs, low latency
Code generation            Fine-tuned CodeLlama / DeepSeek   Code specialization
Document extraction        Fine-tuned Mistral 7B             Structured output, consistency
Embedding/retrieval        Nomic / BGE / domain-tuned        Retrieval quality on your domain
On-premise (regulations)   Llama / Mistral + LoRA            No data sent to the cloud

Optimization process

Phase 1: Analysis (1 week)

  • Audit existing AI system (models, costs, quality, latency)
  • Identify optimization opportunities
  • Cost-benefit analysis of fine-tuning vs. prompt engineering vs. model swap

Phase 2: Experimentation (2-3 weeks)

  • Training data collection & preparation
  • Fine-tuning experiments (3-5 configurations)
  • Evaluation against baselines
  • Inference optimization (quantization, batching)

Phase 3: Production (1-2 weeks)

  • Deploy optimized model (shadow mode → A/B test → full rollout)
  • Monitoring setup
  • Performance validation on production traffic

Phase 4: Iteration (ongoing)

  • Continuous feedback data collection
  • Periodic re-training (quarterly or on drift)
  • Model upgrade evaluation (new base models)

Frequently asked questions

When does fine-tuning make sense instead of prompt engineering?

Prompt engineering is always the first step — it's faster and cheaper. Fine-tuning pays off when a specific domain requires consistent behavior, when you need to reduce latency/costs (smaller model), or when you must run on-premise (regulations). We analyze your use-case and recommend the optimal approach.

How much training data do we need?

For LoRA fine-tuning, typically 500-5,000 quality examples. For knowledge distillation we generate synthetic data from a large model — you just define the domain and use-cases. Quality > quantity: 500 perfect examples beat 5,000 average ones.

Can you deploy fine-tuned models on our own infrastructure?

Yes, that's one of the main use-cases. We fine-tune Llama, Mistral, or Qwen on your data, optimize the result (quantization, KV-cache), and deploy it on your infrastructure. No data leaves your environment.

How do you verify that the fine-tuned model is actually better?

Rigorous evaluation: a golden dataset (200+ pairs), an A/B test against the baseline, and regression tests on general capabilities. The fine-tuned model must be better on your domain and must not degrade on general tasks by more than 5%.

Have a project?

Let's talk about it.

Schedule a meeting