
Ollama vs vLLM

14. 03. 2024 · 1 min read · intermediate

Ollama is the simplest path to local LLMs. vLLM is optimized for production serving.

Ollama

  • Simple installation (curl + ollama run)
  • Model management (pull, list, rm)
  • OpenAI-compatible REST API (see the Python sketch below)
  • Ideal for development and experimentation
  • macOS, Linux, Windows

ollama pull llama3.2
ollama run llama3.2 "Explain Docker"
curl http://localhost:11434/api/generate -d '{"model":"llama3.2","prompt":"Hello"}'
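
The OpenAI-compatible REST API mentioned above means the standard openai Python client can talk to a local Ollama instance. A minimal sketch, assuming Ollama is running on its default port and llama3.2 has already been pulled:

from openai import OpenAI

# Ollama serves an OpenAI-compatible endpoint under /v1;
# the api_key is required by the client but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain Docker in one sentence."}],
)
print(response.choices[0].message.content)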

vLLM

  • PagedAttention — efficient GPU memory management
  • Continuous batching for high throughput (see the batch-inference sketch below)
  • OpenAI-compatible API server
  • Tensor parallelism (multi-GPU)
  • Optimized for production

pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8B-Instruct
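
Besides the API server, vLLM also has an offline batch API, which is where continuous batching pays off: a whole list of prompts is scheduled on the GPU together instead of one request at a time. A minimal sketch, assuming a single GPU with enough memory and the same model name as above:

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

# All prompts in the list are batched continuously rather than run sequentially.
prompts = ["Explain Docker in one sentence.", "What is PagedAttention?"]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)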

Comparison

  • Simplicity: Ollama >> vLLM
  • Throughput: vLLM >> Ollama (2-5×)
  • GPU utilization: vLLM better
  • Model format: Ollama = GGUF, vLLM = HuggingFace
  • CPU inference: Ollama OK, vLLM GPU-only

Ollama for Dev, vLLM for Production

Use Ollama for local development and experimentation, and vLLM for production serving where throughput matters.
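
Because both expose an OpenAI-compatible API, moving from a local Ollama setup to a vLLM deployment can be mostly a configuration change. A sketch of that switch, with hypothetical environment variables (LLM_BASE_URL, LLM_MODEL, LLM_API_KEY) standing in for your own config; Ollama listens on port 11434 by default, vLLM's API server on 8000:

import os
from openai import OpenAI

# Dev:  LLM_BASE_URL=http://localhost:11434/v1, LLM_MODEL=llama3.2 (Ollama)
# Prod: LLM_BASE_URL=http://<your-vllm-host>:8000/v1, LLM_MODEL=meta-llama/Llama-3-8B-Instruct (vLLM)
client = OpenAI(
    base_url=os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1"),
    api_key=os.environ.get("LLM_API_KEY", "not-needed"),
)

reply = client.chat.completions.create(
    model=os.environ.get("LLM_MODEL", "llama3.2"),
    messages=[{"role": "user", "content": "Hello"}],
)
print(reply.choices[0].message.content)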

Tags: ollama, vllm, llm, ai, inference

CORE SYSTEMS team

We build core systems and AI agents that keep operations running. 15 years of experience with enterprise IT.