
Embedding Models: A Comparison for Production

01. 06. 2025 · 4 min read · intermediate

Embedding models are key to modern AI applications, but choosing the right one for production can be complex. We compare the most popular models in terms of performance, speed, cost, and quality for various use cases.

Embedding Models in Production: Practical Comparison

Choosing the right embedding model for production deployment is a critical decision that affects both your RAG system quality and overall costs. In this article, we compare the most used models from the perspective of performance, costs, and practical deployment.

Key Selection Criteria

Before comparing specific models, it’s important to define what we’re evaluating:

  • Embedding quality: MTEB score, ability to capture semantics
  • Inference speed: latency and throughput in production
  • Costs: price per token or time unit
  • Integration: API availability, self-hosting options
  • Multilingual support: support for various languages
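Cost in particular is easy to underestimate, because it scales with corpus size and re-embedding frequency. A back-of-envelope sketch, using a rate like the $0.13 per 1M tokens quoted for text-embedding-3-large below and a hypothetical corpus size:

```python
def embedding_cost_usd(num_documents, avg_tokens_per_doc, price_per_million_tokens):
    """Back-of-envelope cost of embedding a corpus once."""
    total_tokens = num_documents * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens

# Hypothetical corpus: 100k documents, ~500 tokens each,
# at $0.13 per 1M tokens
cost = embedding_cost_usd(100_000, 500, 0.13)
print(f"One full re-embedding: ${cost:.2f}")  # → One full re-embedding: $6.50
```

Run this with your own document counts before committing to a commercial API; the answer often decides between hosted and self-hosted on its own.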

Overview of Main Candidates

OpenAI text-embedding-3-large

Currently among the highest-quality commercial embedding models, with strong performance on the MTEB benchmark (score 64.6).

import openai

client = openai.OpenAI(api_key="your-key")

response = client.embeddings.create(
    model="text-embedding-3-large",
    input=["Payment gateway API documentation", "System technical specification"],
    dimensions=1536  # Can be reduced to save costs
)

embeddings = [data.embedding for data in response.data]
print(f"Vector dimension: {len(embeddings[0])}")

Advantages: Top quality, stable API, adjustable output dimensionality. Disadvantages: Higher costs ($0.13 per 1M tokens), dependency on an external API.
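One practical detail about the `dimensions` parameter: if you instead shorten full-length vectors yourself (for example to test several sizes from one set of embeddings), the truncated vectors must be re-normalized before cosine similarity is meaningful again. A minimal plain-Python sketch of that step:

```python
import math

def truncate_and_renormalize(embedding, dims):
    """Shorten an embedding client-side and restore unit length.

    Truncating a normalized vector breaks its unit norm, so we
    divide by the new norm before using cosine similarity again.
    """
    shortened = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in shortened))
    return [x / norm for x in shortened]

vec = [0.6, 0.8, 0.0, 0.0]  # toy 4-dim "embedding"
short = truncate_and_renormalize(vec, 2)
print(round(sum(x * x for x in short), 6))  # → 1.0
```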

Sentence-BERT Models

Open-source alternative with self-hosting capability. The all-MiniLM-L6-v2 model offers a good performance/speed ratio.

from sentence_transformers import SentenceTransformer
import numpy as np

# Local model - one-time download
model = SentenceTransformer('all-MiniLM-L6-v2')

texts = [
    "Microservices architecture implementation",
    "Distributed database design",
    "Application performance optimization"
]

# Normalize so that the dot product equals cosine similarity
embeddings = model.encode(texts, normalize_embeddings=True)
print(f"Shape: {embeddings.shape}")

# Similarity calculation (cosine, thanks to normalization)
similarity = np.dot(embeddings[0], embeddings[1])
print(f"Similarity: {similarity:.3f}")

Advantages: No API costs, full control, fast inference. Disadvantages: Lower quality than top commercial models, limited multilingual support.

Cohere Embed v3

Specialized embedding model with advanced compression and multilingual capabilities.

import cohere

co = cohere.Client("your-api-key")

response = co.embed(
    texts=["Database design", "System architecture"],
    model="embed-multilingual-v3.0",
    input_type="search_document",  # or "search_query"
    embedding_types=["float", "int8"]  # Compression to save space
)

# Float embeddings for maximum precision
float_embeddings = response.embeddings.float_

# Int8 embeddings for memory savings (4x smaller)
compressed_embeddings = response.embeddings.int8

Advantages: Good multilingual support, compression options, fast API. Disadvantages: Medium-to-high costs ($0.10 per 1M tokens).
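The int8 option matters at scale. A quick sketch of the raw memory math for a hypothetical index of 10M vectors, assuming 1024-dimensional embeddings (Embed v3's typical output size):

```python
def index_size_bytes(num_vectors, dims, bytes_per_value):
    """Raw storage for a dense vector index (ignores index overhead)."""
    return num_vectors * dims * bytes_per_value

vectors, dims = 10_000_000, 1024
float32_gib = index_size_bytes(vectors, dims, 4) / 1024**3  # 4 bytes per float32
int8_gib = index_size_bytes(vectors, dims, 1) / 1024**3     # 1 byte per int8
print(f"float32: {float32_gib:.1f} GiB, int8: {int8_gib:.1f} GiB")
# → float32: 38.1 GiB, int8: 9.5 GiB
```

The 4x reduction often decides whether the whole index fits in RAM, at a small cost in retrieval precision.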

Practical Quality Testing

For validating quality on your data, I recommend this approach:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def evaluate_embeddings(model_func, test_pairs):
    """
    test_pairs: List of tuples (text1, text2, expected_similarity)
    """
    results = []

    for text1, text2, expected in test_pairs:
        emb1 = model_func(text1)
        emb2 = model_func(text2)

        actual = cosine_similarity([emb1], [emb2])[0][0]
        results.append({
            'text1': text1,
            'text2': text2,
            'expected': expected,
            'actual': actual,
            'diff': abs(expected - actual)
        })

    return pd.DataFrame(results)

# Test data specific to your domain
test_data = [
    ("REST API documentation", "API reference guide", 0.8),
    ("Database migration", "Database schema", 0.6),
    ("Frontend components", "Backend services", 0.3)
]

# Model comparison (openai_embed_func / sbert_embed_func are thin
# wrappers that return a single embedding vector for a text)
results_openai = evaluate_embeddings(openai_embed_func, test_data)
results_sbert = evaluate_embeddings(sbert_embed_func, test_data)
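To pick a winner, collapse the per-pair diffs into a single number, for example the mean absolute error against your expected similarities. A pure-Python sketch of the aggregation, with hypothetical scores for two models on the three test pairs above (lower is better):

```python
def mean_abs_error(expected, actual):
    """Average |expected - actual| over test pairs; lower is better."""
    return sum(abs(e - a) for e, a in zip(expected, actual)) / len(expected)

# Hypothetical similarity scores from two models
expected = [0.8, 0.6, 0.3]
scores_a = [0.75, 0.55, 0.40]
scores_b = [0.60, 0.70, 0.50]

print(round(mean_abs_error(expected, scores_a), 3))  # → 0.067
print(round(mean_abs_error(expected, scores_b), 3))  # → 0.167
```

A single aggregate score makes it easy to track model quality over time as your test set grows.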

Cost and Performance Optimization

For production deployment, optimization is key:

Embedding Caching

import redis
import hashlib
import json

class EmbeddingCache:
    def __init__(self, redis_client, model_name):
        self.redis = redis_client
        self.model_name = model_name
        self.ttl = 86400 * 7  # 7 days

    def _get_key(self, text):
        text_hash = hashlib.md5(text.encode()).hexdigest()
        return f"emb:{self.model_name}:{text_hash}"

    def get_embedding(self, text, embed_func):
        key = self._get_key(text)
        cached = self.redis.get(key)

        if cached:
            return json.loads(cached)

        # Cache miss - compute embedding
        embedding = embed_func(text)
        self.redis.setex(key, self.ttl, json.dumps(embedding))
        return embedding

# Usage
cache = EmbeddingCache(redis_client, "text-embedding-3-large")
embedding = cache.get_embedding(document_text, openai_embed_func)
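For unit tests or local development without Redis, the same interface can be backed by a plain dict. A minimal in-memory stand-in, assuming the `EmbeddingCache` interface above (no TTL handling):

```python
class InMemoryEmbeddingCache:
    """Dict-backed stand-in for EmbeddingCache in tests (no TTL)."""

    def __init__(self, model_name):
        self.model_name = model_name
        self._store = {}
        self.hits = 0  # cache-hit counter, handy for assertions

    def get_embedding(self, text, embed_func):
        key = (self.model_name, text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss - compute and store
        embedding = embed_func(text)
        self._store[key] = embedding
        return embedding

# Usage with a dummy embed function
cache = InMemoryEmbeddingCache("test-model")
fake_embed = lambda text: [float(len(text))]
cache.get_embedding("hello", fake_embed)
cache.get_embedding("hello", fake_embed)
print(cache.hits)  # → 1
```

Swapping the backend behind a shared interface keeps caching logic testable without any infrastructure.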

Batch Processing

For larger volumes of data, use batch processing:

import time

def batch_embed_documents(documents, batch_size=100):
    """Efficient processing of large document volumes"""
    embeddings = []

    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]

        # OpenAI API supports batch requests
        # (assumes `client` from the first example)
        response = client.embeddings.create(
            model="text-embedding-3-large",
            input=batch
        )

        batch_embeddings = [data.embedding for data in response.data]
        embeddings.extend(batch_embeddings)

        # Simple rate limiting between batches
        time.sleep(0.1)

    return embeddings
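A fixed sleep alone won't survive rate-limit errors; production batch jobs usually add retries with exponential backoff. A minimal sketch, where `flaky_call` is a stand-in for the actual API request:

```python
import time

def with_retries(func, max_attempts=5, base_delay=0.5):
    """Retry func() with exponential backoff on any exception."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, propagate the error
            time.sleep(base_delay * 2 ** attempt)

# Stand-in for an API call that fails twice, then succeeds
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_retries(flaky_call, base_delay=0.01))  # → ok
```

In a real deployment you would catch the API client's specific rate-limit exception rather than bare `Exception`.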

Recommendations for Different Use Cases

High quality, medium volume: OpenAI text-embedding-3-large with caching

Large volume, cost control: Self-hosted Sentence-BERT or multilingual-e5-large

Multilingual: Cohere Embed v3 or mBERT variant

Real-time applications: Local model with GPU acceleration

A/B Testing Implementation

import hashlib

class EmbeddingABTest:
    def __init__(self, model_a, model_b, split_ratio=0.5):
        self.model_a = model_a
        self.model_b = model_b
        self.split_ratio = split_ratio
        self.metrics = {'a': [], 'b': []}

    def get_embedding(self, text, user_id):
        # Consistent split based on user_id; hashlib is stable across
        # processes, unlike the built-in hash() for strings
        bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 100
        use_a = bucket < self.split_ratio * 100

        if use_a:
            result = self.model_a.embed(text)
            variant = 'a'
        else:
            result = self.model_b.embed(text)
            variant = 'b'

        return result, variant

    def log_retrieval_quality(self, variant, relevance_score):
        """Log quality metrics for test evaluation"""
        self.metrics[variant].append(relevance_score)
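Once relevance scores accumulate per variant, evaluating the test can start with a simple comparison of means and spreads; for a production decision you would follow up with a proper significance test. A minimal self-contained sketch over hypothetical logged scores:

```python
from statistics import mean, stdev

def summarize_ab(metrics):
    """Summarize relevance scores per variant from A/B-test logs."""
    summary = {}
    for variant, scores in metrics.items():
        summary[variant] = {
            "n": len(scores),
            "mean": mean(scores),
            "stdev": stdev(scores) if len(scores) > 1 else 0.0,
        }
    return summary

# Hypothetical logged relevance scores (0..1) per variant
metrics = {
    "a": [0.82, 0.78, 0.85, 0.80],
    "b": [0.70, 0.74, 0.69, 0.73],
}
for variant, stats in summarize_ab(metrics).items():
    print(variant, stats)
```

Only switch models once the difference is both practically meaningful and statistically stable across enough queries.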

Summary

Choosing an embedding model depends on your project’s specific requirements. OpenAI text-embedding-3-large offers the best quality for critical applications, while open-source alternatives like Sentence-BERT provide control over costs and data. The key is testing on your own data and implementing metrics for continuous quality evaluation. Don’t forget caching and batch processing for performance optimization in production.

embeddings · comparison · rag

CORE SYSTEMS team

We build core systems and AI agents that keep operations running. 15 years of experience in enterprise IT.