
Embedding Models: How AI Understands Text

15. 05. 2025 5 min read intermediate

Embedding models are the fundamental building blocks of modern AI for working with text. They enable computers to convert words and sentences into numerical vectors that capture their meaning and allow comparison of text similarity.

What Are Embedding Models

Embedding models are a fundamental technology that enables machines to “understand” text similarly to humans. While computers work only with numbers, text consists of words and phrases. Embedding models act as a translator, converting words, sentences, or entire documents into vectors of numbers while preserving semantic meaning.

A key property of quality embeddings is the ability to capture relationships between words. For example, the words “king” and “queen” will have similar vector representations, as will “Paris” and “France”. These relationships manifest as distances and directions in vector space.

How Embeddings Work in Practice

Imagine each word as a point in multi-dimensional space. Modern embedding models typically use anywhere from 300 to 1,536 dimensions (OpenAI's text-embedding-ada-002, for example, produces 1,536-dimensional vectors). Words with similar meanings are placed close to each other, while words with different meanings are distant.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Example of simple 3D embeddings
embeddings = {
    'king': np.array([0.1, 0.8, 0.3]),
    'queen': np.array([0.2, 0.7, 0.4]),
    'man': np.array([0.0, 0.9, 0.1]),
    'woman': np.array([0.1, 0.6, 0.2]),
    'car': np.array([0.8, 0.1, 0.9])
}

# Calculate similarity between "king" and "queen"
similarity = cosine_similarity([embeddings['king']], [embeddings['queen']])
print(f"King-queen similarity: {similarity[0][0]:.3f}")
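These relationships can also be probed with vector arithmetic: with good embeddings, king - man + woman lands close to queen. A minimal sketch reusing the toy 3-D vectors from the example above (real models need hundreds of dimensions for such analogies to work reliably):

```python
import numpy as np

# Toy 3-D embeddings from the example above
embeddings = {
    'king': np.array([0.1, 0.8, 0.3]),
    'queen': np.array([0.2, 0.7, 0.4]),
    'man': np.array([0.0, 0.9, 0.1]),
    'woman': np.array([0.1, 0.6, 0.2]),
}

# king - man + woman should land near queen
analogy = embeddings['king'] - embeddings['man'] + embeddings['woman']

def cosine(a, b):
    # Cosine similarity: dot product of the two vectors divided by their norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"Similarity of (king - man + woman) to queen: {cosine(analogy, embeddings['queen']):.3f}")
```

With these toy vectors the analogy vector is already very close to the vector for "queen".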

Word2Vec: Pioneer of Modern Embeddings

Word2Vec, introduced by Google in 2013, was one of the first widely successful embedding models. It uses two training architectures:

  • CBOW (Continuous Bag of Words): Predicts a word from its surrounding context
  • Skip-gram: Predicts the surrounding context from a given word

from gensim.models import Word2Vec
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Data preparation
text = "Artificial intelligence helps companies automate processes. Machine learning is part of AI."
sentences = [word_tokenize(sent.lower()) for sent in sent_tokenize(text)]

# Training Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get vector for word
vector = model.wv['intelligence']
print(f"Vector for 'intelligence': {vector[:5]}...")  # First 5 dimensions

# Find similar words
similar_words = model.wv.most_similar('artificial', topn=3)
print(f"Words similar to 'artificial': {similar_words}")

Modern Transformer-Based Embeddings

While Word2Vec creates static representations, modern models like BERT, RoBERTa, or OpenAI Ada generate contextual embeddings. The same word has different vectors depending on the context in which it appears.

from transformers import AutoTokenizer, AutoModel
import torch

# Load English BERT model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def get_sentence_embedding(text):
    # Tokenization
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)

    # Forward pass
    with torch.no_grad():
        outputs = model(**inputs)

    # Mean pooling of last hidden states
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings.numpy()

# Get embeddings for sentences
sentence1 = "Artificial intelligence is changing the world."
sentence2 = "AI is transforming our society."

embedding1 = get_sentence_embedding(sentence1)
embedding2 = get_sentence_embedding(sentence2)

# Calculate similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(embedding1, embedding2)[0][0]
print(f"Sentence similarity: {similarity:.3f}")
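The contextuality itself is easy to demonstrate: extracting the hidden state of a single token shows that the same word receives a different vector in each sentence. A small sketch reusing bert-base-uncased (token_embedding is a helper defined here for illustration, not part of the transformers API):

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_embedding(sentence, word):
    """Return the contextual vector of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # [seq_len, 768]
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    return hidden[tokens.index(word)]

# The same word "bank" in two different contexts
financial = token_embedding("I deposited money at the bank.", "bank")
river = token_embedding("We sat on the bank of the river.", "bank")

similarity = torch.cosine_similarity(financial, river, dim=0).item()
print(f"'bank' (finance) vs 'bank' (river): {similarity:.3f}")
```

The two vectors are related but not identical; a static model like Word2Vec would assign both occurrences the exact same vector.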

OpenAI Embeddings API

For production use, we often rely on ready-made APIs. OpenAI offers powerful embedding models (text-embedding-ada-002 and the newer text-embedding-3 family) available via REST API:

from openai import OpenAI
import numpy as np

# Initialize the client (it reads OPENAI_API_KEY from the environment by default)
client = OpenAI(api_key="your-api-key")

def get_openai_embedding(text, model="text-embedding-ada-002"):
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return np.array(response.data[0].embedding)

# Get embeddings
texts = [
    "How to implement document search?",
    "What's the best way to do full-text search?",
    "Recipes for goulash with dumplings"
]

embeddings = [get_openai_embedding(text) for text in texts]

# Compare similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(embeddings)
print("Similarity matrix:")
print(similarity_matrix)

Practical Applications of Embeddings

Traditional keyword search looks for exact matches. Semantic search using embeddings understands meaning and finds relevant results even without exact keyword matches.

import faiss
import numpy as np

class SemanticSearch:
    def __init__(self):
        self.index = None
        self.documents = []
        self.embeddings = []

    def add_documents(self, docs, embeddings):
        self.documents.extend(docs)
        self.embeddings.extend(embeddings)

        # Rebuild the FAISS index over all stored embeddings
        dimension = len(self.embeddings[0])
        self.index = faiss.IndexFlatIP(dimension)  # Inner product; normalize vectors first for cosine similarity
        self.index.add(np.array(self.embeddings).astype('float32'))

    def search(self, query_embedding, top_k=5):
        # Search for most similar documents
        scores, indices = self.index.search(
            np.array([query_embedding]).astype('float32'), 
            top_k
        )

        results = []
        for score, idx in zip(scores[0], indices[0]):
            results.append({
                'document': self.documents[idx],
                'score': float(score)
            })

        return results

# Usage
search_engine = SemanticSearch()
documents = [
    "Python is a programming language",
    "JavaScript is used for web development",
    "Machine learning algorithms in Python"
]

# Compute embeddings and index the documents
doc_embeddings = [get_openai_embedding(doc) for doc in documents]
search_engine.add_documents(documents, doc_embeddings)

# Search
query = "web application programming"
query_embedding = get_openai_embedding(query)
results = search_engine.search(query_embedding, top_k=2)

for result in results:
    print(f"Score: {result['score']:.3f} - {result['document']}")

Clustering and Classification

Embeddings enable grouping similar documents or classifying them into categories based on semantic similarity:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Document clustering
def cluster_documents(embeddings, n_clusters=3):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    clusters = kmeans.fit_predict(embeddings)
    return clusters, kmeans

# PCA for visualization
def visualize_embeddings(embeddings, labels, documents):
    pca = PCA(n_components=2)
    reduced_embeddings = pca.fit_transform(embeddings)

    plt.figure(figsize=(10, 8))
    for i, doc in enumerate(documents):
        plt.scatter(reduced_embeddings[i, 0], reduced_embeddings[i, 1], 
                   c=f'C{labels[i]}', s=100)
        plt.annotate(doc[:30] + '...', 
                    (reduced_embeddings[i, 0], reduced_embeddings[i, 1]))

    plt.title('Document Embeddings Visualization')
    plt.xlabel('PCA Component 1')
    plt.ylabel('PCA Component 2')
    plt.show()

# Usage
clusters, model = cluster_documents(np.array(doc_embeddings))
print(f"Clusters: {clusters}")

Optimization and Best Practices

Choosing the Right Model

Different embedding models suit different purposes:

  • Word2Vec/FastText: Fast, suitable for basic tasks
  • BERT/RoBERTa: Contextual, better for complex NLP tasks
  • Sentence-BERT: Optimized for sentence similarity
  • OpenAI Ada-002 / text-embedding-3: Universal, high quality, served via API

Caching and Optimization

import pickle
import hashlib

class EmbeddingCache:
    def __init__(self, cache_file="embeddings_cache.pkl"):
        self.cache_file = cache_file
        self.cache = self._load_cache()

    def _load_cache(self):
        try:
            with open(self.cache_file, 'rb') as f:
                return pickle.load(f)
        except FileNotFoundError:
            return {}

    def _save_cache(self):
        with open(self.cache_file, 'wb') as f:
            pickle.dump(self.cache, f)

    def get_embedding(self, text, model_func):
        # Create a text hash as the cache key (include the model name too if you switch between models)
        text_hash = hashlib.md5(text.encode()).hexdigest()

        if text_hash in self.cache:
            return self.cache[text_hash]

        # Calculate new embedding
        embedding = model_func(text)
        self.cache[text_hash] = embedding
        self._save_cache()

        return embedding

# Usage
cache = EmbeddingCache()
embedding = cache.get_embedding("test text", get_openai_embedding)

Summary

Embedding models are a key technology in modern NLP that enables machines to understand text at the semantic level. From classic Word2Vec models to modern transformer-based architectures, embeddings find application in a wide range of use cases, from search and recommendation systems to advanced text analysis. Proper model selection, efficient caching, and an understanding of the underlying principles will help you build robust AI applications based on natural language processing.

Tags: embeddings, word2vec, nlp

CORE SYSTEMS team

We build core systems and AI agents that keep operations running. 15 years of experience in enterprise IT.