
Transformer Architecture — A Complete Guide

15. 09. 2024 4 min read intermediate

Transformer architecture is the breakthrough in artificial intelligence behind today's most capable language models, such as GPT and BERT. This guide explains in simple terms how the key mechanisms - attention and self-attention - work, and why Transformers are so effective.

What is Transformer Architecture

Transformer architecture represents a revolutionary approach to sequence processing that has dominated Natural Language Processing since 2017. The key innovation is the self-attention mechanism, which allows the model to track relationships between all positions in a sequence simultaneously, unlike sequential processing in RNN or LSTM networks.

The basic principle is to transform the input sequence of tokens into vectors using an attention mechanism that weights how important each position in the sequence is for every other element. This lets the model learn contextual word representations, where the meaning of a word depends on the whole sentence it appears in.

Transformer Architecture

Encoder-Decoder Structure

The original Transformer consists of two main parts:

  • Encoder - processes input sequence and creates contextual representations
  • Decoder - generates output sequence based on encoded representations

Each part is a stack of several identical layers (six in the original paper). Every encoder layer has two main sub-layers: multi-head attention and a position-wise feed-forward network; decoder layers add a third sub-layer that attends over the encoder output.
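
For a sense of scale, the same stack can be sketched with PyTorch's built-in nn.TransformerEncoderLayer and nn.TransformerEncoder, using the hyperparameters of the original paper (6 layers, model dimension 512, 8 attention heads, feed-forward dimension 2048); a from-scratch version of one such layer appears later in this guide:

import torch
import torch.nn as nn

# Encoder stack assembled from PyTorch's built-in layers (for orientation only)
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

x = torch.randn(2, 10, 512)      # [batch, seq_len, d_model]
print(encoder(x).shape)          # torch.Size([2, 10, 512])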

Self-Attention Mechanism

The heart of the Transformer is scaled dot-product attention:

# Imports used by the code examples throughout this article
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: Query matrix [batch_size, seq_len, d_model]
    K: Key matrix [batch_size, seq_len, d_model]
    V: Value matrix [batch_size, seq_len, d_model]
    mask: optional, broadcastable to the score shape; positions where mask == 0 are hidden
    """
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    attention_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, V)

    return output, attention_weights

The mechanism creates three vectors for each token - Query (what I'm looking for), Key (what I'm offering) and Value (what I'm passing on). The attention score is computed as the dot product of the Query and Key vectors, scaled by the square root of the key dimension (hence "scaled dot-product") and converted into weights with a softmax.
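
A quick sanity check of the function above, with arbitrary example sizes (batch of 2, sequence of 5 tokens, model dimension 16):

# Illustrative self-attention call: query = key = value
batch_size, seq_len, d_model = 2, 5, 16
x = torch.randn(batch_size, seq_len, d_model)

output, weights = scaled_dot_product_attention(x, x, x)
print(output.shape)          # torch.Size([2, 5, 16])
print(weights.shape)         # torch.Size([2, 5, 5])
print(weights.sum(dim=-1))   # every row sums to 1 thanks to the softmax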

Multi-Head Attention

Multi-head attention allows the model to track different types of relationships simultaneously:

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Split into heads
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Attention for each head
        attention, _ = scaled_dot_product_attention(Q, K, V, mask)

        # Concatenate heads
        attention = attention.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )

        return self.W_o(attention)
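
An example forward pass through the module above, with assumed sizes (8 heads over a 512-dimensional model):

# Illustrative usage: self-attention, so query = key = value
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)      # [batch, seq_len, d_model]

out = mha(x, x, x)
print(out.shape)                 # torch.Size([2, 10, 512])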

Positional Encoding

Since Transformers don’t have an inherent concept of order, we must explicitly encode token positions. Sinusoidal functions are used:

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1).float()

    div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                        -(math.log(10000.0) / d_model))

    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)

    return pe

This encoding lets the model distinguish positions and, because the encoding of a shifted position is a linear function of the original one, also learn relative, distance-dependent relationships between tokens.
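
In practice the encoding is simply added to the token embeddings before the first layer; a short sketch with assumed sizes (vocabulary of 30 000 tokens, model dimension 512):

# Illustrative usage: add positional information to token embeddings
vocab_size, seq_len, d_model = 30_000, 10, 512
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.randint(0, vocab_size, (1, seq_len))    # [batch, seq_len]
x = embedding(token_ids)                                   # [1, seq_len, d_model]
x = x + positional_encoding(seq_len, d_model)              # broadcasts over the batch
print(x.shape)                                             # torch.Size([1, 10, 512])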

Transformer Architecture Variants

BERT (Bidirectional Encoder Representations)

BERT uses only the encoder part and trains bidirectionally using masked language modeling:

  • Randomly masks 15% of the tokens in the input sequence (a simplified sketch follows this list)
  • Learns to predict the masked tokens from the entire surrounding context
  • Excellent for text-understanding tasks (classification, NER, QA)
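
A deliberately simplified sketch of that masking step (the real BERT recipe also replaces some selected tokens with random tokens or leaves them unchanged; here only plain [MASK] substitution is shown, and mask_token_id is a hypothetical ID from the tokenizer):

def mask_tokens(token_ids, mask_token_id, mask_prob=0.15):
    """Simplified masked-language-modeling input preparation (illustrative only)."""
    inputs = token_ids.clone()
    labels = token_ids.clone()

    # Select roughly 15% of the positions at random
    masked = torch.rand(token_ids.shape) < mask_prob

    inputs[masked] = mask_token_id    # replace selected tokens with [MASK]
    labels[~masked] = -100            # -100 = default ignore_index of nn.CrossEntropyLoss
    return inputs, labels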

GPT (Generative Pre-trained Transformer)

GPT uses only the decoder part with causal masking:

def create_causal_mask(seq_len):
    """Creates mask for autoregressive generation"""
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask.unsqueeze(0).unsqueeze(0)  # [1, 1, seq_len, seq_len]

This mask ensures that, when predicting the token at position i, the model sees only the tokens at positions 0 through i-1.
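
Combining this mask with the scaled_dot_product_attention function defined earlier (arbitrary example sizes; the extra singleton dimension is squeezed away so the shapes broadcast):

# Illustrative check: with the causal mask, attention weights above the diagonal are zero
seq_len, d_model = 4, 8
x = torch.randn(1, seq_len, d_model)

mask = create_causal_mask(seq_len).squeeze(0)        # [1, seq_len, seq_len]
_, weights = scaled_dot_product_attention(x, x, x, mask)
print(weights[0])    # lower-triangular: token i attends only to positions 0..i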

Basic Transformer Block Implementation

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Feed-forward with residual connection
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))

        return x
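
An example pass through the block, with assumed sizes matching the earlier examples:

# Illustrative usage: one encoder-style block preserves the input shape
block = TransformerBlock(d_model=512, num_heads=8, d_ff=2048)
x = torch.randn(2, 10, 512)      # [batch, seq_len, d_model]

out = block(x)
print(out.shape)                 # torch.Size([2, 10, 512])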

Advantages and Disadvantages

Advantages:

  • Parallelization - unlike RNNs, all positions in a sequence can be processed simultaneously during training
  • Long-range dependencies - the attention mechanism connects distant positions directly
  • Interpretability - attention weights provide insight into what the model focuses on
  • Transfer learning - pre-trained models can be fine-tuned for specific tasks

Disadvantages:

  • Computational complexity - O(n²) in the sequence length
  • Memory requirements - the attention matrix grows quadratically with sequence length (see the rough estimate after this list)
  • Data requirements - large amounts of training data are needed for good performance
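
To get a feel for that quadratic growth, here is a rough estimate of the memory taken by a single seq_len × seq_len attention matrix for one head, assuming 32-bit floats:

# Rough memory estimate for one attention matrix per head (fp32 = 4 bytes)
for seq_len in (1_024, 4_096, 16_384):
    mib = seq_len * seq_len * 4 / 2**20
    print(f"seq_len={seq_len:>6}: {mib:>8.1f} MiB per head")

# seq_len=  1024:      4.0 MiB per head
# seq_len=  4096:     64.0 MiB per head
# seq_len= 16384:   1024.0 MiB per head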

Summary

Transformer architecture represents a fundamental advance in sequence processing. Its key innovations - the self-attention mechanism and parallelizability - enabled the creation of advanced models like GPT, BERT and their successors. In practice it is important to understand that the different variants (encoder-only, decoder-only, encoder-decoder) suit different types of tasks. While the implementation can get complex, the underlying principles are elegant and provide a solid foundation for understanding modern AI systems.

transformer, gpt, bert

CORE SYSTEMS team

We build core systems and AI agents that keep operations running. 15 years of experience with enterprise IT.