The Transformer architecture is a breakthrough in artificial intelligence that powers modern language models such as GPT and BERT. This guide explains in simple terms how key mechanisms like attention and self-attention work, and why Transformers are so effective.
What Is the Transformer Architecture?¶
The Transformer is a revolutionary approach to sequence processing that has dominated natural language processing since its introduction in 2017. Its key innovation is the self-attention mechanism, which lets the model relate all positions in a sequence simultaneously, unlike the sequential processing of RNN or LSTM networks.
The basic principle is to transform an input sequence of tokens into vector representations using an attention mechanism that, for each element of the sequence, weights the importance of every position. This allows the model to learn contextual word representations, where a word's meaning depends on the context of the entire sentence.
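As a minimal illustration of this weighting idea (the numbers and the 4-token, 3-dimensional setup are purely hypothetical), the following snippet builds a context vector for one token as a softmax-weighted mix of all token vectors:

import torch

# Hypothetical embeddings for a 4-token sentence, d_model = 3 (toy numbers)
tokens = torch.tensor([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.5, 0.5, 0.0],
                       [0.0, 0.0, 1.0]])

query = tokens[2]                          # the token we are contextualizing
scores = tokens @ query                    # similarity with every position
weights = torch.softmax(scores, dim=0)     # importance of each position
context = weights @ tokens                 # weighted mix = contextual representation
print(weights, context)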
Transformer Architecture¶
Encoder-Decoder Structure¶
The original Transformer consists of two main parts:
- Encoder - processes the input sequence and builds contextual representations
- Decoder - generates the output sequence based on the encoded representations
Each part contains a stack of identical layers (typically 6). Every layer has two main components, a multi-head attention block and a position-wise feed-forward network; decoder layers add a third, cross-attention over the encoder output.
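For orientation, PyTorch ships a reference implementation of exactly this encoder-decoder structure. A minimal sketch, assuming a recent PyTorch version with batch_first support and with dimensions chosen arbitrarily:

import torch
import torch.nn as nn

# 6 encoder and 6 decoder layers, as in the original paper
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       dim_feedforward=2048, batch_first=True)

src = torch.rand(2, 10, 512)   # (batch, source length, d_model)
tgt = torch.rand(2, 7, 512)    # (batch, target length, d_model)
out = model(src, tgt)          # (2, 7, 512)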
Self-Attention Mechanism¶
The heart of the Transformer is scaled dot-product attention:
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: Query matrix [..., seq_len, d_k]
    K: Key matrix   [..., seq_len, d_k]
    V: Value matrix [..., seq_len, d_k]
    """
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions get a large negative score -> near-zero weight after softmax
        scores = scores.masked_fill(mask == 0, -1e9)
    attention_weights = F.softmax(scores, dim=-1)
    # Each output vector is a weighted average of the value vectors
    output = torch.matmul(attention_weights, V)
    return output, attention_weights
The mechanism creates three vectors for each token: a Query (what am I looking for), a Key (what do I offer) and a Value (what do I pass along). The attention score is computed as the dot product between Query and Key vectors, scaled by the square root of the key dimension (√d_k).
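In formula form this is Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V. A quick sanity check of the function above with random tensors (shapes chosen arbitrarily for illustration):

Q = K = V = torch.rand(1, 5, 64)   # batch of 1, 5 tokens, d_k = 64
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)       # torch.Size([1, 5, 64])
print(weights.shape)   # torch.Size([1, 5, 5]); each row sums to 1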
Multi-Head Attention¶
Multi-head attention allows the model to track different types of relationships simultaneously:
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Learned projections for queries, keys, values and the output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        # Project and split into heads: [batch, heads, seq_len, d_k]
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        # Attention for each head in parallel
        attention, _ = scaled_dot_product_attention(Q, K, V, mask)
        # Concatenate heads back to [batch, seq_len, d_model]
        attention = attention.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )
        return self.W_o(attention)
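A short usage sketch of the class above (the dimensions are hypothetical); passing the same tensor as query, key and value gives self-attention:

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.rand(2, 10, 512)   # (batch, seq_len, d_model)
out = mha(x, x, x)           # self-attention: query = key = value = x
print(out.shape)             # torch.Size([2, 10, 512])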
Positional Encoding¶
Since the attention mechanism itself has no inherent notion of order, token positions must be encoded explicitly. The original Transformer uses sinusoidal functions:
def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1).float()
    # Geometric progression of wavelengths, from 2*pi up to 10000*2*pi
    div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                         -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe
This encoding allows the model to distinguish positions and learn distance-dependent relationships between tokens.
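In practice, the encoding is simply added to the token embeddings before the first Transformer layer. A minimal sketch, with a hypothetical embedding layer and vocabulary size:

vocab_size, d_model, seq_len = 10000, 512, 20
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.randint(0, vocab_size, (2, seq_len))              # (batch, seq_len)
x = embedding(token_ids) + positional_encoding(seq_len, d_model)    # broadcast over batch
print(x.shape)   # torch.Size([2, 20, 512])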
Transformer Architecture Variants¶
BERT (Bidirectional Encoder Representations)¶
BERT uses only the encoder part and is trained bidirectionally using masked language modeling (a minimal masking sketch follows the list):
- Randomly masks 15% of the tokens in the input sequence
- Learns to predict the masked tokens from the entire surrounding context
- Excellent for text-understanding tasks (classification, NER, QA)
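A minimal sketch of the masking step. The function name, the 15% rate and the [MASK] token id are illustrative; real BERT preprocessing additionally replaces some selected tokens with random or unchanged tokens. The -100 label follows the common PyTorch convention of ignoring those positions in the cross-entropy loss:

def mask_tokens(token_ids, mask_token_id, mask_prob=0.15):
    """Replace a random 15% of tokens with [MASK]; return corrupted inputs and labels."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < mask_prob   # choose positions to mask
    masked_input = token_ids.clone()
    masked_input[mask] = mask_token_id               # corrupt the input
    labels[~mask] = -100                             # ignore unmasked positions in the loss
    return masked_input, labels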
GPT (Generative Pre-trained Transformer)¶
GPT uses only the decoder part with causal masking:
def create_causal_mask(seq_len):
    """Creates mask for autoregressive generation"""
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask.unsqueeze(0).unsqueeze(0)  # [1, 1, seq_len, seq_len]
This mask ensures that when predicting the token at position i, the model sees only the tokens at positions 0 through i-1.
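For example, for a sequence of length 4 the mask looks like this (1 = visible, 0 = hidden); passed to scaled_dot_product_attention, it drives attention to future positions toward zero:

mask = create_causal_mask(4)
print(mask[0, 0])
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])

x = torch.rand(1, 4, 64)
out, weights = scaled_dot_product_attention(x, x, x, mask[0])  # weights above the diagonal ≈ 0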
Basic Transformer Implementation¶
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        # Feed-forward with residual connection
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x
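Stacking these blocks on top of embeddings and positional encoding yields a small encoder. A sketch reusing the components defined above, with hypothetical hyperparameters and vocabulary size:

class MiniEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=256, num_heads=4, d_ff=1024,
                 num_layers=6, max_len=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Precomputed sinusoidal positions, stored as a non-trainable buffer
        self.register_buffer("pe", positional_encoding(max_len, d_model))
        self.layers = nn.ModuleList(
            TransformerBlock(d_model, num_heads, d_ff) for _ in range(num_layers)
        )

    def forward(self, token_ids, mask=None):
        x = self.embedding(token_ids) + self.pe[: token_ids.size(1)]
        for layer in self.layers:
            x = layer(x, mask)
        return x

encoder = MiniEncoder(vocab_size=10000)
out = encoder(torch.randint(0, 10000, (2, 20)))   # (2, 20, 256)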
Advantages and Disadvantages¶
Advantages:¶
- Parallelization - unlike RNNs, all positions can be processed simultaneously during training
- Long-range dependencies - the attention mechanism directly connects distant positions
- Interpretability - attention weights provide insight into what the model focuses on
- Transfer learning - pre-trained models can be fine-tuned for specific tasks
Disadvantages:¶
- Computational complexity - O(n²) with respect to sequence length
- Memory requirements - attention matrix grows quadratically
- Data requirements - requires large amounts of training data for good performance
Summary¶
The Transformer architecture is a fundamental advance in sequence processing. Its key innovations, the self-attention mechanism and parallelizability, enabled advanced models such as GPT, BERT and their successors. In practice, it is important to understand that the different variants (encoder-only, decoder-only, encoder-decoder) suit different types of tasks. While the implementation can be complex, the principles are elegant and provide a solid foundation for understanding modern AI systems.