The Transformer architecture is a breakthrough in artificial intelligence that powers modern language models such as GPT and BERT. This guide explains in simple terms how key mechanisms such as attention and self-attention work, and why Transformers are so effective.
What Is the Transformer Architecture¶
The Transformer architecture is a revolutionary approach to sequence processing that has dominated natural language processing since its introduction in 2017. Its key innovation is the self-attention mechanism, which lets the model track relationships between all positions in a sequence simultaneously, unlike the sequential processing of RNNs and LSTMs.
The basic principle is to transform sequences of input tokens into vectors using an attention mechanism that weights the importance of every position for each element of the sequence. This lets the model learn contextual word representations, where a word's meaning depends on the context of the entire sentence.
The Transformer Architecture¶
Encoder-Decoder Structure¶
The original Transformer consists of two main parts:
- Encoder - processes input sequence and creates contextual representations
- Decoder - generates output sequence based on encoded representations
Each part contains a stack of several identical layers (typically 6), with each layer having two main components: multi-head attention and a feed-forward network.
The Self-Attention Mechanism¶
The heart of the Transformer is scaled dot-product attention:
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: query matrix  [batch_size, ..., seq_len, d_k]
    K: key matrix    [batch_size, ..., seq_len, d_k]
    V: value matrix  [batch_size, ..., seq_len, d_k]
    mask: positions where mask == 0 are excluded from attention
    """
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # A large negative score becomes ~0 after the softmax
        scores = scores.masked_fill(mask == 0, -1e9)
    attention_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, V)
    return output, attention_weights
The mechanism works by creating three vectors for each token - Query (what am I looking for?), Key (what do I offer?) and Value (what do I pass along?). The attention score is the dot product between the Query and Key vectors, scaled by the square root of the key dimension.
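A quick sanity check of the mechanism on toy tensors (computed inline so the snippet runs standalone): after the softmax, each row of the attention matrix is a probability distribution over the input positions.

```python
import math

import torch
import torch.nn.functional as F

# Toy input: batch of 1, three tokens, d_model = 4
torch.manual_seed(0)
Q = torch.randn(1, 3, 4)
K = torch.randn(1, 3, 4)
V = torch.randn(1, 3, 4)

scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(Q.size(-1))
weights = F.softmax(scores, dim=-1)
output = torch.matmul(weights, V)

# Each row of the attention matrix sums to 1 (up to float error)
print(weights.sum(dim=-1))
print(output.shape)  # torch.Size([1, 3, 4])
```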
Multi-Head Attention¶
Multi-head attention allows the model to track different types of relationships simultaneously:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Project and split into heads: [batch, heads, seq_len, d_k]
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Attention for all heads in parallel
        attention, _ = scaled_dot_product_attention(Q, K, V, mask)

        # Concatenate heads back to [batch, seq_len, d_model]
        attention = attention.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )
        return self.W_o(attention)
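The view/transpose bookkeeping is easy to get wrong, so it is worth checking on a toy tensor (sizes chosen here to match the original paper: d_model = 512, 8 heads of 64 dimensions each):

```python
import torch

batch, seq_len, d_model, num_heads = 2, 10, 512, 8
d_k = d_model // num_heads  # 64 dimensions per head

x = torch.randn(batch, seq_len, d_model)

# Split: [batch, seq_len, d_model] -> [batch, heads, seq_len, d_k]
heads = x.view(batch, seq_len, num_heads, d_k).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 8, 10, 64])

# Concatenating the heads reverses the split exactly
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
print(torch.equal(merged, x))  # True
```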
Positional Encoding¶
Since Transformers have no inherent notion of order, token positions must be encoded explicitly. The original paper uses sinusoidal functions:
def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1).float()
    # Geometric progression of frequencies, from wavelength 2*pi up to 10000*2*pi
    div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                         -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe
This encoding allows the model to distinguish positions and learn distance-dependent relationships between tokens.
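These properties can be spot-checked (the function is repeated inline so the snippet runs standalone): at position 0 the even dimensions are sin(0) = 0 and the odd dimensions are cos(0) = 1.

```python
import math

import torch

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                         -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)      # torch.Size([50, 16])
print(pe[0, 0::2])   # zeros: sin(0) in the even dimensions
print(pe[0, 1::2])   # ones: cos(0) in the odd dimensions
```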
Variants of the Transformer Architecture¶
BERT (Bidirectional Encoder Representations from Transformers)¶
BERT uses only the encoder part and trains bidirectionally using masked language modeling:
- Randomly masks 15% of the tokens in the input sequence
- Learns to predict masked tokens based on entire context
- Excellent for text understanding tasks (classification, NER, QA)
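A minimal sketch of the masking step (simplified: real BERT also replaces some of the selected tokens with random tokens or leaves them unchanged, and the `MASK_ID` below is an assumed vocabulary id). The label -100 is PyTorch's default ignore index for cross-entropy, so the loss is computed only on masked positions:

```python
import torch

MASK_ID = 103  # assumed [MASK] token id

def mask_tokens(input_ids, mask_prob=0.15):
    """Simplified masked-LM corruption: replace ~15% of tokens with [MASK]."""
    labels = input_ids.clone()
    # Bernoulli draw decides which positions the model must predict
    selected = torch.rand(input_ids.shape) < mask_prob
    labels[~selected] = -100          # ignored by cross-entropy loss
    corrupted = input_ids.clone()
    corrupted[selected] = MASK_ID
    return corrupted, labels

torch.manual_seed(0)
ids = torch.randint(1000, 2000, (1, 12))
corrupted, labels = mask_tokens(ids)
```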
GPT (Generative Pre-trained Transformer)¶
GPT uses only the decoder part with causal masking:
def create_causal_mask(seq_len):
    """Creates the mask for autoregressive generation"""
    # Lower-triangular matrix: position i may attend to positions 0..i
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask.unsqueeze(0).unsqueeze(0)  # [1, 1, seq_len, seq_len]
This mask ensures that position i can attend only to positions 0 through i, so the model's prediction of the next token never depends on future tokens.
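The effect is easy to visualize for a short sequence: the mask is lower triangular, and after the softmax every future position receives zero attention weight.

```python
import torch

seq_len = 4
mask = torch.tril(torch.ones(seq_len, seq_len))
print(mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])

# Masked positions get -1e9, which softmax turns into ~zero weight
scores = torch.zeros(seq_len, seq_len).masked_fill(mask == 0, -1e9)
weights = torch.softmax(scores, dim=-1)
print(weights[1])  # position 1 splits its attention between positions 0 and 1
```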
A Basic Transformer Implementation¶
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection and layer norm
        attn_output = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Feed-forward with residual connection and layer norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x
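For comparison, PyTorch ships an equivalent building block, `nn.TransformerEncoderLayer`, which also uses post-norm residuals and a ReLU feed-forward by default. A quick shape check (sizes chosen arbitrarily):

```python
import torch
import torch.nn as nn

# Built-in counterpart of the block above
layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048,
    dropout=0.1, batch_first=True,
)
x = torch.randn(2, 10, 512)   # [batch, seq_len, d_model]
out = layer(x)
print(out.shape)  # torch.Size([2, 10, 512])
```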
Advantages and Disadvantages¶
Advantages:¶
- Parallelization - unlike RNNs, all positions can be processed simultaneously
- Long dependencies - attention mechanism enables direct connection of distant positions
- Interpretability - attention weights provide insight into what the model focuses on
- Transfer learning - pre-trained models can be fine-tuned for specific tasks
Disadvantages:¶
- Computational complexity - O(n²) with respect to sequence length
- Memory requirements - attention matrix grows quadratically
- Data requirements - requires large amounts of training data for good performance
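The quadratic growth is easy to quantify: a float32 attention matrix takes seq_len × seq_len × 4 bytes per head, so quadrupling the context multiplies the memory by sixteen.

```python
# Memory of one float32 attention matrix per head
for seq_len in (512, 2048, 8192):
    mib = seq_len * seq_len * 4 / 2**20
    print(f"{seq_len:>5} tokens -> {mib:8.1f} MiB per head")
```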
Summary¶
The Transformer architecture represents a fundamental advance in sequence processing. Its key innovations - the self-attention mechanism and parallelizability - enabled the development of advanced models such as GPT, BERT and their successors. For practical work, it is important to understand that the different variants (encoder-only, decoder-only, encoder-decoder) suit different types of tasks. While the implementation can be complex, the principles are elegant and provide a solid foundation for understanding modern AI systems.