The Transformer architecture is a breakthrough in artificial intelligence that powers modern language models such as GPT and BERT. This guide explains in simple terms how key mechanisms such as attention and self-attention work, and why Transformers are so effective.
What Is the Transformer Architecture¶
The Transformer architecture is a revolutionary approach to sequence processing that has dominated natural language processing since its introduction in 2017. Its key innovation is the self-attention mechanism, which lets the model track relationships between all positions in a sequence simultaneously, unlike the sequential processing of RNNs and LSTMs.
The basic principle is to transform sequences of input tokens into vectors using an attention mechanism that weights the importance of every position for each element of the sequence. This lets the model learn contextual word representations, where a word's meaning depends on the context of the entire sentence.
The Transformer Architecture¶
Encoder-Decoder Structure¶
The original Transformer consists of two main parts:
- Encoder - processes input sequence and creates contextual representations
- Decoder - generates output sequence based on encoded representations
Each part contains a stack of several identical layers (typically 6), with each layer having two main components: multi-head attention and a feed-forward network.
The Self-Attention Mechanism¶
The heart of the Transformer is scaled dot-product attention:
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: query matrix  [batch_size, ..., seq_len, d_k]
    K: key matrix    [batch_size, ..., seq_len, d_k]
    V: value matrix  [batch_size, ..., seq_len, d_k]
    mask: positions where mask == 0 are excluded from attention
    """
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # A large negative score becomes ~0 after the softmax
        scores = scores.masked_fill(mask == 0, -1e9)
    attention_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, V)
    return output, attention_weights
The mechanism works by creating three vectors for each token - Query (what am I looking for?), Key (what do I offer?) and Value (what do I pass along?). The attention score is the dot product between the Query and Key vectors, scaled by the square root of the key dimension.
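A quick sanity check of the mechanism on toy tensors (computed inline so the snippet runs standalone): after the softmax, each row of the attention matrix is a probability distribution over the input positions.

```python
import math

import torch
import torch.nn.functional as F

# Toy input: batch of 1, three tokens, d_model = 4
torch.manual_seed(0)
Q = torch.randn(1, 3, 4)
K = torch.randn(1, 3, 4)
V = torch.randn(1, 3, 4)

scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(Q.size(-1))
weights = F.softmax(scores, dim=-1)
output = torch.matmul(weights, V)

# Each row of the attention matrix sums to 1 (up to float error)
print(weights.sum(dim=-1))
print(output.shape)  # torch.Size([1, 3, 4])
```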
Multi-Head Attention¶
Multi-head attention allows the model to track different types of relationships simultaneously:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Project and split into heads: [batch, heads, seq_len, d_k]
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Attention for all heads in parallel
        attention, _ = scaled_dot_product_attention(Q, K, V, mask)

        # Concatenate heads back to [batch, seq_len, d_model]
        attention = attention.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )
        return self.W_o(attention)
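The view/transpose bookkeeping is easy to get wrong, so it is worth checking on a toy tensor (sizes chosen here to match the original paper: d_model = 512, 8 heads of 64 dimensions each):

```python
import torch

batch, seq_len, d_model, num_heads = 2, 10, 512, 8
d_k = d_model // num_heads  # 64 dimensions per head

x = torch.randn(batch, seq_len, d_model)

# Split: [batch, seq_len, d_model] -> [batch, heads, seq_len, d_k]
heads = x.view(batch, seq_len, num_heads, d_k).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 8, 10, 64])

# Concatenating the heads reverses the split exactly
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
print(torch.equal(merged, x))  # True
```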
Positional Encoding¶
Since Transformers have no inherent notion of order, token positions must be encoded explicitly. The original paper uses sinusoidal functions:
def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1).float()
    # Geometric progression of frequencies, from wavelength 2*pi up to 10000*2*pi
    div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                         -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe
This encoding allows the model to distinguish positions and learn distance-dependent relationships between tokens.
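These properties can be spot-checked (the function is repeated inline so the snippet runs standalone): at position 0 the even dimensions are sin(0) = 0 and the odd dimensions are cos(0) = 1.

```python
import math

import torch

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                         -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)      # torch.Size([50, 16])
print(pe[0, 0::2])   # zeros: sin(0) in the even dimensions
print(pe[0, 1::2])   # ones: cos(0) in the odd dimensions
```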
Variants of the Transformer Architecture¶
BERT (Bidirectional Encoder Representations from Transformers)¶
BERT uses only the encoder part and trains bidirectionally using masked language modeling:
- Randomly masks 15% of the tokens in the input sequence
- Learns to predict masked tokens based on entire context
- Excellent for text understanding tasks (classification, NER, QA)
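A minimal sketch of the masking step (simplified: real BERT also replaces some of the selected tokens with random tokens or leaves them unchanged, and the `MASK_ID` below is an assumed vocabulary id). The label -100 is PyTorch's default ignore index for cross-entropy, so the loss is computed only on masked positions:

```python
import torch

MASK_ID = 103  # assumed [MASK] token id

def mask_tokens(input_ids, mask_prob=0.15):
    """Simplified masked-LM corruption: replace ~15% of tokens with [MASK]."""
    labels = input_ids.clone()
    # Bernoulli draw decides which positions the model must predict
    selected = torch.rand(input_ids.shape) < mask_prob
    labels[~selected] = -100          # ignored by cross-entropy loss
    corrupted = input_ids.clone()
    corrupted[selected] = MASK_ID
    return corrupted, labels

torch.manual_seed(0)
ids = torch.randint(1000, 2000, (1, 12))
corrupted, labels = mask_tokens(ids)
```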
GPT (Generative Pre-trained Transformer)¶
GPT uses only the decoder part with causal masking:
def create_causal_mask(seq_len):
    """Creates the mask for autoregressive generation"""
    # Lower-triangular matrix: position i may attend to positions 0..i
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask.unsqueeze(0).unsqueeze(0)  # [1, 1, seq_len, seq_len]
This mask ensures that position i can attend only to positions 0 through i, so the model's prediction of the next token never depends on future tokens.
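The effect is easy to visualize for a short sequence: the mask is lower triangular, and after the softmax every future position receives zero attention weight.

```python
import torch

seq_len = 4
mask = torch.tril(torch.ones(seq_len, seq_len))
print(mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])

# Masked positions get -1e9, which softmax turns into ~zero weight
scores = torch.zeros(seq_len, seq_len).masked_fill(mask == 0, -1e9)
weights = torch.softmax(scores, dim=-1)
print(weights[1])  # position 1 splits its attention between positions 0 and 1
```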
A Basic Transformer Implementation¶
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection and layer norm
        attn_output = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Feed-forward with residual connection and layer norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x
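For comparison, PyTorch ships an equivalent building block, `nn.TransformerEncoderLayer`, which also uses post-norm residuals and a ReLU feed-forward by default. A quick shape check (sizes chosen arbitrarily):

```python
import torch
import torch.nn as nn

# Built-in counterpart of the block above
layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048,
    dropout=0.1, batch_first=True,
)
x = torch.randn(2, 10, 512)   # [batch, seq_len, d_model]
out = layer(x)
print(out.shape)  # torch.Size([2, 10, 512])
```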
Advantages and Disadvantages¶
Advantages:¶
- Parallelization - unlike RNNs, all positions can be processed simultaneously
- Long dependencies - attention mechanism enables direct connection of distant positions
- Interpretability - attention weights provide insight into what the model focuses on
- Transfer learning - pre-trained models can be fine-tuned for specific tasks
Disadvantages:¶
- Computational complexity - O(n²) with respect to sequence length
- Memory requirements - attention matrix grows quadratically
- Data requirements - requires large amounts of training data for good performance
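The quadratic growth is easy to quantify: a float32 attention matrix takes seq_len × seq_len × 4 bytes per head, so quadrupling the context multiplies the memory by sixteen.

```python
# Memory of one float32 attention matrix per head
for seq_len in (512, 2048, 8192):
    mib = seq_len * seq_len * 4 / 2**20
    print(f"{seq_len:>5} tokens -> {mib:8.1f} MiB per head")
```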
Summary¶
The Transformer architecture represents a fundamental advance in sequence processing. Its key innovations - the self-attention mechanism and parallelizability - enabled the development of advanced models such as GPT, BERT and their successors. For practical work, it is important to understand that the different variants (encoder-only, decoder-only, encoder-decoder) suit different types of tasks. While the implementation can be complex, the principles are elegant and provide a solid foundation for understanding modern AI systems.