The attention mechanism is one of the most significant breakthroughs in artificial intelligence in recent years. It enables AI systems to focus selectively on the most relevant parts of their input, loosely analogous to how human attention works. Thanks to the attention mechanism, models such as GPT and BERT have emerged, which today power ChatGPT and other modern AI applications.
What is the Attention Mechanism?
The attention mechanism represents a revolutionary approach in neural networks that allows models to dynamically focus on relevant parts of input data. Instead of sequential processing like traditional RNNs or LSTMs, attention allows the model to “look at” all positions simultaneously and select the most important ones.
The basic idea is simple: when processing each element in a sequence, we compute importance scores for all other elements. These scores are then used as weights to create a contextually rich representation.
Mathematical Foundations
The attention mechanism can be formally described as a function that maps a query and a set of key–value pairs to an output: Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, where the softmax produces the attention weights α:
# Basic attention computation
import math

import torch

def attention_weights(query, keys):
    # Dot-product similarity between the query and every key
    scores = torch.matmul(query, keys.transpose(-2, -1))
    # Scale by sqrt(d_k) before the softmax
    weights = torch.softmax(scores / math.sqrt(keys.size(-1)), dim=-1)
    return weights

def attention(query, keys, values):
    weights = attention_weights(query, keys)
    # Weighted sum of values yields the context vector
    context = torch.matmul(weights, values)
    return context, weights
Scaling by the factor √d_k (where d_k is the dimension of keys) is critical for gradient stability at larger dimensions.
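The effect is easy to see with random vectors (an illustrative experiment with d_k = 512 and eight keys, not taken from any real model): without scaling, the dot products are large enough that the softmax collapses onto a single key:

```python
import math

import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(d_k)
keys = torch.randn(8, d_k)

scores = keys @ q  # dot products have standard deviation ~ sqrt(d_k)
unscaled = torch.softmax(scores, dim=-1)
scaled = torch.softmax(scores / math.sqrt(d_k), dim=-1)

print(unscaled.max().item())  # typically near 1.0: one key takes almost all the mass
print(scaled.max().item())    # a noticeably flatter distribution
```

A saturated softmax produces near-zero gradients for all but one position, which is exactly what the √d_k factor prevents.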
Self-Attention: Revolution in NLP
Self-attention represents a special case where query, keys, and values all come from the same input sequence. This approach allows each token to “communicate” with all other tokens in the sequence simultaneously.
import torch
import torch.nn as nn
import math

class SelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size, seq_len = x.size(0), x.size(1)

        # Projection to Q, K, V
        Q = self.w_q(x).view(batch_size, seq_len, self.num_heads, self.head_dim)
        K = self.w_k(x).view(batch_size, seq_len, self.num_heads, self.head_dim)
        V = self.w_v(x).view(batch_size, seq_len, self.num_heads, self.head_dim)

        # Transpose for multi-head attention
        Q = Q.transpose(1, 2)  # [batch, heads, seq_len, head_dim]
        K = K.transpose(1, 2)
        V = V.transpose(1, 2)

        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attention_weights = torch.softmax(scores, dim=-1)
        context = torch.matmul(attention_weights, V)

        # Concatenate heads
        context = context.transpose(1, 2).contiguous().view(
            batch_size, seq_len, self.d_model
        )
        return self.w_o(context)
Multi-Head Attention
Multi-head attention extends the basic mechanism by running several attention “heads” in parallel, each with its own learned projections. Each head can focus on a different aspect of the input: syntactic relationships, semantic similarities, or positional information.
Key advantages of the multi-head approach:
- Parallelization: Computations across heads are independent
- Diversification: Different heads capture distinct patterns
- Capacity: Increased model expressivity without dramatic parameter growth
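The bookkeeping behind the heads can be illustrated in isolation (a standalone sketch whose shapes mirror the view/transpose steps used in the SelfAttention code above):

```python
import torch

batch, seq_len, d_model, num_heads = 2, 10, 64, 8
head_dim = d_model // num_heads

x = torch.randn(batch, seq_len, d_model)

# Split: [batch, seq_len, d_model] -> [batch, heads, seq_len, head_dim]
heads = x.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 8, 10, 8])

# Merge: the exact inverse, recovering the original tensor
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
print(torch.equal(merged, x))  # True
```

Note that the split adds no parameters: the same d_model-dimensional projection is simply partitioned into num_heads independent slices.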
Practical Implementation with Optimizations
In production systems, it’s critical to optimize attention computations. Here’s an extended implementation with masking and dropout support:
class OptimizedMultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.qkv = nn.Linear(d_model, 3 * d_model)  # Combined projection
        self.output = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        batch_size, seq_len = x.shape[:2]

        # Efficient Q, K, V computation at once
        qkv = self.qkv(x).chunk(3, dim=-1)
        q, k, v = [
            tensor.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
            for tensor in qkv
        ]

        # Attention scores
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)

        # Apply mask (for padding, causal attention, etc.)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attention_weights = torch.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)

        # Apply attention to values
        out = torch.matmul(attention_weights, v)
        out = out.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
        return self.output(out), attention_weights
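What the mask actually does to the softmax can be seen in a small standalone example (illustrative random scores, with a padding mask over the last two positions):

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)

# 1 = real token, 0 = padding; the last two positions are padding
mask = torch.tensor([1, 1, 1, 0, 0])

masked = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(masked, dim=-1)

print(weights[:, 3:])  # the padded columns receive exactly zero weight
```

Setting a score to -inf before the softmax is equivalent to deleting that position: exp(-inf) = 0, so the remaining weights still sum to 1.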
Attention in Transformer Architecture
The Transformer architecture uses the attention mechanism in three ways:
- Encoder Self-Attention: Allows each word in the sequence to relate to all other words
- Decoder Self-Attention: With causal masking for autoregressive generation
- Cross-Attention: Connects encoder and decoder representations
# Example of causal masking for decoder
def create_causal_mask(seq_len):
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
    return mask == 0  # True for allowed positions
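Printing a small instance makes the pattern visible (the helper is restated here so the snippet runs on its own): each position may attend only to itself and earlier positions.

```python
import torch

def create_causal_mask(seq_len):
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
    return mask == 0  # True for allowed positions

print(create_causal_mask(4).long())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]])
```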
# Usage in decoder layer
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.self_attention = OptimizedMultiHeadAttention(d_model, num_heads)
        self.cross_attention = OptimizedMultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, encoder_output, causal_mask=None):
        # Self-attention with causal masking
        attn_out, _ = self.self_attention(x, causal_mask)
        x = self.norm1(x + attn_out)

        # Cross-attention (shortened: a full implementation would take
        # queries from x and keys/values from encoder_output)
        cross_out, _ = self.cross_attention(x)
        x = self.norm2(x + cross_out)

        # Feed-forward
        ff_out = self.feed_forward(x)
        return self.norm3(x + ff_out)
Performance Aspects and Optimizations
The attention mechanism has quadratic complexity O(n²) with respect to sequence length, which presents a challenge for long sequences. Modern approaches include:
- Flash Attention: Memory-efficient implementation with kernel fusion
- Sparse Attention: Limitation to local or structured patterns
- Linear Attention: Approximation with linear complexity
- Gradient Checkpointing: Trade-off between memory and computational time
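A back-of-the-envelope calculation shows why the quadratic term hurts: the float32 score matrix alone can dominate memory (the batch, head, and length figures below are hypothetical, chosen only for illustration):

```python
# Bytes needed for the attention score matrix alone (float32)
def attn_matrix_bytes(batch, heads, seq_len, bytes_per_el=4):
    return batch * heads * seq_len * seq_len * bytes_per_el

gb = attn_matrix_bytes(8, 16, 8192) / 1e9
print(f"{gb:.1f} GB")  # 34.4 GB for a single layer's scores
```

Doubling the sequence length quadruples this figure, which is the pressure that Flash Attention and sparse patterns relieve.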
# Example of sparse attention pattern
def create_local_attention_mask(seq_len, window_size=128):
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        start = max(0, i - window_size // 2)
        end = min(seq_len, i + window_size // 2 + 1)
        mask[i, start:end] = True
    return mask

# Usage for long sequences ('model' stands in for any module exposing
# an attention layer with the mask interface shown above)
local_mask = create_local_attention_mask(4096, window_size=256)
attention_output, weights = model.attention(x, mask=local_mask)
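A quick sanity check (standalone, restating the helper above) confirms the savings: with a window of 64 over 1,024 positions, only a few percent of the full n² score entries survive:

```python
import torch

def create_local_attention_mask(seq_len, window_size=128):
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        start = max(0, i - window_size // 2)
        end = min(seq_len, i + window_size // 2 + 1)
        mask[i, start:end] = True
    return mask

mask = create_local_attention_mask(1024, window_size=64)
print(f"{mask.float().mean().item():.3f}")  # fraction of entries kept
```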
Summary
The attention mechanism represents a fundamental breakthrough in neural network architecture that enabled the emergence of modern language models. Its ability to dynamically focus on relevant parts of input while maintaining parallelizability makes it a key building block of contemporary AI systems. For practical deployment, it’s important to understand both the basic principles and optimization techniques for efficient scaling to production tasks.