Choosing the right optimizer is crucial for successful neural network training. SGD, Adam, and AdamW are the most popular algorithms, each with its own strengths and weaknesses for different types of tasks.
What Are Optimizers and Why Are They Important?
Optimizers drive the training of every neural network. Their job is to find parameter values that minimize the loss function. Think of it as navigating mountainous terrain in search of the lowest point (the global minimum): each optimizer uses a different strategy to get there.
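The descent idea can be sketched in a few lines of plain Python. The quadratic toy loss, learning rate, and step count below are illustrative choices, not anything library-specific:

```python
# Plain gradient descent on a toy loss f(x) = (x - 3)**2,
# whose gradient is f'(x) = 2 * (x - 3) and whose minimum is at x = 3.
def gradient_descent(lr=0.1, steps=100):
    x = 0.0  # starting parameter value
    for _ in range(steps):
        grad = 2 * (x - 3)  # gradient of the loss at the current x
        x -= lr * grad      # step in the direction of steepest descent
    return x

print(gradient_descent())  # converges close to 3.0
```

Every optimizer below is a variation on this loop: they differ in how the step direction and step size are computed.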
In practice, you’ll most commonly encounter three main approaches: classic Stochastic Gradient Descent (SGD), adaptive Adam optimizer, and its improved version AdamW. Each has its place, and the right choice can mean the difference between a model that converges in hours and one that trains for days.
SGD: Simple and Proven
Stochastic Gradient Descent is the most basic optimizer, and also the one most sensitive to hyperparameters. It updates parameters by stepping in the direction of steepest descent of the loss. Its main advantages are simplicity and predictable behavior.
```python
import torch
import torch.optim as optim

# Initialize SGD optimizer
optimizer = optim.SGD(model.parameters(),
                      lr=0.01,            # learning rate
                      momentum=0.9,       # momentum for stability
                      weight_decay=1e-4)  # L2 regularization

# Training loop
for batch in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(batch.input), batch.target)
    loss.backward()
    optimizer.step()
```
SGD requires careful learning rate tuning: too high a value causes oscillation around the minimum, while too low a value slows convergence. The momentum parameter helps the optimizer escape shallow local minima and stabilizes training. In practice, start with lr=0.01 and momentum=0.9, then experiment.
When to Use SGD
- Computer vision models – SGD with momentum often outperforms Adam on CNN architectures
- Long training – given enough epochs, SGD often generalizes better than adaptive methods
- Large batch size – SGD scales better with large batch sizes than adaptive optimizers
Adam: Adaptive and User-Friendly
Adam (Adaptive Moment Estimation) combines the advantages of momentum with an adaptive learning rate for each parameter. It maintains exponentially decaying averages of past gradients (the first moment) and their squared values (the second moment), which lets it adapt the step size automatically.
```python
# Adam optimizer with typical parameters
optimizer = optim.Adam(model.parameters(),
                       lr=1e-3,             # usually 0.001
                       betas=(0.9, 0.999),  # exponential decay rates
                       eps=1e-8,            # numerical stability
                       weight_decay=0)      # L2 regularization

# Learning rate scheduling with Adam
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5
)

for epoch in range(epochs):
    train_loss = train_epoch(model, optimizer, dataloader)
    val_loss = validate(model, val_dataloader)
    scheduler.step(val_loss)
```
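The first- and second-moment bookkeeping described above can be written out directly. This is an illustrative single-parameter sketch of the Adam update rule on a toy loss, not PyTorch's actual implementation:

```python
import math

def adam_step(x, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter x of the toy loss f(x) = x**2."""
    g = 2 * x                            # gradient of f at x
    m = beta1 * m + (1 - beta1) * g      # first moment: decaying mean of gradients
    v = beta2 * v + (1 - beta2) * g * g  # second moment: decaying mean of squared gradients
    m_hat = m / (1 - beta1 ** t)         # bias correction for the zero-initialized moments
    v_hat = v / (1 - beta2 ** t)
    x -= lr * m_hat / (math.sqrt(v_hat) + eps)  # step scaled per parameter
    return x, m, v

x, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    x, m, v = adam_step(x, m, v, t)
# x ends near the minimum at 0
```

Because the step is divided by sqrt(v_hat), parameters with consistently large gradients take smaller steps and rarely-updated parameters take larger ones.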
Adam is excellent for quick starts – it often converges faster than SGD in the first epochs. Its adaptive nature means it works reasonably well even with default parameters, making it a popular choice for prototyping.
The Generalization Problem
However, Adam has one fundamental problem: it can lead to worse generalization than SGD. The reason is its tendency to “get stuck” in sharp minima that fit training data well but generalize poorly to new data.
AdamW: Best of Both Worlds
AdamW (Adam with decoupled weight decay) fixes the main problem of the standard Adam optimizer by separating weight decay from the gradient-based update. The result is better regularization and often better generalization.
```python
# AdamW with recommended parameters
optimizer = optim.AdamW(model.parameters(),
                        lr=1e-3,
                        betas=(0.9, 0.999),
                        eps=1e-8,
                        weight_decay=0.01)  # stronger weight decay

# Warmup schedule often used with AdamW
def get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps):
    def lr_lambda(current_step):
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        return max(0.0, float(num_training_steps - current_step) /
                   float(max(1, num_training_steps - num_warmup_steps)))
    return optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```
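To see what this schedule actually produces, the lr_lambda logic can be traced in plain Python; the warmup and total step counts below are arbitrary illustrative values:

```python
def warmup_multiplier(step, num_warmup_steps=10, num_training_steps=100):
    """Learning-rate multiplier from the linear warmup/decay schedule above."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)  # linear ramp up to 1.0
    return max(0.0, (num_training_steps - step) /
               max(1, num_training_steps - num_warmup_steps))  # linear decay to 0.0

for step in (0, 5, 10, 55, 100):
    print(step, warmup_multiplier(step))
# multiplier rises 0.0 -> 1.0 during warmup, then decays linearly back to 0.0
```

The warmup phase keeps early steps small while Adam's moment estimates are still unreliable, which is why it pairs so well with AdamW on transformers.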
AdamW has become the standard in NLP, especially when training transformer models. Its combination of fast convergence with better regularization makes it a universal choice for most modern deep learning tasks.
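The "decoupled" part is easiest to see side by side. This is an illustrative contrast for a single scalar weight, not PyTorch's actual code; `denom` stands in for the adaptive denominator built from the second moment:

```python
def adam_l2_update(w, g, lr, wd, denom):
    # Classic Adam + L2: decay is folded into the gradient, so it gets
    # rescaled by the adaptive denominator like everything else.
    g = g + wd * w
    return w - lr * g / denom

def adamw_update(w, g, lr, wd, denom):
    # AdamW: gradient step and decay step are decoupled; decay shrinks
    # the weight directly, independent of gradient statistics.
    return w - lr * g / denom - lr * wd * w
```

With a large `denom` (a history of large gradients), Adam's L2 penalty is effectively weakened for exactly the weights that grew the most, while AdamW applies the same shrinkage regardless.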
Practical Tips for Optimizer Selection
By Task Type
Computer Vision: Start with SGD + momentum. For ResNet and similar architectures, you’ll often achieve better results than with Adam.
NLP and Transformers: AdamW is the clear choice. Combine with warmup scheduling for optimal results.
Prototyping: Adam for quick experiments, AdamW for final models.
Hyperparameter Tuning
```python
# Typical ranges for grid search
sgd_params = {
    'lr': [0.1, 0.01, 0.001],
    'momentum': [0.9, 0.95, 0.99],
    'weight_decay': [1e-4, 1e-3, 5e-3]
}

adam_params = {
    'lr': [1e-2, 1e-3, 1e-4],
    'weight_decay': [0, 1e-4, 1e-3]  # for Adam
}

adamw_params = {
    'lr': [1e-3, 5e-4, 1e-4],
    'weight_decay': [0.01, 0.05, 0.1]  # higher for AdamW
}
```
Don’t forget about learning rate scheduling – it can significantly improve convergence for all optimizers. Cosine annealing or step decay are good starting points.
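Cosine annealing, for instance, follows a simple closed form; this is the formula behind PyTorch's CosineAnnealingLR sketched in plain Python, with illustrative values for eta_max, eta_min, and T_max:

```python
import math

def cosine_annealing_lr(t, eta_max=0.1, eta_min=0.0, T_max=100):
    """Learning rate at step t under cosine annealing."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T_max))

# starts at eta_max, ends at eta_min, halfway in between at T_max / 2
print(cosine_annealing_lr(0), cosine_annealing_lr(50), cosine_annealing_lr(100))
```

The smooth decay avoids the abrupt jumps of step decay, which can briefly destabilize training.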
Monitoring and Debugging¶
Monitor not just the loss, but also gradient and parameter norms. Exploding gradients indicate too high a learning rate, while vanishing gradients may indicate too low a learning rate or an architectural problem.
```python
# Gradient monitoring
total_norm = 0
for p in model.parameters():
    if p.grad is not None:
        param_norm = p.grad.data.norm(2)
        total_norm += param_norm.item() ** 2
total_norm = total_norm ** 0.5
print(f'Gradient norm: {total_norm:.4f}')
```
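When the norm does explode, gradient clipping is the usual remedy; in PyTorch that is torch.nn.utils.clip_grad_norm_, whose core rescaling logic looks roughly like this (a plain-Python sketch over a list of scalar gradients):

```python
def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale gradients so their global L2 norm is at most max_norm."""
    total_norm = sum(g ** 2 for g in grads) ** 0.5
    if total_norm > max_norm:
        scale = max_norm / (total_norm + eps)
        grads = [g * scale for g in grads]
    return grads

print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))  # norm 5.0 scaled down to ~1.0
```

In a real training loop you would call clip_grad_norm_ between loss.backward() and optimizer.step().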
Summary
Choosing the right optimizer depends on your specific task. SGD with momentum remains king for computer vision tasks with long training, Adam is great for quick prototyping, while AdamW combines Adam’s speed with better generalization and has become the standard for transformer models. The key to success isn’t just optimizer selection, but also proper learning rate setting and scheduling strategy.