Overfitting is one of the most common problems when training neural networks: a model that learns the training data too perfectly loses the ability to generalize to new data. Regularization techniques such as dropout and L1/L2 weight penalties are proven tools for combating it.
What is Overfitting and Why It’s Harmful¶
Overfitting is one of the most common problems in machine learning. A model learns training data too well – it memorizes every detail including random noise, but then fails on new data. Imagine a student who memorizes a textbook by heart but doesn’t understand the principles and fails the exam.
Regularization is a set of techniques that prevent this problem. They add “friction” to training that keeps the model from over-specializing on the training data. Let’s look at the three most important techniques.
Dropout – Random “Turning Off” of Neurons¶
Dropout is an elegant technique that randomly deactivates a fraction of the neurons during training. It’s like randomly closing your eyes or plugging your ears while learning – you force the brain to rely on varying combinations of inputs.
How Dropout Works¶
During the forward pass, we randomly set selected neurons to zero with probability p (typically 0.2–0.5). The remaining neurons are scaled by a factor of 1/(1 − p) to preserve the overall signal strength – the so-called “inverted dropout” formulation, which is what PyTorch implements.
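This scaling can be sketched by hand to see why the expected activation is preserved; the helper name below is just illustrative:

```python
import torch

def inverted_dropout(x: torch.Tensor, p: float = 0.3) -> torch.Tensor:
    """Zero each element with probability p, scale survivors by 1/(1-p)."""
    mask = (torch.rand_like(x) >= p).float()  # 1 = keep, 0 = drop
    return x * mask / (1 - p)

torch.manual_seed(0)
x = torch.ones(10_000)
out = inverted_dropout(x, p=0.3)
print(out.mean())  # close to 1.0, so no rescaling is needed at inference time
```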
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, dropout_rate=0.3):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(dropout_rate)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)  # Dropout after activation
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x

# Usage
model = SimpleNet(784, 256, 10, dropout_rate=0.4)
model.train()  # Important: dropout is only active in train mode
It’s crucial to remember that dropout is only applied during training. For inference we call model.eval(), which turns dropout into an identity operation.
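The train/eval difference is easy to verify on a bare nn.Dropout module:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()    # training mode: each element is zeroed with probability 0.5, survivors scaled by 2
print(drop(x))  # elements are either 0.0 or 2.0

drop.eval()     # inference mode: dropout is the identity
print(drop(x))  # the input comes back unchanged
```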
L1 and L2 Regularization – Penalizing Large Weights¶
L1 and L2 regularization add penalties to the loss function for model weight magnitude. The principle is simple: large weights often lead to overfitting, so we “penalize” them.
L2 Regularization (Weight Decay)¶
L2 regularization adds the term λ∑w² to the loss function, where λ is the regularization coefficient. It penalizes large weights quadratically but doesn’t drive them all the way to zero. Note that with adaptive optimizers such as Adam, the weight_decay parameter interacts with the adaptive learning rates and is not exactly equivalent to adding an L2 term to the loss; optim.AdamW implements decoupled weight decay and is often the better choice in practice.
import torch.optim as optim

# L2 regularization via weight_decay in the optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

# Or manually in the loss function
def l2_regularization(model, lambda_reg=1e-4):
    l2_reg = 0.0
    for param in model.parameters():
        l2_reg += torch.norm(param, p=2) ** 2
    return lambda_reg * l2_reg

# Usage in training
criterion = nn.CrossEntropyLoss()
output = model(inputs)
loss = criterion(output, targets) + l2_regularization(model)
L1 Regularization – Creating Sparse Models¶
L1 regularization uses λ∑|w| and has an interesting property – it can zero out less important weights, creating sparse models.
def l1_regularization(model, lambda_reg=1e-4):
    l1_reg = 0.0
    for param in model.parameters():
        l1_reg += torch.norm(param, p=1)
    return lambda_reg * l1_reg

# Combined L1 + L2 regularization (Elastic Net)
def elastic_net_regularization(model, l1_lambda=1e-4, l2_lambda=1e-4):
    l1_reg = sum(torch.norm(p, p=1) for p in model.parameters())
    l2_reg = sum(torch.norm(p, p=2) ** 2 for p in model.parameters())
    return l1_lambda * l1_reg + l2_lambda * l2_reg
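The sparsity-inducing effect can be observed on a toy regression problem. The sketch below is illustrative only – the data, the deliberately strong λ, and the thresholds are all chosen to make the effect visible:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: 20 features, but the target depends only on the first 3
X = torch.randn(2048, 20)
true_w = torch.zeros(20)
true_w[:3] = torch.tensor([2.0, -1.5, 1.0])
y = X @ true_w + 0.01 * torch.randn(2048)

model = nn.Linear(20, 1, bias=False)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

l1_lambda = 0.5  # deliberately strong, to make the sparsity obvious
for _ in range(500):
    optimizer.zero_grad()
    pred = model(X).squeeze(-1)
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    loss = criterion(pred, y) + l1_lambda * l1_penalty
    loss.backward()
    optimizer.step()

weights = model.weight.detach().squeeze(0)
print("near-zero weights:", (weights.abs() < 0.05).sum().item())  # most of the 17 noise features
print("large weights:", (weights.abs() > 0.5).sum().item())       # the 3 informative features
```

Plain subgradient descent only pushes the irrelevant weights into a small band around zero; driving them exactly to zero requires proximal methods, which are beyond the scope of this article.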
Practical Tips for Using Regularization¶
Hyperparameter Tuning¶
Regularization strength must be tuned carefully: too weak a penalty won’t help, while one that is too strong will “suffocate” the model and prevent it from learning at all.
- Dropout rate: Start with 0.2-0.3 for hidden layers, 0.1-0.2 for the input layer
- Weight decay: Typically 1e-6 to 1e-4, depending on dataset size
- L1 regularization: Usually weaker than L2; start with 1e-5
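In practice such values are usually found with a logarithmic sweep against a validation set. A minimal sketch on a synthetic task – the dataset, model size, and candidate values here are illustrative, not recommendations:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny synthetic regression task, split into train and validation
X = torch.randn(400, 10)
y = X[:, :2].sum(dim=1, keepdim=True) + 0.3 * torch.randn(400, 1)
X_train, y_train, X_val, y_val = X[:300], y[:300], X[300:], y[300:]

def train_and_evaluate(weight_decay: float, epochs: int = 200) -> float:
    """Train a small MLP with the given weight decay; return validation loss."""
    torch.manual_seed(0)  # identical init for every candidate, for a fair comparison
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=weight_decay)
    criterion = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(X_train), y_train)
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        return criterion(model(X_val), y_val).item()

candidates = [0.0, 1e-6, 1e-5, 1e-4, 1e-3]  # logarithmic grid
results = {wd: train_and_evaluate(wd) for wd in candidates}
best_wd = min(results, key=results.get)
print(results)
print("best weight_decay:", best_wd)
```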
Combining Techniques¶
class RegularizedNet(nn.Module):
    def __init__(self, input_size, hidden_sizes, output_size, dropout_rates):
        super().__init__()
        self.layers = nn.ModuleList()
        self.dropouts = nn.ModuleList()

        # Hidden layers
        prev_size = input_size
        for hidden_size, dropout_rate in zip(hidden_sizes, dropout_rates):
            self.layers.append(nn.Linear(prev_size, hidden_size))
            self.dropouts.append(nn.Dropout(dropout_rate))
            prev_size = hidden_size

        # Output layer (no dropout)
        self.output_layer = nn.Linear(prev_size, output_size)
        self.relu = nn.ReLU()

    def forward(self, x):
        for layer, dropout in zip(self.layers, self.dropouts):
            x = self.relu(layer(x))
            x = dropout(x)
        return self.output_layer(x)

# Model with gradually increasing dropout
model = RegularizedNet(
    input_size=784,
    hidden_sizes=[512, 256, 128],
    output_size=10,
    dropout_rates=[0.2, 0.3, 0.4],  # Higher dropout in deeper layers
)

# Optimizer with weight decay
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
Monitoring Effectiveness¶
Monitor the difference between training and validation loss. Regularization works when this difference decreases without significantly worsening validation loss.
# Monitoring during training
train_losses = []
val_losses = []

for epoch in range(num_epochs):
    # Training
    model.train()
    train_loss = 0.0
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        output = model(inputs)
        loss = criterion(output, targets)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    train_loss /= len(train_loader)

    # Validation
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for inputs, targets in val_loader:
            output = model(inputs)
            val_loss += criterion(output, targets).item()
    val_loss /= len(val_loader)

    train_losses.append(train_loss)
    val_losses.append(val_loss)
    print(f"Epoch {epoch}: Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")
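One common way to act on this signal is early stopping: halt training once validation loss stops improving for several epochs in a row. The helper below is a hand-rolled sketch (not a PyTorch built-in), demonstrated on a made-up loss curve:

```python
class EarlyStopping:
    """Stop training when validation loss hasn't improved for `patience` epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.counter = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience

# Demo on a fabricated validation curve: it improves, then plateaus
stopper = EarlyStopping(patience=3)
curve = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.73, 0.74]
for epoch, vl in enumerate(curve):
    if stopper.step(vl):
        print(f"stopping at epoch {epoch}")
        break
```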
Summary¶
Regularization is essential for training robust models. Dropout randomly deactivates neurons and forces the model to learn redundant representations. L1/L2 regularization penalizes large weights and promotes simpler models. The key to success is careful hyperparameter tuning and combining techniques. Remember: a slightly underfit model that generalizes beats a model that fits the training data perfectly but fails in practice.