Transfer Learning is a technique that allows you to leverage knowledge learned on one task to solve another, similar problem. Instead of training a model from scratch, you can use pre-trained models and adapt them to your specific needs.
What is Transfer Learning¶
Transfer Learning is one of the most effective techniques in modern machine learning. Instead of training a model from scratch, we reuse knowledge already learned on large datasets and adapt it to our specific problem. This approach saves time and computational resources, and often achieves better results than training from the beginning.
The basic idea is simple: a model that has learned to recognize general patterns in data (such as edges, textures, or linguistic structures) can apply this knowledge to related tasks. We then only need to fine-tune the last layers for our specific domain.
Types of Transfer Learning¶
We distinguish several main approaches:
- Feature Extraction - freeze the weights of the pre-trained model and use it as a feature extractor
- Fine-tuning - gradually unfreeze and retrain some layers on our data
- Domain Adaptation - adapt the model to a new type of data (e.g., from photographs to drawings)
Feature Extraction in Practice¶
The simplest approach uses a pre-trained model as a black box for feature extraction:
```python
import torch
import torchvision.models as models
from torch import nn

# Load pre-trained ResNet (`weights=` replaces the deprecated `pretrained=True`)
base_model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze all parameters
for param in base_model.parameters():
    param.requires_grad = False

# Replace the classifier for our task (e.g., 10 classes);
# a freshly created layer is trainable (requires_grad=True) by default
base_model.fc = nn.Linear(base_model.fc.in_features, 10)

# Only the new layer will be trained
optimizer = torch.optim.Adam(base_model.fc.parameters(), lr=0.001)
```
Gradual Fine-tuning¶
A more sophisticated approach gradually unfreezes layers for retraining:
```python
class TransferModel(nn.Module):
    def __init__(self, num_classes, freeze_layers=True):
        super().__init__()
        # `weights=` replaces the deprecated `pretrained=True`
        self.backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        if freeze_layers:
            # Freeze the early layers
            for param in self.backbone.layer1.parameters():
                param.requires_grad = False
            for param in self.backbone.layer2.parameters():
                param.requires_grad = False
        # Replace the classifier
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_classes)

    def unfreeze_layers(self, layer_names):
        """Gradually unfreeze the named backbone layers."""
        for name in layer_names:
            layer = getattr(self.backbone, name)
            for param in layer.parameters():
                param.requires_grad = True


model = TransferModel(num_classes=10)

# After a few epochs we can unfreeze additional layers
model.unfreeze_layers(['layer2', 'layer3'])
```
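One practical caveat when unfreezing mid-training: an optimizer only updates the parameters it was given at construction, so after unfreezing you typically rebuild it (or add the new parameters to its param groups). A minimal sketch with a hypothetical stand-in network:

```python
import torch
from torch import nn

# Toy stand-in for a partially frozen model
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
for param in model[0].parameters():
    param.requires_grad = False  # "frozen backbone"

def rebuild_optimizer(model, lr=1e-4):
    # Collect only the parameters that are currently trainable
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)

optimizer = rebuild_optimizer(model)   # covers only the head
for param in model[0].parameters():    # unfreeze the first layer
    param.requires_grad = True
optimizer = rebuild_optimizer(model)   # now covers every layer
```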
Transfer Learning for NLP¶
In the field of natural language processing, transfer learning is even more important. Models like BERT, GPT, or RoBERTa are trained on massive text corpora and can capture complex linguistic patterns.
Fine-tuning BERT for Classification¶
```python
from transformers import BertForSequenceClassification, BertTokenizer
from transformers import TrainingArguments, Trainer

# Load pre-trained BERT
model = BertForSequenceClassification.from_pretrained(
    'bert-base-multilingual-cased',
    num_labels=3  # e.g., sentiment: positive, negative, neutral
)
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

# Data preparation (assumes a Hugging Face `datasets.Dataset`
# with a 'text' column and a 'label' column)
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        padding=True,
        max_length=512
    )

train_dataset = train_dataset.map(tokenize_function, batched=True)

# Training setup
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    warmup_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,  # a held-out split, tokenized the same way
)
trainer.train()
```
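The `warmup_steps=500` setting follows the linear-warmup schedule that `Trainer` uses by default (`lr_scheduler_type='linear'`): the learning rate climbs linearly from 0 to `learning_rate` over the warmup steps, then decays linearly back to 0. The same curve in plain Python, with a hypothetical `total_steps` of 5000:

```python
def linear_schedule(step, base_lr=2e-5, warmup_steps=500, total_steps=5000):
    """Linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(linear_schedule(0))     # 0.0 at the very start
print(linear_schedule(500))   # peak of 2e-5 right after warmup
print(linear_schedule(5000))  # 0.0 at the end of training
```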
Best Practices¶
Learning Rate Selection¶
When fine-tuning, it’s crucial to properly set the learning rate. Generally:
- For new layers: higher learning rate (1e-3 to 1e-4)
- For pre-trained layers: lower learning rate (1e-5 to 1e-6)
- Gradual reduction with continued training
```python
# Differentiated learning rates for different parts of the model
# (assumes the model exposes its backbone as `model.backbone`
# and its classification head as `model.fc`)
def get_optimizer_grouped_parameters(model, backbone_lr=1e-5, head_lr=1e-3):
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {   # backbone weights: low LR, with weight decay
            "params": [p for n, p in model.backbone.named_parameters()
                       if not any(nd in n for nd in no_decay)],
            "weight_decay": 0.01,
            "lr": backbone_lr,
        },
        {   # backbone biases and LayerNorm weights: low LR, no weight decay
            "params": [p for n, p in model.backbone.named_parameters()
                       if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
            "lr": backbone_lr,
        },
        {   # freshly initialized head: higher LR
            "params": model.fc.parameters(),
            "lr": head_lr,
        },
    ]
    return torch.optim.AdamW(optimizer_grouped_parameters)
```
Data Augmentation and Regularization¶
With smaller datasets, it’s important to prevent overfitting:
```python
import torchvision.transforms as transforms

# Augmentation for computer vision
transform = transforms.Compose([
    transforms.RandomRotation(15),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),  # Normalize expects a tensor, not a PIL image
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```
```python
# Dropout in a custom classification head
class FineTunedModel(nn.Module):
    def __init__(self, base_model, num_classes):
        super().__init__()
        in_features = base_model.fc.in_features
        base_model.fc = nn.Identity()  # keep the backbone as a pure feature extractor
        self.backbone = base_model
        self.dropout = nn.Dropout(0.3)
        self.classifier = nn.Linear(in_features, num_classes)

    def forward(self, x):
        features = self.backbone(x)  # (batch, in_features), already pooled and flattened
        dropped = self.dropout(features)
        return self.classifier(dropped)
```
Practical Tips for Successful Transfer¶
Domain similarity: The more similar the source and target domains are, the better results we can expect. A model trained on general photographs will adapt better to medical images than to satellite data.
Dataset size: For small datasets (hundreds of samples), feature extraction is a safer choice. For larger datasets (thousands of samples), we can experiment with fine-tuning.
Gradual unfreezing: Instead of unfreezing all layers at once, we unfreeze them gradually, from the top (task-specific) layers down:
```python
def gradual_unfreeze_schedule(model, optimizer, epoch):
    """Gradually unfreeze layers (top to bottom) as training progresses."""
    if epoch == 5:
        # From epoch 5, unfreeze the top backbone layers
        for param in model.backbone.layer4.parameters():
            param.requires_grad = True
    if epoch == 10:
        # From epoch 10, additional layers
        for param in model.backbone.layer3.parameters():
            param.requires_grad = True
    # Halve the learning rate at each unfreeze milestone (the unfrozen
    # parameters must already be present in the optimizer's param groups)
    if epoch in (5, 10):
        for param_group in optimizer.param_groups:
            param_group['lr'] *= 0.5
```
Summary¶
Transfer Learning represents a fundamental change in the approach to machine learning. Instead of training models from scratch, we leverage collective “knowledge” stored in pre-trained models. The key to success is the proper choice of strategy (feature extraction vs. fine-tuning), careful setting of learning rates for different parts of the model, and a gradual approach to unfreezing layers. With the growth of pre-trained models such as foundation models, transfer learning is becoming an even more important tool for efficient AI application development.