Transfer Learning is a technique that allows you to leverage knowledge learned on one task to solve another, similar problem. Instead of training a model from scratch, you can use pre-trained models and adapt them to your specific needs.
What is Transfer Learning¶
Transfer Learning is one of the most effective techniques in modern machine learning. Instead of training a model from scratch, we reuse knowledge already learned on large datasets and adapt it to our specific problem. This approach saves time and computational resources, and often achieves better results than training from the beginning.
The basic idea is simple: a model that has learned to recognize general patterns in data (such as edges, textures, or linguistic structures) can apply this knowledge to related tasks. We then only need to fine-tune the last layers for our specific domain.
Types of Transfer Learning¶
We distinguish several main approaches:
- Feature Extraction - freeze the weights of the pre-trained model and use it as a feature extractor
- Fine-tuning - gradually unfreeze and retrain some layers on our data
- Domain Adaptation - adapt the model to a new type of data (e.g., from photographs to drawings)
Feature Extraction in Practice¶
The simplest approach uses a pre-trained model as a black box for feature extraction:
```python
import torch
import torchvision.models as models
from torch import nn

# Load pre-trained ResNet (`weights=` replaces the deprecated `pretrained=True`)
base_model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze all parameters
for param in base_model.parameters():
    param.requires_grad = False

# Replace the classifier for our task (e.g., 10 classes);
# a freshly created layer is trainable (requires_grad=True) by default
base_model.fc = nn.Linear(base_model.fc.in_features, 10)

# Only the new layer will be trained
optimizer = torch.optim.Adam(base_model.fc.parameters(), lr=0.001)
```
Gradual Fine-tuning¶
A more sophisticated approach gradually unfreezes layers for retraining:
```python
class TransferModel(nn.Module):
    def __init__(self, num_classes, freeze_layers=True):
        super().__init__()
        # `weights=` replaces the deprecated `pretrained=True`
        self.backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        if freeze_layers:
            # Freeze the early layers
            for param in self.backbone.layer1.parameters():
                param.requires_grad = False
            for param in self.backbone.layer2.parameters():
                param.requires_grad = False
        # Replace the classifier
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_classes)

    def unfreeze_layers(self, layer_names):
        """Gradually unfreeze the named backbone layers."""
        for name in layer_names:
            layer = getattr(self.backbone, name)
            for param in layer.parameters():
                param.requires_grad = True


model = TransferModel(num_classes=10)

# After a few epochs we can unfreeze additional layers
model.unfreeze_layers(['layer2', 'layer3'])
```
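One practical caveat when unfreezing mid-training: an optimizer only updates the parameters it was given at construction, so after unfreezing you typically rebuild it (or add the new parameters to its param groups). A minimal sketch with a hypothetical stand-in network:

```python
import torch
from torch import nn

# Toy stand-in for a partially frozen model
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
for param in model[0].parameters():
    param.requires_grad = False  # "frozen backbone"

def rebuild_optimizer(model, lr=1e-4):
    # Collect only the parameters that are currently trainable
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)

optimizer = rebuild_optimizer(model)   # covers only the head
for param in model[0].parameters():    # unfreeze the first layer
    param.requires_grad = True
optimizer = rebuild_optimizer(model)   # now covers every layer
```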
Transfer Learning for NLP¶
In the field of natural language processing, transfer learning is even more important. Models like BERT, GPT, or RoBERTa are trained on massive text corpora and can capture complex linguistic patterns.
Fine-tuning BERT for Classification¶
```python
from transformers import BertForSequenceClassification, BertTokenizer
from transformers import TrainingArguments, Trainer

# Load pre-trained BERT
model = BertForSequenceClassification.from_pretrained(
    'bert-base-multilingual-cased',
    num_labels=3  # e.g., sentiment: positive, negative, neutral
)
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

# Data preparation (assumes a Hugging Face `datasets.Dataset`
# with a 'text' column and a 'label' column)
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        padding=True,
        max_length=512
    )

train_dataset = train_dataset.map(tokenize_function, batched=True)

# Training setup
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    warmup_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,  # a held-out split, tokenized the same way
)
trainer.train()
```
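The `warmup_steps=500` setting follows the linear-warmup schedule that `Trainer` uses by default (`lr_scheduler_type='linear'`): the learning rate climbs linearly from 0 to `learning_rate` over the warmup steps, then decays linearly back to 0. The same curve in plain Python, with a hypothetical `total_steps` of 5000:

```python
def linear_schedule(step, base_lr=2e-5, warmup_steps=500, total_steps=5000):
    """Linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(linear_schedule(0))     # 0.0 at the very start
print(linear_schedule(500))   # peak of 2e-5 right after warmup
print(linear_schedule(5000))  # 0.0 at the end of training
```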
Best Practices¶
Learning Rate Selection¶
When fine-tuning, it’s crucial to properly set the learning rate. Generally:
- For new layers: higher learning rate (1e-3 to 1e-4)
- For pre-trained layers: lower learning rate (1e-5 to 1e-6)
- Gradual reduction with continued training
```python
# Differentiated learning rates for different parts of the model
# (assumes the model exposes its backbone as `model.backbone`
# and its classification head as `model.fc`)
def get_optimizer_grouped_parameters(model, backbone_lr=1e-5, head_lr=1e-3):
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {   # backbone weights: low LR, with weight decay
            "params": [p for n, p in model.backbone.named_parameters()
                       if not any(nd in n for nd in no_decay)],
            "weight_decay": 0.01,
            "lr": backbone_lr,
        },
        {   # backbone biases and LayerNorm weights: low LR, no weight decay
            "params": [p for n, p in model.backbone.named_parameters()
                       if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
            "lr": backbone_lr,
        },
        {   # freshly initialized head: higher LR
            "params": model.fc.parameters(),
            "lr": head_lr,
        },
    ]
    return torch.optim.AdamW(optimizer_grouped_parameters)
```
Data Augmentation and Regularization¶
With smaller datasets, it’s important to prevent overfitting:
```python
import torchvision.transforms as transforms

# Augmentation for computer vision
transform = transforms.Compose([
    transforms.RandomRotation(15),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),  # Normalize expects a tensor, not a PIL image
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```
```python
# Dropout in a custom classification head
class FineTunedModel(nn.Module):
    def __init__(self, base_model, num_classes):
        super().__init__()
        in_features = base_model.fc.in_features
        base_model.fc = nn.Identity()  # keep the backbone as a pure feature extractor
        self.backbone = base_model
        self.dropout = nn.Dropout(0.3)
        self.classifier = nn.Linear(in_features, num_classes)

    def forward(self, x):
        features = self.backbone(x)  # (batch, in_features), already pooled and flattened
        dropped = self.dropout(features)
        return self.classifier(dropped)
```
Practical Tips for Successful Transfer¶
Domain similarity: The more similar the source and target domains are, the better results we can expect. A model trained on general photographs will adapt better to medical images than to satellite data.
Dataset size: For small datasets (hundreds of samples), feature extraction is a safer choice. For larger datasets (thousands of samples), we can experiment with fine-tuning.
Gradual unfreezing: Instead of unfreezing all layers at once, we unfreeze them gradually, from the top (task-specific) layers down:
```python
def gradual_unfreeze_schedule(model, optimizer, epoch):
    """Gradually unfreeze layers (top to bottom) as training progresses."""
    if epoch == 5:
        # From epoch 5, unfreeze the top backbone layers
        for param in model.backbone.layer4.parameters():
            param.requires_grad = True
    if epoch == 10:
        # From epoch 10, additional layers
        for param in model.backbone.layer3.parameters():
            param.requires_grad = True
    # Halve the learning rate at each unfreeze milestone (the unfrozen
    # parameters must already be present in the optimizer's param groups)
    if epoch in (5, 10):
        for param_group in optimizer.param_groups:
            param_group['lr'] *= 0.5
```
Summary¶
Transfer Learning represents a fundamental change in the approach to machine learning. Instead of training models from scratch, we leverage collective “knowledge” stored in pre-trained models. The key to success is the proper choice of strategy (feature extraction vs. fine-tuning), careful setting of learning rates for different parts of the model, and a gradual approach to unfreezing layers. With the growth of pre-trained models such as foundation models, transfer learning is becoming an even more important tool for efficient AI application development.