
Batch Normalization — Stable Network Training

21. 09. 2024 · 4 min read · intermediate

Batch Normalization ranks among the most important innovations in deep learning in recent years. This technique normalizes inputs to each layer during training, significantly stabilizing and accelerating the entire neural network learning process.

Batch Normalization (BatchNorm) is among the most important techniques in modern deep learning. It solved fundamental problems with training deep networks and enabled building models with dozens or hundreds of layers. Let’s show how it works and why it’s so effective.

The Problem of Gradient Instability

When training deep neural networks, we encounter a phenomenon called internal covariate shift. Activations in deeper layers change dramatically during training, causing:

  • Exploding/vanishing gradients
  • Slow convergence
  • Need for very careful weight initialization
  • Sensitivity to learning rate choice

BatchNorm solves these problems by normalizing activations at the mini-batch level, stabilizing the distribution of data flowing through the network.

How BatchNorm Works

Batch Normalization applies the following transformation to each feature in a mini-batch:

# For each feature x_i in the batch
μ = mean(x_i)           # Mean across batch
σ² = variance(x_i)      # Variance across batch
x̂_i = (x_i - μ) / √(σ² + ε)  # Normalization
y_i = γ * x̂_i + β      # Scaling and shift

Parameters γ (gamma) and β (beta) are learnable: they let the network scale and shift the normalized values, so each layer can learn its optimal distribution — including recovering the identity transform if normalization turns out to be harmful for that layer.
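
To confirm that this formula is exactly what PyTorch computes, we can sketch it by hand and compare against `nn.BatchNorm1d` (a minimal check; the helper name `batch_norm_manual` is ours):

```python
import torch
import torch.nn as nn

def batch_norm_manual(x, gamma, beta, eps=1e-5):
    """Apply the BatchNorm formula to a (batch, features) tensor."""
    mu = x.mean(dim=0)                  # mean across the batch
    var = x.var(dim=0, unbiased=False)  # biased variance, as BatchNorm uses
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta

x = torch.randn(32, 128)
gamma = torch.ones(128)   # matches a freshly initialized layer
beta = torch.zeros(128)

manual = batch_norm_manual(x, gamma, beta)
reference = nn.BatchNorm1d(128)(x)  # fresh layer: γ=1, β=0, training mode

print(torch.allclose(manual, reference, atol=1e-4))  # True
```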

Implementation in PyTorch

Using BatchNorm in PyTorch is very straightforward:

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)      # BatchNorm after convolution, before activation
        x = self.relu(x)
        return x

# Usage
model = nn.Sequential(
    ConvBlock(3, 64),
    ConvBlock(64, 128),
    ConvBlock(128, 256),
)

When and Where to Apply BatchNorm

The standard placement is after the linear/convolutional layer but before the activation function — the order proposed in the original paper. Some practitioners have since reported good results with BatchNorm after the activation, but pre-activation placement remains the common default.

# Recommended order
x = conv_layer(x)
x = batch_norm(x)
x = activation(x)

# For fully connected layers
class MLPBlock(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.bn = nn.BatchNorm1d(out_features)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.linear(x)))

Training vs. Inference Mode

BatchNorm behaves differently during training and inference. During training, it uses current batch statistics; during inference, it uses running averages:

model = YourModel()

# Training mode
model.train()
for batch in dataloader:
    # BatchNorm uses current batch statistics
    output = model(batch)

# Inference mode
model.eval()
with torch.no_grad():
    # BatchNorm uses stored running averages
    prediction = model(test_input)
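
The mode switch is visible directly in the layer's buffers. A quick probe of `running_mean` shows it updating in train mode and freezing in eval mode (default `momentum=0.1`):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
print(bn.running_mean)  # starts at zeros

bn.train()
x = torch.randn(32, 4) + 5.0  # batch with per-feature mean around 5
bn(x)                         # forward pass in train mode updates the buffers

# running_mean moved toward the batch mean:
# new = (1 - momentum) * old + momentum * batch_mean
print(bn.running_mean)        # roughly 0.5 in each feature

bn.eval()
before = bn.running_mean.clone()
bn(torch.randn(8, 4))         # eval forward: buffers stay unchanged
print(torch.equal(before, bn.running_mean))  # True
```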

BatchNorm vs. Layer Normalization

Layer Normalization (LayerNorm) is an alternative especially useful for sequences and transformers:

# BatchNorm - normalization across the batch dimension
batch_norm = nn.BatchNorm1d(128)

# LayerNorm - normalization across the feature dimension
layer_norm = nn.LayerNorm(128)

# Difference in behavior:
x = torch.randn(32, 128)  # (batch_size, features)

# BatchNorm: μ, σ computed across the batch dimension (32 samples)
bn_out = batch_norm(x)

# LayerNorm: μ, σ computed across the feature dimension (128 values per sample)
ln_out = layer_norm(x)
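
The axis difference is easy to verify numerically: after LayerNorm every *row* (sample) has mean ≈ 0, while after BatchNorm in train mode every *column* (feature) does. A small self-contained check under default settings:

```python
import torch
import torch.nn as nn

x = torch.randn(32, 128)  # (batch_size, features)

ln_out = nn.LayerNorm(128)(x)     # normalizes each sample over its features
bn_out = nn.BatchNorm1d(128)(x)   # normalizes each feature over the batch

# LayerNorm: per-sample means are ~0
print(ln_out.mean(dim=1).abs().max() < 1e-4)  # True

# BatchNorm: per-feature means are ~0 across the batch
print(bn_out.mean(dim=0).abs().max() < 1e-4)  # True
```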

Practical Benefits of BatchNorm

Implementing BatchNorm brings several concrete benefits:

  • Higher learning rates - network is more stable, we can train faster
  • Less dependence on initialization - BatchNorm “corrects” poor initialization
  • Regularization effect - noise from mini-batches partially replaces dropout
  • Faster convergence - often severalfold fewer steps to the same accuracy (the original paper reports matching a baseline with up to 14x fewer training steps)

Potential Problems and Solutions

BatchNorm has its pitfalls. With very small batch sizes (< 4), it can be unstable:

# For small batches, use GroupNorm or LayerNorm
import torch.nn as nn

# GroupNorm - divides channels into groups
group_norm = nn.GroupNorm(num_groups=8, num_channels=64)

# For batch_size = 1 (inference), BatchNorm must use the stored
# running statistics -- switch the model to eval mode explicitly
model.eval()  # Important!
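
The batch_size = 1 failure mode is easy to reproduce: in train mode, BatchNorm cannot estimate a variance from a single sample and raises an error, while eval-mode BatchNorm and GroupNorm both handle it (an illustrative sketch with default parameters):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64)  # batch of a single sample

bn = nn.BatchNorm1d(64)
try:
    bn(x)  # train mode: variance over one sample is undefined
except ValueError as e:
    print("BatchNorm in train mode failed:", type(e).__name__)

bn.eval()
print(bn(x).shape)  # torch.Size([1, 64]) -- eval mode uses running stats

gn = nn.GroupNorm(num_groups=8, num_channels=64)
print(gn(x).shape)  # torch.Size([1, 64]) -- batch-size independent
```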

Modern Variants

Development continues with new techniques like Batch Renormalization or FRN (Filter Response Normalization), which solve specific BatchNorm problems in various scenarios.

Summary

Batch Normalization is a key technique for training deep networks. It stabilizes gradient flow, enables higher learning rates, and accelerates convergence. In CNNs, we place BatchNorm after the convolutional layer and before the activation; in RNNs and transformers, we prefer LayerNorm. Don't forget to switch the model to eval() mode during inference — otherwise BatchNorm will keep normalizing with per-batch statistics (and updating its running averages) instead of using the stored running averages.

Tags: batchnorm, layernorm, deep learning

CORE SYSTEMS team

We build core systems and AI agents that keep operations running. 15 years of enterprise IT experience.