Kubernetes can scale. But configuring autoscaling so that it responds to real load, doesn't waste resources, and doesn't collapse under peak traffic is an art.
Three layers of autoscaling
- HPA (Horizontal Pod Autoscaler): adds and removes pods; the fit for stateless services (a baseline sketch follows this list)
- VPA (Vertical Pod Autoscaler): adjusts pods' CPU/RAM requests and limits; useful for workloads that can't scale horizontally, such as monoliths
- Cluster Autoscaler: adds and removes nodes as pod demand changes
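To make the first layer concrete, here is a minimal sketch of the baseline: a CPU-based HPA for a hypothetical `api` Deployment. The name, replica bounds, and 70% target are illustrative, not production values.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api               # hypothetical Deployment name
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # add pods when average CPU exceeds 70% of requests
```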
Custom metrics instead of CPU
By default, HPA scales on CPU utilization, which is often a poor proxy for real load. Through Prometheus Adapter we added requests/sec, p95 latency, and queue depth as scaling metrics. Now HPA scales on what actually matters.
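As an illustration, a sketch of such an HPA, assuming Prometheus Adapter already exposes a per-pod `http_requests_per_second` metric through the custom.metrics.k8s.io API. The metric name, the target value, and the adapter rule in the comments are assumptions that depend entirely on how your adapter config is written.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # hypothetical name produced by an adapter rule
      target:
        type: AverageValue
        averageValue: "100"              # aim for roughly 100 req/s per pod
# Illustrative Prometheus Adapter rule that could derive the metric above:
# rules:
# - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
#   resources:
#     overrides:
#       namespace: {resource: namespace}
#       pod: {resource: pod}
#   name:
#     matches: "^(.*)_total$"
#     as: "${1}_per_second"
#   metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

Several metrics can be listed side by side; HPA takes the highest replica count any of them asks for, which is what lets latency or queue depth override a quiet CPU.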
Overprovisioning for fast scale-up
Provisioning a new AKS node takes 3-5 minutes. Our solution: keep a "spare" node occupied only by low-priority pause containers, so capacity is immediately available to real workloads, while Cluster Autoscaler adds a replacement node in the background.
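The usual way to build this is a placeholder Deployment of pause containers under a negative PriorityClass: the scheduler preempts them the moment real pods need the room. A sketch, with illustrative names and sizes:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                  # lower than any real workload, so placeholders are preempted first
globalDefault: false
description: "Placeholder pods that reserve headroom for fast scale-up"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2               # how much headroom to keep warm; tune to your node size
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1"        # size requests so placeholders occupy roughly one node
            memory: 2Gi
```

Sizing the placeholder requests to roughly one node's allocatable capacity means evicting them frees a whole node's worth of headroom at once.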
Spot instances: 60-80% savings
For fault-tolerant workloads (batch jobs, CI/CD, dev environments) we use Azure Spot VMs in a dedicated node pool. Production always runs on on-demand capacity.
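Steering workloads onto that pool comes down to matching the taint and label AKS puts on spot node pools (`kubernetes.azure.com/scalesetpriority=spot`). A sketch for a hypothetical batch Job; the name and image are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-batch      # hypothetical fault-tolerant workload
spec:
  template:
    spec:
      restartPolicy: OnFailure
      tolerations:
      - key: kubernetes.azure.com/scalesetpriority
        operator: Equal
        value: "spot"
        effect: NoSchedule          # tolerate the taint AKS applies to spot pools
      nodeSelector:
        kubernetes.azure.com/scalesetpriority: spot   # and run only on the spot pool
      containers:
      - name: worker
        image: myregistry.azurecr.io/batch-worker:latest   # illustrative image
        resources:
          requests:
            cpu: "500m"
            memory: 1Gi
```

Because the taint keeps everything else off the spot pool, production pods can never land on a node that may be evicted.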
Biggest mistake: wrong resource requests
Developers were requesting 2 CPU and 4 GB RAM "just to be safe" while real utilization sat around 15%, so Cluster Autoscaler kept adding nodes for capacity nobody used. The fix: run VPA in recommendation mode and right-size requests from its data.
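A minimal sketch of that setup, assuming a Deployment named `api`: with `updateMode: "Off"`, VPA only publishes recommendations and never evicts or restarts pods.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api          # hypothetical Deployment name
  updatePolicy:
    updateMode: "Off"  # recommendation only: compute targets, never touch running pods
```

The recommendations then appear in the object's status (e.g. via `kubectl describe vpa api`), and teams adjust their requests from that data instead of guessing.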
Autoscaling requires investment
It's not "set it and forget it". It takes proper metrics, realistic requests, and continuous tuning, but the reward is a system that rides out traffic peaks automatically.