Kubernetes can run anything. That’s its strength and its problem. Without a clear production checklist, the container orchestrator becomes an incident generator. This article walks through the 15 points we check on every cluster before marking it production-ready. No theory, just concrete configurations, YAML snippets, and an explanation of why each point matters.
Why Kubernetes Deployments Fail in Production¶
Most K8s incidents we deal with aren’t related to bugs in Kubernetes itself. The problem is almost always in the configuration — in what the team skipped because it “wasn’t necessary for the dev environment.” Three reasons dominate our incident statistics.
Missing Resource Limits and Requests¶
A single pod without limits can consume all of a node’s memory and crash other workloads. In dev that doesn’t matter — in production it means a 3 AM PagerDuty alert. Without resource requests, the scheduler can’t reasonably place pods and the cluster becomes unpredictable. This is the number one cause of OOM kills we see at clients.
No Health Checks¶
Kubernetes without liveness and readiness probes doesn’t know if your application is working or just taking up space. A pod can hang in a deadlock and stop serving requests, but Kubernetes considers it healthy because the process still exists. Result: users get 502s and you don’t know why — after all, all pods are “Running.”
Flat Network Without Policies¶
By default in Kubernetes, every pod can communicate with every other pod. That’s convenient for development and catastrophic for security. A compromised pod in one namespace has free access to a database in another namespace. Without network policies, the entire cluster is one big flat network with zero segmentation.
These three problems share a common denominator: they arise because configuration that works in development gets copied to production without modifications. Our checklist exists to systematically close this gap.
Cluster Setup (4 points)¶
1. Multi-Node Control Plane¶
A single control plane node means a single point of failure for the entire cluster. In production, we run a minimum of 3 master nodes spread across availability zones. Etcd needs a majority of members (quorum) to keep working, so run an odd number of them: 3 nodes tolerate one failure, 5 nodes tolerate two. On managed services (EKS, AKS, GKE), the provider handles this, but for self-managed clusters, this is the first thing we check.
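For self-managed clusters built with kubeadm, a minimal sketch of the HA setup might look like the following. The load balancer address cp.example.internal is a hypothetical placeholder; the remaining control plane nodes join with kubeadm join --control-plane.

```yaml
# Sketch only — kubeadm ClusterConfiguration for a multi-node control plane.
# Assumes a load balancer at cp.example.internal:6443 (hypothetical) in front
# of three control plane nodes running stacked etcd.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.29.0
controlPlaneEndpoint: "cp.example.internal:6443"
etcd:
  local:
    dataDir: /var/lib/etcd
```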
2. Node Auto-Scaling with Defined Limits¶
The cluster autoscaler must have both minimum and maximum node counts set. Without an upper limit, a runaway deployment can spin up dozens of nodes and an astronomical cloud bill. Without a lower limit, the autoscaler can shrink the cluster so much that remaining nodes can’t handle even baseline traffic. We define both and add billing alerts.
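What this looks like depends on the platform. On EKS with eksctl, for example, a node group with explicit bounds could be sketched like this (cluster and node group names are hypothetical):

```yaml
# Sketch — eksctl node group with explicit scaling bounds; the cluster
# autoscaler respects the min/max derived from minSize and maxSize.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: prod-cluster          # hypothetical cluster name
  region: eu-central-1
managedNodeGroups:
  - name: general
    instanceType: m6i.large
    minSize: 3
    maxSize: 10
    desiredCapacity: 3
```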
3. Etcd Backup and Recovery¶
Etcd is the brain of the cluster — it contains all state. Regular etcd database snapshots are non-negotiable. We automate backups every 30 minutes to external storage (S3, GCS) and every quarter test the restore procedure. A backup nobody has tested isn’t a backup — it’s a wish.
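One way to automate this on a kubeadm-style cluster is a CronJob pinned to a control plane node. The sketch below is illustrative only (image, certificate paths, and the backup directory are assumptions); the snapshots still need to be shipped off-node to S3/GCS.

```yaml
# Sketch — etcd snapshot every 30 minutes on a kubeadm-style control plane.
# Assumes kubeadm certificate paths and an image that ships etcdctl; ship the
# /var/backups/etcd directory to S3/GCS with your tool of choice.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "*/30 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
          restartPolicy: OnFailure
          containers:
            - name: snapshot
              image: bitnami/etcd:3.5   # any image with etcdctl works
              command:
                - /bin/sh
                - -c
                - |
                  ETCDCTL_API=3 etcdctl \
                    --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
                    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
                    snapshot save /backup/etcd-$(date +%Y%m%d-%H%M).db
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup
                  mountPath: /backup
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
            - name: backup
              hostPath:
                path: /var/backups/etcd
                type: DirectoryOrCreate
```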
4. Cluster Upgrade Strategy¶
Kubernetes releases a minor version every 4 months and supports the last 3 versions. Skipping two versions means a painful upgrade with breaking changes. We keep clusters at most one version behind the current stable, upgrade with a rolling strategy node by node, and always staging first, then production.
Workload Configuration (4 points)¶
5. Resource Requests and Limits on Every Pod¶
Every container must declare how much CPU and memory it needs (request) and how much it may consume at most (limit). Request determines scheduling — Kubernetes places the pod on a node with enough free resources. Limit protects other workloads — exceeding the memory limit leads to an OOM kill, not crashing the neighbors.
```yaml
# Resource requests & limits — realistic values
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1000m"
    memory: "512Mi"
```
6. Liveness, Readiness, and Startup Probes¶
Three types of probes, three different purposes. Liveness detects deadlocks — if it fails, Kubernetes restarts the container. Readiness controls traffic — if it fails, the pod is removed from Service endpoints. Startup delays the other two for slow-starting applications, so they aren’t killed before they finish booting. Without them, Kubernetes can’t tell the difference between a healthy and a dead pod.
```yaml
# Probes — HTTP check with reasonable timeouts
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 5
```
7. Pod Disruption Budgets¶
A PDB defines how many pods must remain available during voluntary disruptions (node drain, cluster upgrade, autoscaling). Without a PDB, an upgrade can take down all replicas at once. For critical services, we set minAvailable: 50% or maxUnavailable: 1 — ensuring at least half the pods serve traffic even during maintenance.
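A minimal sketch for a hypothetical checkout service:

```yaml
# PDB — keep at least half the checkout replicas up during voluntary disruptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout
  namespace: production
spec:
  minAvailable: "50%"
  selector:
    matchLabels:
      app: checkout
```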
8. Anti-Affinity and Topology Spread¶
All replicas on one node = one hardware failure takes down the entire service. Pod anti-affinity rules spread replicas across nodes, topology spread constraints spread them across zones. It’s insurance against correlated failures — and costs exactly zero extra.
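In a Deployment’s pod template, that combination might look like this sketch for a hypothetical api service:

```yaml
# Pod template fragment — hard anti-affinity across nodes, soft spread across zones
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: api
        topologyKey: kubernetes.io/hostname
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: api
```

Note that required anti-affinity can block scheduling when nodes are scarce; the preferredDuringSchedulingIgnoredDuringExecution variant trades strictness for schedulability.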
Observability (4 points)¶
9. Metrics Pipeline (Prometheus + Grafana)¶
Without metrics, you’re flying blind. Prometheus scrapes metrics from kubelet, kube-state-metrics, and your applications. Grafana visualizes them. Minimum dashboard: CPU/memory per node and pod, request rate, error rate, latency (RED metrics). Kube-prometheus-stack delivers this in a single Helm install — there’s no reason not to have it.
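With kube-prometheus-stack in place, scraping your own application is one small CRD. The sketch below assumes a hypothetical api service exposing /metrics and a Helm release named kube-prometheus-stack:

```yaml
# Sketch — ServiceMonitor telling Prometheus to scrape the api service
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api
  namespace: production
  labels:
    release: kube-prometheus-stack   # must match the serviceMonitorSelector of your install
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
    - port: http        # named port on the Service
      path: /metrics
      interval: 30s
```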
10. Centralized Logging¶
kubectl logs doesn’t work when the pod doesn’t exist. In production, you need logs to survive the pod’s lifetime — we ship them to a central system (Loki, Elasticsearch, CloudWatch). Structured logs (JSON) with correlation IDs enable tracing a request across services. Without this, debugging a distributed application is detective work without evidence.
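As an illustration of the Loki route, a trimmed Promtail config might look like this (the Loki address is an assumption; the official Helm chart generates the full version, including pipeline stages for JSON parsing):

```yaml
# Sketch — Promtail shipping pod logs to an in-cluster Loki
server:
  http_listen_port: 9080
positions:
  filename: /run/promtail/positions.yaml
clients:
  - url: http://loki.observability.svc:3100/loki/api/v1/push   # assumed Loki service
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      # map discovered pods to their log files on the node
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log
```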
11. Distributed Tracing¶
In a microservices architecture, a single request passes through dozens of services. When latency increases, you need to know where. OpenTelemetry is today’s standard — you instrument applications once and send traces to Jaeger, Tempo, or Datadog. Connecting traces with logs and metrics (three pillars of observability) is what makes the difference between “it’s probably slow” and “I know exactly why.”
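A minimal OpenTelemetry Collector configuration forwarding everything to Tempo could look like this sketch (the Tempo endpoint is an assumption; swap the exporter for Jaeger or a vendor backend as needed):

```yaml
# Sketch — Collector receives OTLP from the apps and forwards traces to Tempo
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  batch: {}
exporters:
  otlp:
    endpoint: tempo.observability.svc:4317   # assumed in-cluster Tempo
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```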
12. Alerting with Runbooks¶
An alert without a runbook is noise. Every alert in Alertmanager must have a link to a runbook — a document that says what the alert means, how to verify the cause, and how to resolve the problem. We alert on symptoms (error rate > 1%), not causes (CPU > 80%). And we have three severity levels: critical (wake people up), warning (resolve by morning), info (look during business hours).
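In practice that means every rule carries a runbook_url annotation. A sketch with an illustrative error-rate expression and a hypothetical runbook URL:

```yaml
# Sketch — symptom-based alert with a runbook link (PrometheusRule CRD)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-slo-alerts
  namespace: monitoring
spec:
  groups:
    - name: api.slo
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="api"}[5m])) > 0.01
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "API error rate above 1%"
            runbook_url: https://runbooks.example.internal/api/high-error-rate
```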
Security (3 points)¶
13. Network Policies¶
Default-deny incoming traffic and explicitly allow only the communication that’s needed. The database should accept connections only from backend pods, not from anything in the cluster. Network policies are the Kubernetes version of firewall rules — and should be just as strict.
```yaml
# Network Policy — default deny + allow for backend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-allow-backend-only
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: backend
      ports:
        - port: 5432
```
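To lock down the rest of the namespace as well, pair explicit allow rules like the one above with a namespace-wide default deny. A minimal sketch:

```yaml
# Default deny — pods in the namespace accept no ingress unless a policy allows it
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
```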
14. RBAC and Least Privilege¶
Every application runs under its own ServiceAccount with minimum permissions. No cluster-admin for CI/CD pipelines, no wildcard permissions. We define Roles and ClusterRoles granularly — an application that reads ConfigMaps doesn’t need the right to delete Secrets. We regularly audit permissions using kubectl auth can-i --list and remove unused roles.
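A granular Role and its binding for a hypothetical api ServiceAccount, as a sketch:

```yaml
# Sketch — least-privilege Role: the api ServiceAccount can only read ConfigMaps
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: config-reader
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: api-config-reader
  namespace: production
subjects:
  - kind: ServiceAccount
    name: api
    namespace: production
roleRef:
  kind: Role
  name: config-reader
  apiGroup: rbac.authorization.k8s.io
```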
15. Pod Security Standards and Image Scanning¶
Containers don’t run as root, don’t have privileged mode, and can’t escalate privileges. Pod Security Admission (successor to PodSecurityPolicy) enforces these standards at the namespace level. On top of that, image scanning in CI/CD — Trivy or Grype check CVEs in base images before deploy, not after an incident. Supply chain security starts at the image.
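Enforcing this takes two pieces: a namespace label for Pod Security Admission and a container securityContext that satisfies the restricted profile. A sketch:

```yaml
# Namespace label — Pod Security Admission enforces the "restricted" profile
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
---
# Pod template fragment — container securityContext that passes the restricted profile
securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault
```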
Anti-Patterns We See Again and Again¶
The checklist says what to do. Equally important is knowing what not to do. We see these anti-patterns at clients again and again, and they always cost time, money, or both.
“latest” Tag in Production¶
The :latest tag is mutable — today’s latest is different from tomorrow’s. Result: two pods with the same deployment manifest run different code versions. Always use immutable tags — SHA digests or semantic versions.
kubectl apply from a Laptop¶
Manual deploys bypass code review, audit trails, and rollback mechanisms. Everything to production goes through GitOps (ArgoCD, Flux) — Git is the single source of truth, nobody deploys from CLI.
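What “Git is the single source of truth” looks like in practice, as a sketch of an Argo CD Application (repository URL and paths are hypothetical):

```yaml
# Sketch — Argo CD syncs the production overlay from Git and self-heals drift
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.internal/platform/deployments.git
    targetRevision: main
    path: apps/api/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```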
Secrets in Environment Variables¶
Env vars are visible in kubectl describe pod, in logs, and in crash dumps. Secrets belong in Kubernetes Secrets (at minimum), ideally in an external secrets manager (Vault, AWS Secrets Manager) with rotation.
One Namespace for Everything¶
A namespace is a security and organizational boundary. Production, staging, and development in one namespace means shared RBAC, resource quotas, and network policies. Separate environments, separate teams, separate risk.
How We Do It at CORE SYSTEMS¶
We operate Kubernetes for clients in banking, logistics, and retail — environments where downtime costs real money and the regulator asks for evidence. Our approach to K8s production readiness has three phases.
Audit. We start with an automated scan of the existing configuration using a tool that checks all 15 points of this checklist plus another 30+ rules specific to the given vertical. The output is a report with a prioritized list of findings — not a generic PDF, but concrete YAML patches you can apply right away.
Implementation. We implement fixes as merge requests into the client’s GitOps repository. Every change goes through code review, is tested on the staging cluster, and deployed via ArgoCD with automatic rollback on health check failure. No manual interventions, full auditability.
Operations. After the hardening phase, we offer continuous monitoring — dashboards, alerting, and monthly compliance reports. Cluster drift (deviation from the defined state) is detected automatically and fixed before it manifests as an incident. Because the cheapest incident is the one that never happens.
Conclusion: Production Readiness Is Not a One-Time Task¶
The 15 points in this checklist are not a one-time task that you check off and forget. It’s a living standard that evolves with your application, team, and infrastructure. A cluster that was production-ready a year ago may not be production-ready today — a new service was added, traffic patterns changed, a new K8s version was released.
The key is automation. Checks in the CI/CD pipeline, policy enforcement via OPA/Kyverno, automatic alerts on drift from baseline. Humans shouldn’t manually check whether a deployment has resource limits — that’s a job for machines. Humans should think about architecture, capacity planning, and how to scale the system to the next order of magnitude.
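As an example of handing that job to machines, a Kyverno policy in the spirit of checklist point 5 might look like this sketch (modeled on Kyverno’s sample require-requests-limits policy):

```yaml
# Sketch — reject pods whose containers have no requests/limits defined
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-container-resources
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "CPU and memory requests and a memory limit are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
                  limits:
                    memory: "?*"
```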
Need help with implementation?
Our experts can help with design, implementation, and operations. From architecture to production.
Contact us