
Docker in Production — Lessons from the First Year

22. 03. 2017 · 4 min read · CORE SYSTEMS · infrastructure

In March 2016, we deployed our first Docker container to production. A year later, we have over 40 services running in containers. Here’s what we learned — including the mistakes we wish we hadn’t made.

Why We Started with Docker

A classic story: “it works on my machine.” Developers had Ubuntu 16.04 on their laptops, staging ran CentOS 7, production ran Red Hat. Every deploy was an adventure. Libraries in different versions, different system dependencies, different configurations. Docker was supposed to be the solution — and it truly is. But the path wasn’t straightforward.

The initial impulse came from the development team, who wanted faster onboarding of new members. Instead of a two-day local environment setup, docker-compose up had the entire stack running in five minutes. That alone sold Docker to management.

Image Management — Where It Begins and Ends

Rule number one: never use the latest tag in production. It sounds trivial, but we violated it in the first months. The result? Non-reproducible builds and “but it worked yesterday.” Today, every image tag is derived from the git commit — short SHA hash plus build number.

# Bad
FROM node:latest

# Good
FROM node:8.9.4-alpine
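
The build tag itself comes out of CI. A minimal sketch of what that step can look like (the BUILD_NUMBER variable and the api image name are illustrative, not our exact pipeline):

# Tag derived from the git commit plus the CI build number (illustrative)
GIT_SHA=$(git rev-parse --short HEAD)
docker build -t registry.core.cz/api:${GIT_SHA}-${BUILD_NUMBER} .
docker push registry.core.cz/api:${GIT_SHA}-${BUILD_NUMBER}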

Multi-stage builds were a game changer. Previously, our build images were 1.2 GB (Node.js apps with devDependencies, build toolchain, source code). After switching to multi-stage builds, the production image shrank to 89 MB. Smaller image = faster pull = faster deploy = smaller attack surface.
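
A minimal sketch of the multi-stage pattern for a Node.js service (directory layout and npm scripts are illustrative):

# Stage 1: build with devDependencies and the full toolchain
FROM node:8.9.4-alpine AS build
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm install
COPY . .
RUN npm run build

# Stage 2: production image with runtime dependencies only
FROM node:8.9.4-alpine
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm install --production
COPY --from=build /app/dist ./dist
USER node
CMD ["node", "dist/server.js"]

Only the second stage ends up in the registry, which is where the size drop comes from.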

We run our own Docker registry based on Harbor. Reasons: data control (some projects are under NDA), vulnerability scanning integrated directly into the push pipeline, and garbage collection of old images. Docker Hub is fine for open source, not for enterprise.

Logging — Don’t Make Our Mistake

For the first months, we logged to files inside containers. Yes, exactly as bad as it sounds. Container crashed, logs gone. Debugging production issues became a nightmare.

The solution was switching to centralized logging. Applications write to stdout/stderr (12-factor app principle), the Docker log driver forwards to Fluentd, and from there to Elasticsearch. Grafana and Kibana for visualization. The entire EFK stack runs — of course — in containers.

# docker-compose.yml - logging configuration
services:
  api:
    image: registry.core.cz/api:${BUILD_SHA}
    logging:
      driver: fluentd
      options:
        fluentd-address: "localhost:24224"
        tag: "api.{{.Name}}"

An important detail: structured logging. Not console.log("Error: " + err), but JSON with context — request ID, user ID, timestamp, severity. Without it, searching through millions of log entries is like finding a needle in a haystack.
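
What such an entry can look like (field names are illustrative, not our exact schema):

{
  "timestamp": "2017-03-21T14:02:11.512Z",
  "severity": "error",
  "requestId": "8f3a2c1e",
  "userId": "48213",
  "message": "upstream request timed out after 3000 ms"
}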

Networking — Overlay Networks and Service Discovery

Docker networking is an area where you learn more about TCP/IP than you’d like. Overlay networks work but have overhead. For most workloads it doesn’t matter, but for latency-sensitive services (real-time API, WebSocket), we switched to host networking.
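
Moving a latency-sensitive service to host networking is a single flag (the realtime-api image name is illustrative):

# Host networking: no overlay, the container shares the host's network stack
docker run -d --network host registry.core.cz/realtime-api:${BUILD_SHA}

The trade-off is that port conflicts on the host become a concern again, which is why host networking stays the exception rather than the rule.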

We handle service discovery through Consul. Each container registers on startup, health checks verify availability, and other services find it via DNS. We also tried Docker’s built-in DNS, but Consul offers more — KV store, prepared queries, multi-datacenter support.
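
A minimal Consul service definition for the API container might look like this, reusing the same /health endpoint the container exposes (a sketch, not our exact registration config):

{
  "service": {
    "name": "api",
    "port": 3000,
    "check": {
      "http": "http://localhost:3000/health",
      "interval": "10s",
      "timeout": "3s"
    }
  }
}

Other services then resolve it as api.service.consul via Consul's DNS interface.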

Security — A Container Is Not a VM

This is critical and often underestimated. A container shares the kernel with the host. If an attacker escapes the container, they have access to the entire host. Therefore:

  • Don’t run as root inside the container. Add USER nonroot to the Dockerfile.
  • Read-only filesystem where possible: --read-only flag.
  • Drop capabilities: --cap-drop=ALL --cap-add=NET_BIND_SERVICE — the container only needs what it actually uses (the flags are combined in the run sketch after this list).
  • Scan images for vulnerabilities. Harbor does this automatically via Clair. Every push goes through a scan, and images with critical CVEs don’t make it to production.
  • Update base images. Alpine Linux releases security patches regularly — but you have to rebuild and redeploy.
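
Putting the read-only filesystem and dropped capabilities together, an invocation can look roughly like this (a sketch; the tmpfs mount is an assumption for apps that need a writable /tmp):

# Hardened run: read-only root filesystem, writable /tmp only, minimal capabilities
docker run -d \
  --read-only \
  --tmpfs /tmp \
  --cap-drop=ALL --cap-add=NET_BIND_SERVICE \
  registry.core.cz/api:${BUILD_SHA}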

Monitoring and Health Checks

Docker HEALTHCHECK is a necessity, not a luxury. Without it, Docker doesn’t know if your application is actually working — it only knows the process is running. The difference is enormous. An application can be stuck in a deadlock, have a full connection pool, or be waiting on an unavailable database.

HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1

At the orchestration level (Docker Swarm, Kubernetes), health checks drive rolling updates and automatic restarts of unhealthy containers. Investing in a good health endpoint pays back a hundredfold.
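
In Docker Swarm, for example, the rolling-update behaviour is driven by the deploy section of the compose file, and a task that never becomes healthy stops the rollout. A sketch of what that can look like (replica count and timings are illustrative):

# docker-compose.yml (version 3 syntax, used with docker stack deploy)
services:
  api:
    image: registry.core.cz/api:${BUILD_SHA}
    deploy:
      replicas: 3
      update_config:
        parallelism: 1    # replace one container at a time
        delay: 10s        # wait between batches
        failure_action: pause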

What We’d Avoid Next Time

Overly large containers. Early on, we had monolithic applications in a single container. Today we understand that a container should do one thing well. Microservices architecture and containers go hand in hand.

Ignoring resource limits. A container without a memory limit can consume all the RAM on the host and cause OOM kills of other services. Always set --memory and --cpus.
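
Setting limits is a one-liner at run time (values are illustrative and depend on the service):

# Hard memory cap and CPU limit; without these, one runaway container can starve the host
docker run -d \
  --memory=512m \
  --cpus=1.0 \
  registry.core.cz/api:${BUILD_SHA}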

Underestimating persistent storage. Docker volumes aren’t backed up automatically. Databases in containers? Yes, but with a well-thought-out storage strategy — named volumes, regular backups, tested restore procedures.
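
For named volumes, even a simple tar-based dump is better than nothing. A sketch of one way to do it (the pgdata volume and the /backups path are assumptions):

# Back up a named volume by mounting it read-only into a throwaway container
docker run --rm \
  -v pgdata:/data:ro \
  -v /backups:/backup \
  alpine tar czf /backup/pgdata-$(date +%F).tar.gz -C /data .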

Docker in Production Is Worth the Investment

After a year, we have faster deploys (from hours to minutes), consistent environments from dev to production, and better hardware utilization. But it’s not free — it requires a mindset shift, new tools, and new skills on the team. If you’re considering Docker for production, start with one stateless service, learn the basics, and expand gradually.

docker · containers · production · devops