OTA Updates
Updates without bricked devices.
Staged rollout, A/B partitioning, differential updates, automatic rollback. Never update the entire fleet at once.
Why OTA is critical infrastructure¶
A bad OTA update can brick thousands of devices at once. Unlike a server, an IoT device in the field cannot simply be restarted over SSH, and you cannot drive out to it within the hour. Some are on rooftops, in tunnels, in shipping containers at sea.
OTA update infrastructure for IoT requires a paranoid approach. Every update is a potential brick. Every rollout is a controlled experiment. Every step has a fallback.
We build OTA systems where no device can be permanently bricked and no rollout proceeds without validation.
A/B Partition Schema¶
Principle¶
The device has two boot slots: Partition A (active) and Partition B (inactive):
- Current firmware runs on Partition A
- OTA update is downloaded and written to Partition B
- Bootloader switches to Partition B and restarts
- New firmware passes the boot test (health check)
- Success: Partition B becomes active, A is the backup
- Failure: Bootloader automatically reverts to Partition A
Result: The device always has working firmware. Even in the worst case (corrupted update, incompatible firmware, hardware regression) it returns to the previous state.
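The slot switching described above can be sketched as a small state machine. This is an illustrative model only: the class and method names are hypothetical, and a real bootloader (U-Boot, GRUB, or similar) keeps this state in its own environment storage rather than in Python.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlotState:
    """Hypothetical bookkeeping for a device with two boot slots."""
    active: str = "A"              # slot holding the last confirmed firmware
    pending: Optional[str] = None  # slot on trial, not yet confirmed

    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    def stage_update(self) -> str:
        """Treat the inactive slot as freshly written and boot it on trial."""
        self.pending = self.inactive()
        return self.pending

    def finish_trial(self, boot_test_passed: bool) -> str:
        """Promote the trial slot on success; otherwise keep the old slot."""
        if self.pending is not None and boot_test_passed:
            self.active = self.pending
        self.pending = None
        return self.active
```

In this model a failed trial never touches `active`, which is the property that keeps the device bootable.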
Boot test¶
After the first boot on the new partition, validation runs:
- Hardware check: All sensors and peripherals respond
- Connectivity check: Connection to backend successful
- Application check: Main application starts and reports healthy
- Watchdog: If the system does not confirm health within 120 seconds, the watchdog forces a restart on the old partition
Only after a successful boot test is the new partition marked as “confirmed”. Without confirmation, the device rolls back automatically on the next restart.
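A minimal sketch of the confirmation decision, assuming the checks are plain callables that return True or False. The function name and the injectable `clock` parameter are illustrative; on real hardware the 120-second limit is enforced by the watchdog, not by application code.

```python
import time
from typing import Callable, Iterable

CONFIRM_DEADLINE_S = 120  # watchdog limit quoted in the boot test rules


def run_boot_test(checks: Iterable[Callable[[], bool]],
                  deadline_s: float = CONFIRM_DEADLINE_S,
                  clock: Callable[[], float] = time.monotonic) -> bool:
    """Return True only if every check passes before the deadline."""
    start = clock()
    for check in checks:
        if clock() - start > deadline_s:
            return False  # too slow: stay unconfirmed so the rollback triggers
        if not check():
            return False  # a failed check also leaves the slot unconfirmed
    return clock() - start <= deadline_s
```

The caller would mark the slot as confirmed only when this returns True; any other outcome leaves the rollback path in place.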
Staged Rollout¶
Rollout phases¶
We never update the entire fleet at once. Staged rollout minimises blast radius:
1. Internal testing (0.1%) Dev and QA team on their own devices. Smoke tests, manual validation. Gate: zero critical bugs.
2. Canary (1-5%) Small group of production devices, typically in one location. Automatic monitoring: crash rate, telemetry health, error rate. Gate: metrics same or better than baseline.
3. Early adopters (10-20%) Wider group across locations and hardware variants. Business KPI monitoring — does it work as expected? Gate: no regression in business metrics.
4. General availability (50% → 100%) Gradual expansion to the entire fleet. Monitoring continues. Halt at any time if an anomaly is detected.
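One common way to assign devices to phases is to hash the device ID into a stable bucket. This is a swapped-in technique for illustration: the phases above select internal and canary devices by team and location, not by hash. The cumulative thresholds below use the upper end of each quoted range and are assumptions.

```python
import hashlib

# Cumulative share of the fleet reached by the end of each phase (assumed).
PHASES = [
    ("internal", 0.001),
    ("canary", 0.05),
    ("early_adopters", 0.20),
    ("general", 1.0),
]


def bucket(device_id: str) -> float:
    """Map a device ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def phase_for(device_id: str) -> str:
    """Return the first phase whose cumulative share covers the device."""
    b = bucket(device_id)
    for name, upper in PHASES:
        if b < upper:
            return name
    return PHASES[-1][0]
```

Because the bucket depends only on the device ID, a device stays in the same cohort across releases unless the ID or the thresholds change.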
Automatic evaluation¶
Automatic gate between each phase:
IF crash_rate > baseline * 1.05 THEN halt_rollout
IF telemetry_gap > 5min THEN halt_rollout
IF error_rate > 2% THEN rollback_stage
IF business_kpi < threshold THEN halt_and_investigate
The gate is automatic — no person needs to approve at 3am. A person only intervenes on halt (investigate) or for an override.
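The gate rules above translate fairly directly into code. The sketch below is one possible reading: the field names, the `"proceed"` result and the order in which the rules are checked (rollback first) are assumptions, not part of the rules themselves.

```python
from dataclasses import dataclass


@dataclass
class StageMetrics:
    crash_rate: float           # same unit as baseline_crash_rate
    baseline_crash_rate: float
    telemetry_gap_min: float    # longest gap in telemetry, in minutes
    error_rate: float           # fraction, 0.02 means 2%
    business_kpi: float
    kpi_threshold: float


def evaluate_gate(m: StageMetrics) -> str:
    """Return the action for the current stage based on its metrics."""
    if m.error_rate > 0.02:
        return "rollback_stage"
    if m.crash_rate > m.baseline_crash_rate * 1.05:
        return "halt_rollout"
    if m.telemetry_gap_min > 5:
        return "halt_rollout"
    if m.business_kpi < m.kpi_threshold:
        return "halt_and_investigate"
    return "proceed"
```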
Differential Updates¶
The problem¶
Full firmware image for embedded Linux: 200-500 MB. Over a mobile network (LTE-M, NB-IoT) this takes tens of minutes and costs money. Over satellite it is impractical.
The solution¶
A differential update sends only the changes from the current version:
Binary diff (bsdiff/xdelta): Compares the old and new image at the binary level. Typical size reduction: 60-80%, so a 500 MB image yields a diff of roughly 100-200 MB.
Mender: Open-source OTA platform for embedded Linux (Yocto, Debian). A/B updates, delta updates, device management. Self-hosted or SaaS.
RAUC: Lightweight update framework for embedded Linux. Bundle-based updates, A/B slot management, custom update handlers. Lower overhead than Mender.
SWUpdate: Flexible update agent. Supports various update strategies, scripted updates, recovery mode.
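To make the 60-80% figure for binary diffs concrete, the arithmetic can be written out as below. The link speed passed to `transfer_minutes` is an assumption for illustration, since real LTE-M throughput varies with coverage and network load.

```python
def diff_size_mb(full_mb: float, reduction: float) -> float:
    """Size of a diff given the full image size and the fractional reduction."""
    return full_mb * (1.0 - reduction)


def transfer_minutes(size_mb: float, link_mbit_s: float) -> float:
    """Transfer time in minutes, ignoring protocol overhead and retries."""
    return size_mb * 8 / link_mbit_s / 60
```

With an assumed sustained 1 Mbit/s, `transfer_minutes(500, 1.0)` is roughly 67 minutes for the full image, while a diff at 80% reduction (about 100 MB) takes roughly 13 minutes.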
Resumable downloads¶
An interrupted download continues from where it left off:
- HTTP Range requests for partial download
- Checksum per chunk (not just the whole file)
- Corrupted chunk re-downloaded, not the entire update
- Download runs in the background with no impact on normal operation
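A sketch of how per-chunk checksums and Range requests fit together, assuming the server publishes a list of SHA-256 digests, one per fixed-size chunk. The function names and the manifest layout are hypothetical; the code only computes which byte ranges to request and performs no network I/O.

```python
import hashlib
from typing import List


def chunk_digests(data: bytes, chunk_size: int) -> List[str]:
    """SHA-256 digest of each fixed-size chunk (the last one may be shorter)."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]


def ranges_to_fetch(partial: bytes, expected: List[str],
                    total_size: int, chunk_size: int) -> List[str]:
    """Return HTTP Range header values for chunks that are missing or corrupt."""
    headers = []
    for index, digest in enumerate(expected):
        start = index * chunk_size
        end = min(start + chunk_size, total_size) - 1  # Range ends are inclusive
        local = partial[start:end + 1]
        incomplete = len(local) != end - start + 1
        if incomplete or hashlib.sha256(local).hexdigest() != digest:
            headers.append(f"bytes={start}-{end}")
    return headers
```

Each returned value would be sent as the `Range` header of a separate request, so only the damaged or missing chunks travel over the link again.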
Security¶
Firmware signing¶
Every firmware image is digitally signed:
- Build server creates the image
- Signing server (HSM-backed) signs the image using a private key
- Device verifies the signature using an embedded public key before installation
- Invalid signature = update rejected, alert to management console
Secure boot chain: Bootloader → kernel → rootfs → application. Each stage verifies the signature of the next before handing over control, so an attacker who compromises the application cannot tamper with the kernel undetected.
Anti-rollback¶
Every firmware image carries a version number, and the device tracks the installed version in a monotonic counter. Firmware with a version number lower than the current one cannot be installed. This protects against downgrade attacks: an attacker cannot install an older version with a known vulnerability.
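The version check can be sketched as follows, assuming dotted numeric version strings such as `1.4.2`. The class is an in-memory stand-in with hypothetical names; on a real device the floor would live in tamper-resistant storage so that it cannot be reset.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


class AntiRollback:
    """In-memory stand-in for a monotonic version floor."""

    def __init__(self, minimum: str = "0.0.0") -> None:
        self._floor = parse_version(minimum)

    def allows(self, candidate: str) -> bool:
        """Reject anything lower than the currently recorded version."""
        return parse_version(candidate) >= self._floor

    def commit(self, installed: str) -> None:
        """Raise the floor after a confirmed install; it never moves down."""
        version = parse_version(installed)
        if version < self._floor:
            raise ValueError("refusing to lower the version floor")
        self._floor = version
```

Raising the floor only after the boot test has confirmed the new firmware keeps the previous version installable during the trial boot.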
Encryption¶
Firmware image encrypted in transit (TLS) and at rest on the device (AES-256). Decryption key in a secure enclave (TPM, TEE). IP protection — firmware cannot be extracted and analysed from a stolen device.
Scheduling and constraints¶
Not every moment is suitable for an update:
- Maintenance window: Updates only outside peak hours (production lines: night shift)
- Battery level: At least 30% battery, or connected to power
- Connectivity: Wi-Fi preferred. Cellular only for critical patches. No updates over NB-IoT (too slow)
- User consent: Consumer devices — notification and confirmation. Industrial devices — automatic during maintenance window
- Dependencies: If firmware requires a new backend version, the backend updates first
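The constraints above can be combined into a single eligibility check. This is a minimal sketch: the field names and network labels are hypothetical, and for industrial devices `user_consented` is assumed to be set to True by policy.

```python
from dataclasses import dataclass


@dataclass
class DeviceStatus:
    in_maintenance_window: bool
    battery_pct: int
    on_external_power: bool
    network: str          # "wifi", "cellular" or "nb-iot" (assumed labels)
    user_consented: bool
    backend_ready: bool   # required backend version is already deployed


def may_update(d: DeviceStatus, critical_patch: bool = False) -> bool:
    """Return True only if every scheduling constraint is satisfied."""
    if not d.in_maintenance_window:
        return False
    if d.battery_pct < 30 and not d.on_external_power:
        return False
    if d.network == "cellular":
        if not critical_patch:
            return False  # cellular is reserved for critical patches
    elif d.network != "wifi":
        return False      # NB-IoT and unknown networks never update
    return d.user_consented and d.backend_ready
```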
Fleet-wide coordination¶
We never update all devices at the same location simultaneously. Staggered rollout per location — if one location has a problem, the others are not affected.
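One possible reading of this rule is to split each location's devices across several waves. The sketch below assumes that interpretation; the function name and the `max_share` parameter are illustrative, and a location with a single device necessarily updates in one wave.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def location_waves(devices: Iterable[Tuple[str, str]],
                   max_share: float = 0.5) -> List[List[str]]:
    """Split (device_id, location) pairs into waves.

    Each wave contains at most max_share of any one location's devices,
    with a minimum of one device per location per wave.
    """
    by_location: Dict[str, List[str]] = defaultdict(list)
    for device_id, location in devices:
        by_location[location].append(device_id)

    waves: List[List[str]] = []
    for ids in by_location.values():
        quota = max(1, int(len(ids) * max_share))
        for wave_index, start in enumerate(range(0, len(ids), quota)):
            while len(waves) <= wave_index:
                waves.append([])
            waves[wave_index].extend(ids[start:start + quota])
    return waves
```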
Monitoring and reporting¶
OTA dashboard¶
- Rollout progress: How many devices updated, pending, failed
- Version distribution: Pie chart — how many devices on which version
- Health metrics: Crash rate, reboot count, connectivity post-update
- Error analysis: Top failure reasons, affected device types, firmware versions
Alerting¶
- Rollout success rate drops below 98% → halt + alert
- Reboot loop detected (3+ restarts in 10 min) → automatic rollback + alert
- Device offline after update for more than 30 min → alert
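The first two alert rules can be sketched as below. Names are hypothetical, timestamps are assumed to be seconds from any monotonic source, and the thresholds are the ones listed above.

```python
from collections import deque


class RebootLoopDetector:
    """Flags a reboot loop: max_restarts or more within window_s seconds."""

    def __init__(self, max_restarts: int = 3, window_s: float = 600) -> None:
        self.max_restarts = max_restarts
        self.window_s = window_s
        self._times = deque()

    def record_restart(self, timestamp_s: float) -> bool:
        """Record a restart and return True if a loop is detected."""
        self._times.append(timestamp_s)
        # Drop restarts that fall outside the sliding window.
        while self._times and timestamp_s - self._times[0] > self.window_s:
            self._times.popleft()
        return len(self._times) >= self.max_restarts


def rollout_should_halt(succeeded: int, failed: int,
                        min_success_rate: float = 0.98) -> bool:
    """True if the success rate of attempted updates is below the minimum."""
    attempted = succeeded + failed
    if attempted == 0:
        return False
    return succeeded / attempted < min_success_rate
```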
Technology stack¶
OTA platforms: Mender, RAUC, SWUpdate, Eclipse hawkBit, Azure IoT Hub, AWS IoT Jobs.
Diff tools: bsdiff, xdelta3, zstd compression.
Security: OpenSSL, wolfSSL, TPM 2.0, PKCS#11, secure boot (U-Boot verified boot).
Monitoring: Grafana, Prometheus, custom OTA analytics dashboard.
Build: Yocto/BitBake, Buildroot, custom CI pipeline (GitHub Actions, GitLab CI).
Frequently asked questions
A/B partition schema: new firmware on partition B, old on A. If the new firmware does not pass the boot test, the device automatically reverts to partition A. No brick, no manual intervention.
Full image: tens to hundreds of MB. Differential update: 5-40% of the full image. For embedded Linux (Yocto) typically 20-100 MB differential vs. 200-500 MB full.
Rollout duration depends on the staging strategy and fleet size. A staged rollout (canary → early adopters → GA) typically takes 1-2 weeks. An emergency patch can go out in hours with accelerated staging.
OTA updates also work over slow or unreliable connections. Differential updates minimise data transfer, and resumable downloads let an interrupted download continue from where it left off. Scheduling prefers Wi-Fi; cellular is used only for critical patches.