OTA Updates

Q: What happens when an OTA update fails?

We use an A/B partition scheme: new firmware is written to partition B while the old firmware stays on A. If the new firmware does not pass the boot test, the device automatically reverts to partition A. No brick, no manual intervention.

Q: How large is a typical OTA update?

Full image: tens to hundreds of MB. Differential update: 5-40% of the full image. For embedded Linux (Yocto) typically 20-100 MB differential vs. 200-500 MB full.

Q: How long does a rollout to the entire fleet take?

Depends on the staging strategy and fleet size. Typically 1-2 weeks for a staged rollout (canary → early adopters → GA). Emergency patch: hours, with accelerated staging.

Q: Do you support updates over mobile networks?

Yes. Differential updates minimise data transfer. Resumable downloads — an interrupted download continues from where it left off. Scheduling prefers Wi-Fi; cellular only for critical patches.

Updates without bricked devices.

Staged rollout, A/B partitioning, differential updates, automatic rollback. Never update the entire fleet at once.


OTA success rate: 99.8%
Update size reduction: 60-80%
Rollback time: <60 s
Zero bricked devices

Why OTA is critical infrastructure

A bad OTA update can brick thousands of devices at once. And unlike a server, you cannot SSH-restart IoT devices in the field. You cannot drive to them in an hour. Some are on rooftops, in tunnels, in shipping containers at sea.

OTA update infrastructure for IoT requires a paranoid approach. Every update is a potential brick. Every rollout is a controlled experiment. Every step has a fallback.

We build OTA systems where no device can be permanently bricked and no rollout proceeds without validation.

A/B Partition Scheme

Principle

The device has two boot slots, Partition A (active) and Partition B (inactive). An update proceeds as follows:

1. Current firmware runs on Partition A
2. OTA update is downloaded and written to Partition B
3. Bootloader switches to Partition B and restarts
4. New firmware runs the boot test (health check)

Success: Partition B becomes active, A is the backup
Failure: Bootloader automatically reverts to Partition A

Result: The device always has working firmware. Even in the worst case (corrupted update, incompatible firmware, hardware regression) it returns to the previous state.
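The switch-and-fallback flow can be sketched as a small state model. This is an illustration only, not a real bootloader: the names `Slot`, `BootState`, `trial_pending`, `install`, `select_boot_slot` and `confirm` are assumptions made for the example, and a real device would keep these flags in the bootloader environment or similar persistent storage.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    name: str
    version: str
    confirmed: bool = False  # set only after a passed boot test


@dataclass
class BootState:
    active: Slot                 # slot holding known-good firmware
    inactive: Slot               # slot that receives the update
    trial_pending: bool = False  # True while the new slot awaits its trial boot

    def install(self, version: str) -> None:
        """Write new firmware to the inactive slot and request one trial boot."""
        self.inactive.version = version
        self.inactive.confirmed = False
        self.trial_pending = True

    def select_boot_slot(self) -> Slot:
        """Bootloader decision: try the new slot once, otherwise boot the known-good slot."""
        if self.trial_pending:
            self.trial_pending = False  # an unconfirmed trial is never retried
            return self.inactive
        return self.active

    def confirm(self) -> None:
        """Called after a passed boot test: the new slot becomes the active one."""
        self.inactive.confirmed = True
        self.active, self.inactive = self.inactive, self.active
```

If `confirm()` is never called, the next call to `select_boot_slot()` returns the old slot, which is the fallback path described above.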

Boot test

After the first boot on the new partition, validation runs:

Hardware check: All sensors and peripherals respond
Connectivity check: Connection to backend successful
Application check: Main application starts and reports healthy
Watchdog: If the system does not confirm health within 120 seconds, the watchdog forces a reboot into the old partition

Only after a successful boot test is the new partition marked as “confirmed”. An unconfirmed partition is rolled back automatically on the next restart.
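A minimal sketch of the boot test, assuming each check is a plain function that returns True or False. The function name `run_boot_test`, the check names and the injectable `clock` are hypothetical. On real hardware the 120-second limit is enforced by a hardware watchdog, so a hung check also triggers the fallback; this sketch only models the decision.

```python
import time
from typing import Callable, Dict

CONFIRM_DEADLINE_S = 120  # watchdog limit stated in the text


def run_boot_test(checks: Dict[str, Callable[[], bool]],
                  deadline_s: float = CONFIRM_DEADLINE_S,
                  clock: Callable[[], float] = time.monotonic) -> bool:
    """Return True only if every check passes before the deadline."""
    start = clock()
    for check in checks.values():
        if clock() - start > deadline_s:
            return False  # too slow: leave the slot unconfirmed
        if not check():
            return False  # failed check: leave the slot unconfirmed
    return clock() - start <= deadline_s


# Usage: confirm the slot only when this returns True.
healthy = run_boot_test({
    "hardware": lambda: True,      # placeholder for sensor and peripheral probes
    "connectivity": lambda: True,  # placeholder for a backend round trip
    "application": lambda: True,   # placeholder for the application health report
})
```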

Staged Rollout

Rollout phases

We never update the entire fleet at once. Staged rollout minimises blast radius:

1. Internal testing (0.1%): Dev and QA team on their own devices. Smoke tests, manual validation. Gate: zero critical bugs.

2. Canary (1-5%): Small group of production devices, typically in one location. Automatic monitoring: crash rate, telemetry health, error rate. Gate: metrics equal to or better than baseline.

3. Early adopters (10-20%): Wider group across locations and hardware variants. Business KPI monitoring: does the update work as expected? Gate: no regression in business metrics.

4. General availability (50% → 100%): Gradual expansion to the entire fleet. Monitoring continues. Halt at any time if an anomaly is detected.
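One common way to assign devices to phases is to hash the device ID into a stable bucket and compare it with the cumulative share of each phase. The sketch below assumes the percentages above are cumulative and uses the upper bound of each range. The names `PHASES`, `bucket` and `phase_for` are hypothetical. It also ignores the location and hardware-variant grouping mentioned above, which a real assignment would have to respect.

```python
import hashlib

# Cumulative fleet share reached at the end of each phase (assumed values).
PHASES = [
    ("internal", 0.001),
    ("canary", 0.05),
    ("early_adopters", 0.20),
    ("ga_50", 0.50),
    ("ga_100", 1.00),
]


def bucket(device_id: str) -> float:
    """Map a device ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    # 48 bits fit exactly in a float, so the result is always below 1.0.
    return int.from_bytes(digest[:6], "big") / 2**48


def phase_for(device_id: str) -> str:
    """Return the first phase whose cumulative share covers the device's bucket."""
    b = bucket(device_id)
    for name, cumulative_share in PHASES:
        if b < cumulative_share:
            return name
    return PHASES[-1][0]
```

Because the bucket depends only on the device ID, a device stays in the same phase across rollouts unless a salt is mixed into the hash.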

Automatic evaluation

An automatic gate sits between consecutive phases:

IF crash_rate > baseline * 1.05 THEN halt_rollout
IF telemetry_gap > 5min THEN halt_rollout  
IF error_rate > 2% THEN rollback_stage
IF business_kpi < threshold THEN halt_and_investigate

The gate is automatic: nobody has to approve a rollout at 3 am. A person steps in only when a rollout is halted (to investigate) or to override the gate.
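The four rules can be written as a small runnable function. This is a sketch under assumptions: the `StageMetrics` fields, the `Action` names and the rule that the first matching condition wins are choices made for the example, while the thresholds are the ones from the rules above.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    HALT_ROLLOUT = "halt_rollout"
    ROLLBACK_STAGE = "rollback_stage"
    HALT_AND_INVESTIGATE = "halt_and_investigate"


@dataclass
class StageMetrics:
    crash_rate: float           # fraction of devices that crashed in this stage
    baseline_crash_rate: float  # same metric on the previous firmware
    telemetry_gap_s: float      # longest telemetry silence, in seconds
    error_rate: float           # fraction, so 0.02 means 2%
    business_kpi: float
    kpi_threshold: float


def evaluate_gate(m: StageMetrics) -> Action:
    """Apply the gate rules in the order listed; the first match decides."""
    if m.crash_rate > m.baseline_crash_rate * 1.05:
        return Action.HALT_ROLLOUT
    if m.telemetry_gap_s > 5 * 60:
        return Action.HALT_ROLLOUT
    if m.error_rate > 0.02:
        return Action.ROLLBACK_STAGE
    if m.business_kpi < m.kpi_threshold:
        return Action.HALT_AND_INVESTIGATE
    return Action.PROCEED
```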

Differential Updates

The problem

Full firmware image for embedded Linux: 200-500 MB. Over a mobile network this takes tens of minutes at best on LTE-M and many hours on NB-IoT, and it costs money. Over satellite it is impractical.
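A rough transfer-time estimate makes the problem concrete. The throughput figures in the usage lines are illustrative assumptions, not measurements: about 1 Mbit/s for an LTE-M class link and about 50 kbit/s for an NB-IoT class link. Real sustained rates depend on coverage and network load.

```python
def transfer_minutes(size_mb: float, throughput_kbps: float) -> float:
    """Minutes needed to move size_mb megabytes at throughput_kbps kilobits per second."""
    bits = size_mb * 8_000_000  # 1 MB taken as 10^6 bytes
    seconds = bits / (throughput_kbps * 1000)
    return seconds / 60


# Assumed link speeds, for illustration only.
lte_m_full = transfer_minutes(500, 1000)  # about 67 minutes
nb_iot_full = transfer_minutes(500, 50)   # about 1333 minutes, roughly 22 hours
lte_m_diff = transfer_minutes(100, 1000)  # about 13 minutes
```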

The solution

A differential update sends only the changes from the current version:

Binary diff (bsdiff/xdelta): Compares the old and new image at the binary level. Typical size reduction: 60-80%, so a 500 MB image yields a 100-200 MB diff.

Mender: Open-source OTA platform for embedded Linux (Yocto, Debian). A/B updates, delta updates, device management. Self-hosted or SaaS.

RAUC: Lightweight update framework for embedded Linux. Bundle-based updates, A/B slot management, custom update handlers. Lower overhead than Mender.

SWUpdate: Flexible update agent. Supports various update strategies, scripted updates, recovery mode.
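To illustrate the idea behind a differential update, here is a deliberately simplified block-level delta. It is not how bsdiff or xdelta work internally, since those tools also find matches at arbitrary byte offsets. The names `make_delta` and `apply_delta` and the 4096-byte block size are assumptions made for the example.

```python
import hashlib
from typing import List, Tuple, Union

BLOCK = 4096  # illustrative block size in bytes

# An operation is ("copy", old_block_index) or ("data", literal_bytes).
Op = Tuple[str, Union[int, bytes]]


def make_delta(old: bytes, new: bytes, block: int = BLOCK) -> List[Op]:
    """Describe `new` as copies of blocks from `old` plus literal data."""
    index = {}
    for i in range(0, len(old), block):
        index.setdefault(hashlib.sha256(old[i:i + block]).digest(), i // block)
    ops: List[Op] = []
    for i in range(0, len(new), block):
        chunk = new[i:i + block]
        hit = index.get(hashlib.sha256(chunk).digest())
        ops.append(("copy", hit) if hit is not None else ("data", chunk))
    return ops


def apply_delta(old: bytes, ops: List[Op], block: int = BLOCK) -> bytes:
    """Rebuild the new image on the device from the old image and the delta."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg * block:(arg + 1) * block] if kind == "copy" else arg
    return bytes(out)
```

Only the `data` operations carry payload over the network. Unchanged blocks cost a few bytes each, which is where the size reduction comes from.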

Resumable downloads

An interrupted download continues from where it left off:

HTTP Range requests for partial download
Checksum per chunk (not just the whole file)
Corrupted chunk re-downloaded, not the entire update
Download runs in the background with no impact on normal operation
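The chunk logic above can be sketched without committing to a particular HTTP client. In this example `fetch` is a hypothetical callable that takes a `Range` header value and returns the bytes of that range, and `done` is a hypothetical store of verified chunks that survives an interruption. Both are assumptions. In real use `fetch` would issue a GET with the `Range` header and expect a 206 Partial Content response.

```python
import hashlib
from typing import Callable, Dict, List


def range_header(start: int, end_inclusive: int) -> str:
    """HTTP Range header value for an inclusive byte range."""
    return f"bytes={start}-{end_inclusive}"


def download_resumable(fetch: Callable[[str], bytes],
                       chunk_hashes: List[str],
                       chunk_size: int,
                       total_size: int,
                       done: Dict[int, bytes],
                       max_retries: int = 3) -> bytes:
    """Fetch only the missing chunks and verify each one against its SHA-256."""
    for idx, expected in enumerate(chunk_hashes):
        if idx in done:
            continue  # verified before the interruption, not fetched again
        start = idx * chunk_size
        end = min(start + chunk_size, total_size) - 1
        for _ in range(max_retries):
            data = fetch(range_header(start, end))
            if hashlib.sha256(data).hexdigest() == expected:
                done[idx] = data
                break  # chunk verified
        else:
            raise IOError(f"chunk {idx} failed verification")
    return b"".join(done[i] for i in range(len(chunk_hashes)))
```

A corrupted chunk costs one extra request for that range only, and calling the function again after an interruption skips everything already stored in `done`.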

Security

Firmware signing

Every firmware image is digitally signed:

Build server creates the image
Signing server (HSM-backed) signs the image using a private key
Device verifies the signature using an embedded public key before installation
Invalid signature = update rejected, alert to management console

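Secure boot chain: Bootloader → kernel → rootfs → application. Each stage verifies the signature of the next before handing over control. A compromised application therefore cannot persistently replace the kernel: a modified kernel fails verification on the next boot.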

Anti-rollback

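The device tracks a monotonic counter. Every firmware has a version number. A firmware with a lower version number than the current one cannot be installed. This protects against downgrade attacks: an attacker cannot install an older version with a known vulnerability. The counter is typically advanced only once a new version has been confirmed, so the automatic fallback to the previous slot keeps working.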
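A minimal model of the counter check follows. The class name `AntiRollback` and its methods are hypothetical. On real hardware the counter lives in fuses or secure storage rather than in a Python object, and the point at which `commit` is called is a design decision the text does not specify.

```python
class AntiRollback:
    """Monotonic version floor: it can be raised but never lowered."""

    def __init__(self, minimum_version: int = 0) -> None:
        self._minimum = minimum_version

    @property
    def minimum_version(self) -> int:
        return self._minimum

    def may_install(self, version: int) -> bool:
        """Reject any firmware older than the current floor."""
        return version >= self._minimum

    def commit(self, version: int) -> None:
        """Raise the floor to `version`; lower values are ignored."""
        self._minimum = max(self._minimum, version)
```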

Encryption

Firmware image encrypted in transit (TLS) and at rest on the device (AES-256). Decryption key in a secure enclave (TPM, TEE). IP protection — firmware cannot be extracted and analysed from a stolen device.

Scheduling and constraints

Not every moment is suitable for an update:

Maintenance window: Updates only outside peak hours (production lines: night shift)
Battery level: At least 30% battery, or connected to power
Connectivity: Wi-Fi preferred. Cellular only for critical patches. No updates over NB-IoT (too slow)
User consent: Consumer devices — notification and confirmation. Industrial devices — automatic during maintenance window
Dependencies: If firmware requires a new backend version, the backend updates first
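These constraints combine naturally into a single eligibility check. The sketch below makes several assumptions: the `DeviceState` and `Update` fields are hypothetical, the maintenance window of 01:00 to 05:00 is an example value, and the window is applied to consumer devices as well as industrial ones.

```python
from dataclasses import dataclass
from datetime import time

WINDOW_START = time(1, 0)  # example maintenance window, local time
WINDOW_END = time(5, 0)


@dataclass
class DeviceState:
    local_time: time
    battery_pct: int
    on_mains: bool
    link: str              # "wifi", "cellular" or "nb-iot"
    consumer_device: bool
    user_consented: bool


@dataclass
class Update:
    critical: bool
    backend_ready: bool    # required backend version is already deployed


def may_update_now(d: DeviceState, u: Update) -> bool:
    """Return True only when every scheduling constraint is satisfied."""
    in_window = WINDOW_START <= d.local_time < WINDOW_END
    power_ok = d.on_mains or d.battery_pct >= 30
    # Wi-Fi always qualifies, cellular only for critical patches, NB-IoT never.
    link_ok = d.link == "wifi" or (d.link == "cellular" and u.critical)
    consent_ok = d.user_consented or not d.consumer_device
    return in_window and power_ok and link_ok and consent_ok and u.backend_ready
```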

Fleet-wide coordination

We never update all devices at the same location simultaneously. Staggered rollout per location — if one location has a problem, the others are not affected.

Monitoring and reporting

OTA dashboard

Rollout progress: How many devices updated, pending, failed
Version distribution: Pie chart — how many devices on which version
Health metrics: Crash rate, reboot count, connectivity post-update
Error analysis: Top failure reasons, affected device types, firmware versions

Alerting

Rollout success rate drops below 98% → halt + alert
Reboot loop detected (3+ restarts in 10 min) → automatic rollback + alert
Device offline after update for more than 30 min → alert
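The reboot-loop rule can be sketched as a sliding-window counter. The class name `RebootLoopDetector` is hypothetical, and the sketch assumes boot timestamps are available in seconds, for example from boot events reported to the backend or from a counter persisted across restarts.

```python
from collections import deque


class RebootLoopDetector:
    """Flag a reboot loop when enough boots fall inside the time window."""

    def __init__(self, max_reboots: int = 3, window_s: float = 600.0) -> None:
        self.max_reboots = max_reboots  # 3 restarts, as in the rule above
        self.window_s = window_s        # 10 minutes
        self._boots = deque()

    def record_boot(self, timestamp_s: float) -> bool:
        """Record a boot and return True if a reboot loop is detected."""
        self._boots.append(timestamp_s)
        # Drop boots that are older than the window.
        while self._boots and timestamp_s - self._boots[0] > self.window_s:
            self._boots.popleft()
        return len(self._boots) >= self.max_reboots
```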

Technology stack

OTA platforms: Mender, RAUC, SWUpdate, Eclipse hawkBit, Azure IoT Hub, AWS IoT Jobs.

Diff tools: bsdiff, xdelta3, zstd compression.

Security: OpenSSL, wolfSSL, TPM 2.0, PKCS#11, secure boot (U-Boot verified boot).

Monitoring: Grafana, Prometheus, custom OTA analytics dashboard.

Build: Yocto/BitBake, Buildroot, custom CI pipeline (GitHub Actions, GitLab CI).

