
Monitoring & Predictive Maintenance

Fix it before it breaks. Not after.

Continuous monitoring, anomaly detection, Remaining Useful Life prediction. Reduce unplanned outages by up to 70%.

-70%
Unplanned outages
-25%
Maintenance costs
+20%
Equipment lifetime
<12 months
ROI

From reactive to predictive maintenance

Most industrial companies run maintenance reactively or on a time-based schedule. Both are suboptimal:

Reactive maintenance: The machine breaks → we fix it. Unplanned outage, rush-ordered spare parts, overtime. The most expensive approach — an unplanned outage costs 3-10× more than a planned one.

Time-based maintenance: Replace the bearing every 6 months, regardless of its actual condition. We replace parts that still work (waste), and failures between intervals still catch us by surprise.

Predictive maintenance: Continuous condition monitoring. Maintenance when data says “there will be a problem in 2 weeks” — not sooner (waste), not later (failure). Optimal timing, minimal outages.

ROI in numbers

Industry consensus (McKinsey, Deloitte):

  • 25-30% reduction in maintenance costs
  • 70-75% reduction in unplanned outages
  • 20-25% extension of equipment lifetime
  • 35-45% reduction in spare parts inventory

Unplanned production line outage: significant cost per hour (depends on industry). One predicted and prevented outage pays for the project.

Condition monitoring

Sensors and measured quantities

Vibration — the most sensitive indicator of mechanical condition:

  • Accelerometers on bearing housings, motors, gearboxes
  • FFT (Fast Fourier Transform) analysis of the frequency spectrum
  • Characteristic frequencies: BPFO, BPFI, BSF, FTF for different bearing defect types
  • Envelope analysis for detection of early-stage defects
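
To make the FFT step concrete, here is a minimal sketch on a simulated signal: a 50 Hz shaft component plus a weaker 160 Hz tone standing in for a bearing defect frequency. Both frequencies and the sample rate are illustrative, not values from a real machine.

```python
import numpy as np

# Simulated vibration signal: 50 Hz shaft component plus a weaker 160 Hz
# "defect" tone (illustrative of a bearing fault frequency) and noise.
fs = 10_000                                   # sample rate in Hz
t = np.arange(0, 1.0, 1 / fs)                 # 1 second of data
rng = np.random.default_rng(0)
signal = (np.sin(2 * np.pi * 50 * t)
          + 0.3 * np.sin(2 * np.pi * 160 * t)
          + 0.1 * rng.standard_normal(t.size))

# Magnitude spectrum via FFT; with 1 s of data the resolution is exactly 1 Hz.
spectrum = np.abs(np.fft.rfft(signal)) / t.size
freqs = np.fft.rfftfreq(t.size, 1 / fs)

# The two strongest spectral lines recover the shaft and defect frequencies.
peaks = sorted(freqs[np.argsort(spectrum)[-2:]].tolist())
print(peaks)   # → [50.0, 160.0]
```

In practice the defect line sits at a bearing-specific frequency (BPFO, BPFI, BSF or FTF) computed from the bearing geometry and shaft speed.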

Temperature:

  • Surface temperature of motors, bearings, transformers
  • Thermal cameras for non-contact measurement
  • Trend: a slow increase above baseline = lubricant degradation, overload, clogged cooler

Electrical quantities:

  • Current and voltage — Motor Current Signature Analysis (MCSA)
  • Phase asymmetry = winding problem
  • Current increase at constant load = mechanical resistance

Other:

  • Acoustic emission — ultrasonic detection of leaks, partial discharge
  • Pressure — hydraulic systems, compressors, filtration systems
  • Flow — cooling circuits, lubrication systems
  • Humidity — transformer oil, insulation

Sensor infrastructure

Retrofit without cabling:

Wireless vibration sensors (ABB, SKF, Fluke) with a 3-5 year battery life. Installation takes minutes — a magnet or adhesive mount on the bearing housing. Communication via BLE, Wi-Fi or LoRaWAN to a gateway → cloud/edge.

Integration with existing PLCs:

Most modern PLCs already collect data from analogue inputs. No need to add sensors — just export the data via OPC-UA. Caveat: PLCs typically sample slowly (~1 Hz), while vibration analysis requires 10-50 kHz sampling, hence the dedicated sensors.

Anomaly detection

Baseline — what is normal

Every machine has its own “fingerprint” of normal operation. A vibration at 120 Hz with an amplitude of 2.5 mm/s is normal for motor X under load Y. A statistical model of normal operation:

  • Training period: 2-4 weeks of data collection in normal operation
  • Feature extraction: Statistical features (mean, std, RMS, kurtosis, crest factor), frequency features (dominant frequencies, spectral entropy)
  • Baseline model: Multivariate normal distribution or autoencoder
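
The feature-extraction step can be sketched as follows. The window of “healthy” data here is synthetic Gaussian noise, and the feature set mirrors the statistical features listed above:

```python
import numpy as np

# Statistical features for one window of vibration samples, matching the
# feature list above. The "healthy" window here is synthetic Gaussian data.
def extract_features(window):
    std = window.std()
    rms = np.sqrt(np.mean(window ** 2))
    return {
        "mean": window.mean(),
        "std": std,
        "rms": rms,
        # Fisher kurtosis: near 0 for Gaussian data, rises with impact-type defects
        "kurtosis": np.mean((window - window.mean()) ** 4) / std ** 4 - 3,
        # Crest factor: peak-to-RMS ratio, typically ~3-4 for a healthy signal
        "crest_factor": np.max(np.abs(window)) / rms,
    }

rng = np.random.default_rng(1)
features = extract_features(rng.normal(0.0, 1.0, 10_000))
```

Each monitored window is condensed into such a feature vector, and the baseline model is then fitted over these vectors rather than over raw samples.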

Anomaly detection methods

Statistical methods (quick to deploy):

  • Z-score per feature. Alert when |z| > 3 (3 sigma).
  • Control charts (Shewhart, CUSUM, EWMA). Industrial standard, well understood.
  • Advantage: interpretable, low false positives, no training required.
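
A minimal sketch of one of these statistical methods, an EWMA control chart. The smoothing factor lam = 0.2 and the 3-sigma limit are conventional textbook defaults rather than tuned values, and the drifting data is simulated:

```python
import numpy as np

# Minimal EWMA control chart for one monitored feature (e.g. vibration RMS
# in mm/s). mu0 and sigma0 come from the baseline period.
def ewma_alarms(x, mu0, sigma0, lam=0.2, L=3.0):
    limit = L * sigma0 * np.sqrt(lam / (2 - lam))  # asymptotic control limit
    z, alarms = mu0, []
    for i, xi in enumerate(x):
        z = lam * xi + (1 - lam) * z               # exponentially weighted mean
        if abs(z - mu0) > limit:
            alarms.append(i)
    return alarms

rng = np.random.default_rng(2)
baseline = rng.normal(2.5, 0.1, 200)                          # normal operation
drift = rng.normal(2.5, 0.1, 100) + np.linspace(0, 0.6, 100)  # slow degradation
alarms = ewma_alarms(np.concatenate([baseline, drift]), mu0=2.5, sigma0=0.1)
# The simulated degradation (indices 200+) trips the chart well before the
# full 0.6 mm/s shift is reached.
```

The smoothing makes the chart sensitive to small sustained shifts, which is exactly the slow-degradation pattern predictive maintenance cares about.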

ML methods (more accurate, more complex):

  • Isolation Forest: Unsupervised, tree-based. Anomalies are points that are “easily isolated”. Fast, low memory.
  • One-class SVM: Learns a boundary around normal data. Everything outside = anomaly.
  • Autoencoder: A neural network compresses and reconstructs the input. High reconstruction error = anomaly. Handles multivariate, non-linear patterns.
  • Temporal models: An LSTM predicts the next timestep. A large deviation between prediction and actual = anomaly.
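
As a sketch of the first method, here is an Isolation Forest trained on two synthetic features per reading (vibration RMS and bearing temperature). The contamination setting and the example readings are illustrative only:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Isolation Forest trained on two synthetic features per reading:
# vibration RMS (mm/s) and bearing temperature (°C). Values are illustrative.
rng = np.random.default_rng(3)
normal = rng.normal([2.5, 60.0], [0.2, 2.0], size=(500, 2))

model = IsolationForest(contamination=0.01, random_state=0).fit(normal)

print(model.predict([[2.6, 61.0]]))   # inlier: close to normal operation
print(model.predict([[4.5, 85.0]]))   # anomaly: hot and vibrating
```

`predict` returns 1 for inliers and -1 for anomalies; the `contamination` parameter sets what fraction of the training data the model treats as borderline.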

Alert management

Not every anomaly is an alarm. Hierarchy:

  1. Info: Deviation detected, trend monitoring activated
  2. Warning: Deviation persists / grows. Schedule an inspection.
  3. Alert: High probability of failure within X days. Schedule maintenance.
  4. Critical: Immediate risk. Consider shutdown.
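
The hierarchy above could be encoded roughly like this; the 2-day and 14-day thresholds are illustrative placeholders, not recommended values:

```python
from typing import Optional

# Illustrative mapping from model outputs to the four alert levels above.
# The 2-day and 14-day thresholds are placeholders, not recommendations.
def alert_level(days_to_failure: Optional[float], deviation_persists: bool) -> str:
    if days_to_failure is not None and days_to_failure <= 2:
        return "critical"   # immediate risk, consider shutdown
    if days_to_failure is not None and days_to_failure <= 14:
        return "alert"      # high failure probability, schedule maintenance
    if deviation_persists:
        return "warning"    # persistent or growing deviation, schedule an inspection
    return "info"           # deviation detected, keep watching the trend
```

In a real deployment the thresholds differ per machine class and are tuned together with the false-positive rate discussed below.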

Alarm fatigue prevention: too many false alarms and operators start ignoring them. Target: <5% false positive rate. Alert tuning is a continuous process.

Remaining Useful Life (RUL) Prediction

Anomaly detection says: “something is wrong.” RUL prediction says: “it will last approximately 14 more days.”

Approaches

Physics-based models: Mathematical model of the degradation process (Paris law for crack propagation, Archard equation for wear). Accurate, but requires deep domain knowledge and specific failure mode understanding.
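
A sketch of the Paris-law idea: numerically integrating crack growth to estimate cycles until a critical crack length. The material constants C and m and the geometry factor Y are illustrative values, not real material data:

```python
import math

# Paris law da/dN = C * (dK)^m with dK = Y * dsigma * sqrt(pi * a).
# Units here: a in metres, dsigma in MPa, dK in MPa*sqrt(m).
# C, m and Y are illustrative values, not real material data.
def cycles_to_failure(a0, a_crit, dsigma, C=1e-11, m=3.0, Y=1.0, steps=10_000):
    """Integrate crack length from a0 to a_crit, summing dN = da / (C * dK^m)."""
    da = (a_crit - a0) / steps
    a, cycles = a0, 0.0
    for _ in range(steps):
        dK = Y * dsigma * math.sqrt(math.pi * a)
        cycles += da / (C * dK ** m)
        a += da
    return cycles

# A higher stress range consumes the remaining life faster:
n_100 = cycles_to_failure(a0=0.001, a_crit=0.025, dsigma=100)
n_150 = cycles_to_failure(a0=0.001, a_crit=0.025, dsigma=150)
```

The same integration run from the currently measured crack length gives a physics-based RUL estimate, which is why these models need the specific failure mode to be well understood.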

Data-driven models:

  1. Training data: Historical run-to-failure data — sensors from installation to failure. The more run-to-failure cycles, the better the model.
  2. Feature engineering: Sliding window statistics, trend features, frequency domain features.
  3. Model: Gradient boosted trees (XGBoost, LightGBM) for tabular data. LSTM/Transformer for sequence data. Survival analysis (Cox regression) for censored data.
  4. Output: Predicted RUL in days/hours + confidence interval.
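
A toy version of steps 1-3, using scikit-learn's gradient boosting as a self-contained stand-in for XGBoost/LightGBM and trained on synthetic run-to-failure histories:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(4)

def run_to_failure(life):
    """Synthetic machine life: a wear indicator rises toward failure at t = life."""
    t = np.arange(life)
    wear = (t / life) ** 2 + rng.normal(0, 0.02, life)  # degradation feature
    rul = life - t                                      # label: cycles remaining
    return np.column_stack([wear, np.gradient(wear)]), rul

# Five historical run-to-failure cycles with different lifetimes.
X_parts, y_parts = zip(*(run_to_failure(life) for life in (300, 400, 500, 350, 450)))
X, y = np.vstack(X_parts), np.concatenate(y_parts)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
# A machine with a high wear reading should get a short predicted RUL.
pred = model.predict([[0.8, 0.004]])[0]
```

Real feature matrices would include many more sliding-window and frequency-domain features, but the structure — rows of features, RUL as the regression target — is the same.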

Hybrid: Physics-informed neural networks — ML model with physical constraints. Best of both worlds.

Without historical failure data

No run-to-failure data? (Most companies don’t.) We start with:

  1. Anomaly detection — catches deviations from normal
  2. Degradation tracking — trends in key indicators
  3. Expert knowledge — maintenance team knows what indicators mean
  4. Gradual accumulation of failure data — each failure = training data for future model

Typically after 1-2 years of operation we accumulate enough data for a supervised RUL model.

IoT dashboards

Grafana as the visualisation platform

  • Fleet overview: All machines on one screen. Green/yellow/red. Drill-down to detail.
  • Machine detail: Real-time sensor data, trends, historical comparison, predictions.
  • Shift report: KPIs per shift — OEE (Overall Equipment Effectiveness), availability, performance, quality.
  • Maintenance view: Open work orders, upcoming predictions, spare parts inventory.

Alerting integration

Alert from monitoring → automatic workflow:

  1. Anomaly detected → alert in Grafana
  2. Work order created in CMMS (SAP PM, Maximo, Fiix)
  3. Assignment to technician (automatic or manual)
  4. Technician diagnoses and repairs
  5. Repair confirmation → work order closed
  6. Feedback to ML model (was the prediction correct?)
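
Steps 1-2 of the workflow could be glued together roughly like this. The payload and work-order field names are invented for illustration; a real SAP PM / Maximo / Fiix integration would use that system's own API:

```python
# Hypothetical glue between alert and CMMS: turning an alert payload into a
# work-order record. All field names here are invented for illustration.
def alert_to_work_order(alert):
    # Map alert severity to a CMMS priority (1 = highest); unknown -> lowest.
    priority = {"critical": 1, "alert": 2, "warning": 3}.get(alert["severity"], 4)
    return {
        "asset_id": alert["machine_id"],
        "priority": priority,
        "description": f"{alert['metric']} anomaly: {alert['message']}",
        "source": "condition-monitoring",  # lets the CMMS trace the order back
    }

work_order = alert_to_work_order({
    "machine_id": "PUMP-07",
    "severity": "alert",
    "metric": "vibration_rms",
    "message": "RMS trending 40% above baseline",
})
```

Tagging the work order with its monitoring source is what makes step 6 possible: closed orders can be joined back to the predictions that triggered them.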

Mobile access

Maintenance team in the field needs data on their phones:

  • Responsive Grafana dashboards
  • Push notifications on new alerts
  • QR code on the machine → opens the machine dashboard on mobile
  • Offline access to runbooks and documentation

Technology stack

Sensors: ABB, SKF, Fluke wireless vibration. Industrial RTD/thermocouple. Current transformers.

Data pipeline: MQTT, OPC-UA, Kafka, InfluxDB, TimescaleDB.

ML: scikit-learn, XGBoost, PyTorch (LSTM/Transformer), MLflow, Kubeflow.

Visualisation: Grafana, custom web dashboards.

Integration: SAP PM, IBM Maximo, Fiix CMMS, custom work order systems.

Edge: Processing on the edge for real-time anomaly detection (see Edge Computing).

Frequently asked questions

What data do we need to get started?

Minimum: vibration, temperature, current/voltage, operating hours. Ideally: historical failure data (what broke, when, under what conditions). The more historical data, the more accurate the predictions. But even without a failure history we can start with anomaly detection.

How accurate are the predictions?

It depends on data quality and quantity. Typically: anomaly detection catches 85-95% of incipient failures; RUL prediction achieves ±15-20% accuracy (i.e. a predicted 14 days → actual failure in 12-17 days). Accuracy improves with more data.

Do we need to install new sensors?

Often not. We use data from existing PLCs and SCADA systems. If key sensors are missing (e.g. vibration on critical bearings), we recommend a retrofit — wireless sensors without any cabling.

How long does implementation take?

A pilot on 5-10 machines: 2-3 months. Scale-out to the entire operation: 6-12 months. Typical payback period: 6-12 months, through fewer unplanned outages and optimised planned maintenance.

Have a project?

Let's talk about it.

Schedule a meeting