
Predictive Maintenance in Automotive Manufacturing

Major automotive manufacturer

  • 2,000 IoT sensors
  • 99.7% prediction accuracy
  • $2.4M annual savings
  • 72h early warning

The client is a major automotive manufacturer with several production plants in Central Europe. Production lines include hundreds of CNC machines, robotic arms, presses, and conveyor systems — more than 2,000 critical components in total. An unplanned outage of a single machine can halt an entire production line at a cost exceeding $50,000 per hour of downtime.

The existing approach to maintenance was either reactive (repair after failure) or preventive (regular intervals regardless of actual machine condition). Both approaches had fundamental drawbacks — reactive maintenance led to unplanned outages, while preventive maintenance wasted resources on servicing machines that did not need it.

Our task was to build a predictive maintenance platform that, based on real-time sensor data, can predict failures with sufficient lead time for a planned intervention.

Challenge

Data volume and velocity

2,000 sensors generate data every second — vibrations, temperature, pressure, energy consumption, acoustic emissions, RPM, and dozens of other parameters. This means:

  • 2 million data points per minute during peak operation
  • Sub-second latency — anomalies must be detected in real time, not in batch processing
  • Historical data for ML — years of historical measurements for model training
  • Edge processing — some data must be processed directly at the plant due to latency and bandwidth constraints
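The headline figures can be sanity-checked with a quick back-of-envelope calculation (the per-sensor rate below is an implication of the stated numbers, not a figure from the project):

```python
# Back-of-envelope check of the stated peak ingest rate.
SENSORS = 2_000
POINTS_PER_MINUTE = 2_000_000  # peak rate, as stated above

# Implied readings per sensor per second at peak (multiple parameters
# per sensor account for the rate being well above 1 Hz):
per_sensor_hz = POINTS_PER_MINUTE / SENSORS / 60
print(f"{per_sensor_hz:.1f} readings per sensor per second")  # 16.7
```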

Heterogeneous environment

A manufacturing plant is not a greenfield. Machines come from dozens of different manufacturers, use different communication protocols, and have varying levels of digitization:

  • Modern CNC machines — OPC-UA, MQTT, rich telemetry interfaces
  • Legacy equipment — serial communication, proprietary protocols, minimal sensor instrumentation
  • Retrofitted machines — aftermarket sensors with custom gateways
  • Different time bases — sampling frequencies from 1 Hz to 10 kHz depending on sensor type
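A common way to tame this heterogeneity is to normalize every source into one canonical reading format at the gateway. The sketch below illustrates the idea; the schema, field names, and node-id convention are assumptions for illustration, not the project's actual data model:

```python
from dataclasses import dataclass

# Hypothetical unified reading schema — every protocol adapter
# (OPC-UA, MQTT, serial) emits this one record type.
@dataclass(frozen=True)
class SensorReading:
    machine_id: str
    sensor_type: str   # e.g. "vibration", "temperature"
    timestamp_ms: int  # epoch milliseconds, normalized across time bases
    value: float
    unit: str

def from_opcua(node_id: str, ts_ms: int, value: float) -> SensorReading:
    # Assumed node-id convention "plant/machine/sensor"; adjust per site.
    plant, machine, sensor = node_id.split("/")
    return SensorReading(machine_id=f"{plant}.{machine}",
                         sensor_type=sensor,
                         timestamp_ms=ts_ms, value=value, unit="si")

r = from_opcua("hall1/cnc42/vibration", 1_700_000_000_000, 0.42)
print(r.machine_id, r.sensor_type)  # hall1.cnc42 vibration
```

Downstream components (stream processing, ML, alerting) then never need to know which protocol a reading came from.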

Defining “normal”

Every machine has different operating characteristics. What constitutes normal vibration for a press would be an alarm condition on a precision CNC machining center. Moreover, “normal” changes depending on:

  • Production program — different product = different machine load
  • Ambient temperature — seasonal variations affect cooling systems
  • Tool age — wear gradually and legitimately changes the vibration profile
  • Shift operation — different operators, different settings

Solution

IoT infrastructure

We designed a three-tier IoT architecture:

Edge layer — industrial gateways in each production hall process raw sensor data. Edge computing performs the first level of filtering, aggregation, and detection of obvious anomalies (absolute threshold violations). Critical alerts are sent immediately.
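The first-level edge logic can be sketched as a window aggregation plus an absolute threshold check — forward one aggregate instead of raw samples, but surface any outright limit violation immediately. Window size and limit below are illustrative, not the project's configured values:

```python
# First-level edge filtering: reduce a window of raw samples to one
# aggregate, and collect any absolute-threshold violations for
# immediate alerting.
def edge_process(samples, limit):
    """samples: raw values from one window; returns (aggregate, alerts)."""
    alerts = [v for v in samples if v > limit]
    aggregate = sum(samples) / len(samples)
    return aggregate, alerts

agg, alerts = edge_process([70.0, 70.0, 96.0, 70.0], limit=90.0)
print(agg, alerts)  # 76.5 [96.0]
```

This keeps bandwidth to the cloud proportional to the window rate rather than the raw sampling rate, while critical alerts still bypass the aggregation path.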

Transport layer — Apache Kafka serves as the backbone for data transport from edge to cloud. Kafka guarantees reliable data delivery even during connectivity outages, automatic scaling during peaks, and the ability to replay historical data.
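One detail worth noting: Kafka guarantees ordering only within a partition, so per-machine ordering is preserved by keying every message with the machine ID. Kafka's default partitioner uses murmur2 hashing; the md5-based sketch below only illustrates the hash-and-modulo principle:

```python
import hashlib

# Key-based partitioning: all readings from one machine hash to the
# same partition, so their relative order is preserved end to end.
def partition_for(machine_id: str, num_partitions: int) -> int:
    digest = hashlib.md5(machine_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Deterministic: the same machine always lands on the same partition.
print(partition_for("cnc42", 12) == partition_for("cnc42", 12))  # True
```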

Cloud layer — Apache Flink processes data streams in real time, performing complex windowed aggregations, cross-sensor correlations, and ML model evaluation. Results are stored in TimescaleDB for historical analysis and visualization.
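The windowed aggregations the Flink job performs can be illustrated in miniature: bucket readings into fixed (tumbling) windows per machine and reduce each bucket. The 60-second window and tuple layout are assumptions for the sketch:

```python
from collections import defaultdict

# Tumbling-window mean per machine, in the spirit of the Flink job.
def tumbling_means(readings, window_s=60):
    """readings: iterable of (machine_id, epoch_seconds, value)."""
    buckets = defaultdict(list)
    for machine, ts, value in readings:
        buckets[(machine, ts // window_s)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

out = tumbling_means([("cnc42", 0, 1.0), ("cnc42", 30, 3.0),
                      ("cnc42", 61, 5.0)])
print(out)  # {('cnc42', 0): 2.0, ('cnc42', 1): 5.0}
```

In production, Flink additionally handles event-time semantics, late data, and state checkpointing, which this sketch omits.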

ML models for anomaly detection

We developed a suite of specialized ML models for different failure types:

  • Autoencoder for vibration analysis — a neural network trained on normal operating patterns reconstructs the input signal. High reconstruction error indicates an anomaly. The model catches subtle changes in the vibration spectrum that a human operator would miss.

  • LSTM for degradation prediction — a recurrent network tracks the trend of key parameters over time and predicts the Remaining Useful Life (RUL) of components. RUL prediction accuracy: ±12 hours for critical components.

  • Isolation Forest for multivariate anomalies — detection of unusual parameter combinations that are individually within normal range but together indicate a problem.

  • Correlation models — identification of cascading failures where a problem on one machine affects downstream equipment.

Models are continuously retrained on new data with automatic A/B testing of new versions.
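To make the RUL concept concrete: in its simplest form, the idea is to fit a trend to a degradation indicator and extrapolate to a failure threshold. The production system uses an LSTM; the linear-trend sketch below, with invented data and threshold, only illustrates the principle:

```python
# Simplified RUL illustration: least-squares slope of a wear indicator,
# extrapolated to a failure threshold. Not the production LSTM model.
def rul_hours(hours, wear, failure_at):
    n = len(hours)
    mh, mw = sum(hours) / n, sum(wear) / n
    slope = (sum((h - mh) * (w - mw) for h, w in zip(hours, wear))
             / sum((h - mh) ** 2 for h in hours))
    if slope <= 0:
        return float("inf")   # no degradation trend observed
    return max((failure_at - wear[-1]) / slope, 0.0)

# Wear grows ~0.5 units/hour; failure threshold at 100 units:
print(rul_hours([0, 10, 20, 30], [10, 15, 20, 25], failure_at=100))  # 150.0
```

The LSTM improves on this by capturing non-linear degradation and interactions between parameters, but the output contract is the same: hours remaining until predicted failure.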

Alerting and workflow

The system categorizes detected anomalies into three levels:

  • Informational — deviation from normal, monitor the trend. Displayed in the dashboard, no immediate action required.
  • Warning — significant anomaly, inspection recommended during the next planned shutdown. Notification to the maintenance lead.
  • Critical — failure predicted within 72 hours. Automatic work order creation in the CMMS, escalation to the shift supervisor.

Each alert includes: machine and component identification, anomaly visualization with historical context, prediction confidence score, recommended intervention, and estimated remaining useful life.
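The triage logic can be sketched as a function of predicted time-to-failure and model confidence. The cutoff values below are illustrative assumptions (only the 72-hour critical horizon comes from the system's actual behavior as described above):

```python
# Three-level alert triage. ttf_hours: predicted time to failure;
# confidence: model score in [0, 1]. Cutoffs other than the 72 h
# critical horizon are illustrative.
def severity(ttf_hours, confidence, min_confidence=0.8):
    if confidence < min_confidence:
        return "informational"  # low-confidence signal: watch the trend
    if ttf_hours <= 72:
        return "critical"       # auto work order + escalation
    return "warning"            # inspect at next planned shutdown

print(severity(48, 0.95))   # critical
print(severity(200, 0.90))  # warning
print(severity(48, 0.50))   # informational
```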

Visualization and reporting

Grafana dashboards provide a real-time overview:

  • Plant overview — health score of each machine on a production hall map
  • Machine detail — live telemetry data, historical trend, RUL prediction
  • Maintenance overview — planned and predicted interventions, maintenance team utilization
  • Management reporting — KPIs: OEE, MTBF, MTTR, savings compared to reactive maintenance
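The management KPIs follow their standard definitions — MTBF is operating time divided by failure count, MTTR is total repair time divided by repair count, and OEE is the product of availability, performance, and quality. The numbers in the sketch below are invented to show the arithmetic:

```python
# Standard maintenance KPI formulas; input figures are illustrative.
def mtbf(uptime_hours, failures):
    return uptime_hours / failures        # mean time between failures

def mttr(repair_hours, repairs):
    return repair_hours / repairs         # mean time to repair

def oee(availability, performance, quality):
    return availability * performance * quality

print(mtbf(4_380, 6))                    # 730.0
print(mttr(18, 6))                       # 3.0
print(round(oee(0.92, 0.95, 0.99), 3))   # 0.865
```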

Results

30% reduction in unplanned downtime

In its first year of operation, the system predicted 87% of failures with sufficient lead time for planned intervention. Unplanned downtime decreased by 30%, and mean time to repair (MTTR) dropped by 40% thanks to better preparation of spare parts and personnel.

$2.4M annual savings

The combination of reduced unplanned outages, optimized preventive maintenance, and extended component lifespan delivered annual savings of $2.4M. The largest contribution came from reduced production losses during unplanned shutdowns.

99.7% prediction accuracy

The model achieves 99.7% accuracy in anomaly detection with a false positive rate under 2%. This means the maintenance team trusts the system’s alerts and responds to them without unnecessary verification steps.
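Accuracy and false positive rate measure different things, which is why both are reported. The sketch below shows how they fall out of a confusion matrix; the counts are invented to reproduce figures of the reported magnitude, not real project data:

```python
# Confusion-matrix arithmetic behind the two reported metrics.
def rates(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # correct / all predictions
    fpr = fp / (fp + tn)                        # false alarms / all normals
    return accuracy, fpr

acc, fpr = rates(tp=170, fp=20, tn=9_800, fn=10)
print(acc, round(fpr, 4))  # 0.997 0.002
```

A low FPR is what earns the maintenance team's trust in practice: even with high accuracy, a noisy alarm rate would force them back into manual verification.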

72-hour early warning

The average time between anomaly detection and actual failure is 72 hours. This provides sufficient time to order spare parts, schedule downtime during a less loaded shift, and prepare the maintenance team.

Technologies

Python, Apache Kafka, Apache Flink, TensorFlow, Azure IoT Hub, TimescaleDB, Grafana, Kubernetes
