This project compares unsupervised anomaly detection approaches for identifying abnormal patterns in bearing vibration sensor data. It benchmarks six PyCaret-based [1] anomaly detection models against a custom Bidirectional LSTM (BiLSTM) autoencoder implemented in PyTorch, using the IMS Bearing dataset derived from the NASA Acoustics and Vibration Database.
The study focuses on whether low-code anomaly detection methods can achieve performance comparable to a custom deep learning approach while reducing implementation complexity and development effort.
Industrial systems generate continuous streams of sensor time-series data, and early detection of anomalous behaviour is important for predictive maintenance, fault diagnosis, and operational reliability. In this setting, anomaly detection is challenging because labelled anomalies are often unavailable, contamination levels may be unknown, and model performance can vary across unseen datasets.
Several prior studies have reported strong results on bearing fault detection using deep learning and semi-supervised approaches [2][3][4][5][7]. However, these approaches often rely on selected train subsets or are evaluated on limited portions of the available data. This makes it difficult to assess how well they generalise to unseen datasets or whether lower-complexity methods could provide comparable performance.
The aim of this study is to compare classical unsupervised anomaly detection models with a custom BiLSTM autoencoder on the same bearing sensor datasets. In particular, the project evaluates whether PyCaret offers similar or better performance than a neural-network-based approach while optimising resources and reducing coding effort.
Model performance is assessed using cluster quality metrics and non-parametric statistical comparison, including the Friedman test and Friedman-Conover post hoc analysis [6].
The data were sourced from Kaggle and comprise three datasets of vibration sensor readings from the NASA Acoustics and Vibration Database. The datasets contain text files with 1-second vibration signal snapshots (20,480 data points) recorded at 5- and 10-minute intervals at a sampling rate of 20 kHz.
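Each snapshot file must be reduced to a per-file feature row before modelling. The exact aggregation used to build the averaged dataframes (`avg_df2`, `avg_df3`) is not spelled out here, so the following is a minimal sketch under the assumption that each channel is summarised by its mean absolute amplitude; the channel names and random signals are illustrative only.

```python
import numpy as np
import pandas as pd

def summarise_snapshot(signal):
    """Reduce one 1-second, 20,480-point snapshot to a single feature.
    Mean absolute amplitude is one common choice; RMS is another."""
    return float(np.abs(signal).mean())

# Hypothetical layout: one column per bearing channel, one row per snapshot file.
rng = np.random.default_rng(0)
snapshots = {f"bearing_{i}": rng.normal(size=20480) for i in range(4)}
row = {name: summarise_snapshot(sig) for name, sig in snapshots.items()}
avg_df = pd.DataFrame([row])  # stacking such rows over files yields e.g. avg_df2 / avg_df3
```

Stacking one such row per snapshot file produces a compact time-indexed dataframe suitable for the detectors below.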
This project evaluates the following unsupervised anomaly detection models available through PyCaret:
| ID | Name | Reference |
|---|---|---|
| cluster | Clustering-Based Local Outlier | pyod.models.cblof.CBLOF |
| iforest | Isolation Forest | pyod.models.iforest.IForest |
| histogram | Histogram-based Outlier Detection | pyod.models.hbos.HBOS |
| knn | K-Nearest Neighbours Detector | pyod.models.knn.KNN |
| svm | One-Class SVM Detector | pyod.models.ocsvm.OCSVM |
| mcd | Minimum Covariance Determinant | pyod.models.mcd.MCD |
PyCaret is a high-performance, open-source low-code machine learning library that supports anomaly detection and automates key parts of the ML workflow [1].
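As a sketch of the detection step, the snippet below uses scikit-learn's `IsolationForest` directly on synthetic data with injected outliers; PyCaret's `iforest` model wraps the equivalent pyod estimator, and its low-code workflow is indicated in the comments (the data, column names, and contamination fraction are illustrative).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 4)), columns=["b1", "b2", "b3", "b4"])
df.iloc[::100] += 6.0  # inject ten obvious outliers

# The equivalent low-code PyCaret workflow is roughly:
#   from pycaret.anomaly import setup, create_model, assign_model
#   setup(data=df)
#   model = create_model('iforest', fraction=0.01)
#   labelled = assign_model(model)   # adds an 'Anomaly' label column
clf = IsolationForest(contamination=0.01, random_state=0).fit(df)
df["Anomaly"] = (clf.predict(df) == -1).astype(int)  # 1 = anomaly, 0 = normal
```

The `contamination` / `fraction` parameter encodes the expected share of anomalies, which in practice is unknown and must be treated as a tuning choice.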
A custom Bidirectional LSTM autoencoder was implemented in PyTorch to encode and reconstruct the vibration signal input. All experiments were run for 50 epochs with a learning rate of 2e-4 and a batch size of 32. The architecture used one BiLSTM layer with 32 hidden units and a dropout of 0.1.
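A minimal PyTorch sketch of the architecture described above is given below; the sequence length and feature count are illustrative, and the linear reconstruction head is an assumption, as the report does not detail the decoder.

```python
import torch
import torch.nn as nn

class BiLSTMAutoencoder(nn.Module):
    """One BiLSTM layer (32 hidden units) with dropout 0.1, followed by a
    linear head that reconstructs the input at every time step."""
    def __init__(self, n_features, hidden=32, dropout=0.1):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.drop = nn.Dropout(dropout)
        self.decoder = nn.Linear(2 * hidden, n_features)

    def forward(self, x):
        z, _ = self.encoder(x)              # (batch, seq_len, 2 * hidden)
        return self.decoder(self.drop(z))

model = BiLSTMAutoencoder(n_features=4)
optimiser = torch.optim.AdamW(model.parameters(), lr=2e-4)  # Adam for Exp-01/02
loss_fn = nn.HuberLoss()  # Exp-02/04; nn.L1Loss() gives the MAE runs
x = torch.randn(32, 10, 4)  # one illustrative batch (batch size 32)
loss = loss_fn(model(x), x)
loss.backward()
optimiser.step()
```

At inference time, the per-window reconstruction error is thresholded to flag anomalies, since poorly reconstructed windows are those unlike the training data.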
| Exp | Model | Loss | Optim |
|---|---|---|---|
| 1 | bilstm | mae_loss | adam |
| 2 | bilstm | huber_loss | adam |
| 3 | bilstm | mae_loss | adamw |
| 4 | bilstm | huber_loss | adamw |

The models were compared using cluster quality metrics and non-parametric statistical testing.
Cluster metrics:
- Silhouette Score measures how well-separated the clusters are. Values range from -1 to 1, with higher values indicating better-defined clusters.
- Calinski-Harabasz Index measures between-cluster separation relative to within-cluster dispersion. Higher values indicate better-defined clusters.
- Davies-Bouldin Index measures similarity between each cluster and its most similar cluster. Lower values indicate better separation.
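All three metrics are available in scikit-learn; a minimal sketch on synthetic data with a clearly separated "anomalous" group (the data and labels are illustrative):

```python
import numpy as np
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

# Two illustrative groups: "normal" points and a shifted "anomalous" cluster.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(180, 2)), rng.normal(6, 1, size=(20, 2))])
labels = np.array([0] * 180 + [1] * 20)  # as produced by an anomaly detector

sil = silhouette_score(X, labels)        # in [-1, 1], higher is better
ch = calinski_harabasz_score(X, labels)  # higher is better
db = davies_bouldin_score(X, labels)     # lower is better
```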
Because these metrics can favour convex cluster structures, model performance was also compared using non-parametric statistical analysis, specifically the Friedman test and Friedman-Conover post hoc test [6].
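The Friedman test is available in SciPy; a sketch on synthetic per-block model scores (the data layout and the systematic shift are illustrative). The Conover post hoc step is noted in the comments, as it lives in a separate package.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Rows are blocks (e.g. time windows); columns are per-model anomaly scores.
rng = np.random.default_rng(1)
scores = rng.random((50, 4))
scores[:, 3] += 0.5  # make one hypothetical model systematically different

stat, p = friedmanchisquare(*scores.T)
reject_h0 = p < 0.05  # 95% confidence level, as used in this study
# The Conover post hoc pairwise comparison is then available via
# scikit_posthocs.posthoc_conover_friedman(scores), if that package is installed.
```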
Table 3 and Figures 2-5 show the anomalies detected by each of the selected PyCaret and BiLSTM models on the training dataset and the independent test dataset.
Table 3: Anomalies detected by model for the training and test datasets.
| Model | Anomalies - training dataset | Anomalies - test dataset |
|---|---|---|
| Cluster | 50 | - |
| Histogram | 50 | - |
| iforest | 50 | 187 |
| KNN | 50 | - |
| MCD | 50 | 78 |
| SVM | 50 | 95 |
| Exp-01 | 99 | 190 |
| Exp-02 | 98 | 191 |
| Exp-03 | 99 | 190 |
| Exp-04 | 104 | 191 |
Cluster Metrics (training dataset)
The models that obtained the highest Calinski-Harabasz index and the lowest Davies-Bouldin index were SVM, MCD, and Histogram. These models, together with IForest, also showed the highest Silhouette scores (Table 4).
Non-parametric Comparison
- From the non-parametric Conover-Friedman test, we found a significant difference in anomaly detection performance among the models at the 95% confidence level.
- For the training dataset, there is no significant difference in performance among the PyCaret models. However, all BiLSTM experiments differed significantly from all PyCaret models (Table 6, Figure 6).
- The BiLSTM experiments were not significantly different from each other, as shown by the Friedman-Conover results and the critical difference diagram.
- Exp-04 ranked highest but was not significantly different from the other BiLSTM experiments. The anomaly could have been detected with a 22:00 hr lead time using Exp-02 and Exp-04, while among the PyCaret models, the Clustering-Based Local Outlier (cluster) model could have detected the anomalies 14:40 hr in advance.
Cluster Metrics (test dataset)
MCD and SVM obtained the highest Silhouette and Calinski-Harabasz scores and the lowest Davies-Bouldin indexes (Table 5).
Non-parametric Comparison
- From the non-parametric test, we can reject, at the 95% confidence level, the null hypothesis that all models perform equally at detecting anomalies.
- The Conover-Friedman test and critical difference diagram showed no statistical difference among experiments Exp-01 to Exp-04 and iForest at the 95% confidence level (Table 6, Figure 7).
- Similarly, MCD and SVM showed no significant difference from each other, but both differed significantly from all other models.
- Experiments Exp-02 and Exp-04, which minimised the Huber loss, ranked highest of all models.
- Any of the BiLSTM models could have detected the anomaly with a lead time of at least 27:00 hr (Exp-03 within 27:20 hr), while iForest detected the anomalies with a lead time of 25:30 hr.
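The lead times quoted above are simple timestamp differences between the failure and each model's first flagged anomaly; a sketch with hypothetical timestamps (the values below are illustrative, chosen only to reproduce the stated lead times, not taken from the dataset):

```python
import pandas as pd

# Hypothetical timestamps: the bearing failure time and each model's first
# flagged anomaly (illustrative values, not from the dataset).
failure = pd.Timestamp("2004-04-18 02:42:00")
first_flag = {
    "Exp-03": pd.Timestamp("2004-04-16 23:22:00"),
    "iforest": pd.Timestamp("2004-04-17 01:12:00"),
}
lead_time = {model: failure - t for model, t in first_flag.items()}
# lead_time["Exp-03"] is 27 h 20 min; lead_time["iforest"] is 25 h 30 min
```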
Train dataset - Dataset 2 (avg_df2)
Figure 2. Anomalies detected by the PyCaret models on the training dataset.
Test dataset - Dataset 3 (avg_df3)
Figure 3. Anomalies detected by the PyCaret models on the test dataset.
Train dataset - Dataset 2 (avg_df2)
Figure 4. Anomalies distribution detected on the training dataset. The experimental setup is outlined in Table 2. A. Exp-01, B.Exp-02, C. Exp-03, D.Exp-04
Test dataset - Dataset 3 (avg_df3)
Figure 5. Anomalies distribution detected on the test dataset. The experimental setup is outlined in Table 2. A. Exp-01, B.Exp-02, C. Exp-03, D.Exp-04.
Table 4: Cluster metrics on the training dataset.
| index | silhouette | calinski_harabasz | davies_bouldin |
|---|---|---|---|
| cluster | 0.7762 | 646.9312 | 0.8175 |
| histogram | 0.8124 | 1001.4351 | 0.6754 |
| iforest | 0.8124 | 992.9829 | 0.6791 |
| knn | 0.8017 | 897.689 | 0.7167 |
| mcd | 0.8124 | 1004.5345 | 0.6739 |
| svm | 0.814 | 1011.3661 | 0.6723 |
| exp1 | 0.7442 | 879.5458 | 0.7615 |
| exp2 | 0.747 | 892.5695 | 0.7562 |
| exp3 | 0.7442 | 879.5458 | 0.7615 |
| exp4 | 0.7386 | 875.7697 | 0.764 |
Table 5: Cluster metrics on the test dataset.
| index | silhouette | calinski_harabasz | davies_bouldin |
|---|---|---|---|
| iforest | 0.923 | 7629.4839 | 0.7218 |
| mcd | 0.9557 | 15873.0144 | 0.3554 |
| svm | 0.9533 | 16564.1182 | 0.3939 |
| exp1 | 0.9263 | 8549.185 | 0.6857 |
| exp2 | 0.9256 | 8389.7269 | 0.692 |
| exp3 | 0.9267 | 8656.1093 | 0.6818 |
| exp4 | 0.9256 | 8433.0381 | 0.6902 |
Table 6: Model performance rankings for the training and test datasets.
| Model | Training Ranks | Test Ranks |
|---|---|---|
| Exp-01 | 0.5647 | 0.5738 |
| Exp-02 | 0.5642 | 0.5739 |
| Exp-03 | 0.5647 | 0.5738 |
| Exp-04 | 0.5673 | 0.5739 |
| Cluster | 0.5398 | - |
| Histogram | 0.5398 | - |
| iForest | 0.5398 | 0.5735 |
| KNN | 0.5398 | - |
| MCD | 0.5398 | 0.5663 |
| SVM | 0.5398 | 0.5649 |
- Friedman chi-square non-parametric statistical test: p-value = 8.85e-75; therefore, H0 is rejected.
- Post hoc: Friedman-Conover pairwise comparison
- Critical difference diagram
Figure 6. Conover-Friedman post hoc comparison and critical difference diagram for the training dataset, by model.
- Non-parametric test: Friedman chi-square p-value = 6.51e-19; therefore, H0 is rejected.
- Post hoc: Friedman-Conover pairwise comparison
- Critical difference diagram
Figure 7. Conover-Friedman post hoc comparison and critical difference diagram for the unseen test dataset, by model.
In summary, Exp-04 consistently obtained the best performance on both datasets. The test dataset was almost six times larger than the training set and presented spikes at the beginning and middle of the recording. The size and quality of the data therefore influence which model is best suited to detecting failures. The Histogram, Cluster, and KNN models were excluded from the test comparison because they flagged more than 50% of the test dataset as anomalous, which is inaccurate according to the data visualisations. These models proved less robust on unseen data.
Conversely, the iForest model led in detecting anomalies together with Exp-01 and Exp-04, with no significant difference in their performance on the unseen test dataset. It proved more robust to changes, independently of dataset size and unknown contamination.
We can conclude that the PyCaret anomaly detection models selected in this work and the BiLSTM (Bidirectional LSTM) autoencoder can detect failures in the bearing sensor signals at the same performance level. Whether these models can detect failures days or weeks in advance on other unseen data requires further testing and optimisation.
[1] PyCaret, an open-source, low-code machine learning library in Python. https://pycaret.org
[2] https://towardsdatascience.com/lstm-autoencoder-for-anomaly-detection-e1f4f2ee7ccf
[3] https://towardsdatascience.com/machine-learning-for-anomaly-detection-and-condition-monitoring-d4614e7de770
[4] https://sabtainahmadml.medium.com/condition-monitoring-through-diagnosis-of-anomalies-lstm-based-unsupervised-ml-approach-5f0565735dff
[5] Zhang, R.; Peng, Z.; Wu, L.; Yao, B.; Guan, Y. Fault Diagnosis from Raw Sensor Data Using Deep Neural Networks Considering Temporal Coherence. Sensors 2017, 17, 549. https://www.mdpi.com/1424-8220/17/3/549
[6] Goldstein, M. and Uchida, S. (2016) ‘A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data’, PLOS ONE, 11(4). doi:10.1371/journal.pone.0152173.
[7] K. Choi, J. Yi, C. Park and S. Yoon, "Deep Learning for Anomaly Detection in Time-Series Data: Review, Analysis, and Guidelines," in IEEE Access, vol. 9, pp. 120043-120065, 2021, doi: 10.1109/ACCESS.2021.3107975.