[ENH] Implement outlier detection based on probabilistic regressors #777
arnavk23 wants to merge 23 commits into sktime:main
Implements three types of outlier detectors as requested in issue sktime#390:

1. QuantileOutlierDetector - Detects outliers based on predictive quantile extremity. Samples falling outside the expected quantile range are flagged as outliers.
2. DensityOutlierDetector - Detects outliers based on probability density. Samples with low density (high negative log-likelihood) are flagged as outliers.
3. LossOutlierDetector - Detects outliers based on predictive loss. Supports multiple loss functions: log_loss, CRPS, interval_score, and custom losses.

Key features:
- PyOD-compatible interface with fit(), predict(), and decision_function() methods
- Works with any skpro probabilistic regressor
- Configurable contamination parameter for threshold determination
- Comprehensive test suite
- Example demonstrating usage with various regressors

Resolves sktime#390
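As a rough illustration of the first two scoring ideas, the sketch below computes a quantile-extremity score and a negative log-density score with numpy and scipy. It assumes a Normal predictive distribution purely for illustration; the function names `quantile_score` and `density_score` are hypothetical and are not the PR's API.

```python
import numpy as np
from scipy.stats import norm


def quantile_score(y, lower, upper):
    """Distance outside [lower, upper], scaled by the interval width; 0 inside."""
    y, lower, upper = (np.asarray(a, dtype=float) for a in (y, lower, upper))
    width = np.maximum(upper - lower, 1e-12)  # guard against zero-width intervals
    below = np.maximum(lower - y, 0.0)  # positive only when y is under the lower bound
    above = np.maximum(y - upper, 0.0)  # positive only when y is over the upper bound
    return (below + above) / width


def density_score(y, mean, std):
    """Negative log-density under an assumed Normal predictive distribution."""
    return -norm.logpdf(np.asarray(y, dtype=float), loc=mean, scale=std)


# Toy data: the third observation sits far in the tail of N(0, 1).
y = np.array([0.0, 0.5, 4.0])
mean, std = np.zeros(3), np.ones(3)
lower = norm.ppf(0.05, loc=mean, scale=std)
upper = norm.ppf(0.95, loc=mean, scale=std)

q_scores = quantile_score(y, lower, upper)
d_scores = density_score(y, mean, std)
```

In both cases a higher score means "more outlying", which matches the usual PyOD convention for decision scores.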
9870c21 to 9cf795d
@arnavk23 - Pretty large PR, nothing jumps out at me after a quick scan. I'll trigger the LLM review and come back to it when I have some bandwidth (maybe next week).
Pull request overview
Adds a new skpro.outlier module that implements outlier/anomaly detection by “reducing” the task to probabilistic regression, exposing a PyOD-like interface (fit, decision_function, predict).
Changes:
- Introduces `BaseOutlierDetector` plus three detector implementations: quantile-, density-, and loss-based.
- Adds documentation/API reference entries and a runnable example script.
- Adds a new test suite covering detectors’ core interface and key options.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| skpro/outlier/base.py | Adds shared PyOD-like base class and thresholding logic. |
| skpro/outlier/_quantile.py | Implements quantile-interval based scoring logic. |
| skpro/outlier/_density.py | Implements likelihood / negative-log-likelihood based scoring. |
| skpro/outlier/_loss.py | Implements loss-based scoring incl. log-loss, CRPS, interval score, custom loss. |
| skpro/outlier/__init__.py | Exposes new detectors via package exports. |
| skpro/outlier/tests/test_outliers.py | Adds unit/integration tests for the new detectors. |
| skpro/outlier/tests/__init__.py | Initializes tests package. |
| docs/source/api_reference/outlier.rst | Documents new outlier module and classes in API reference. |
| docs/source/api_reference.rst | Links outlier API reference into master index. |
| examples/outlier_detection_example.py | Adds end-to-end usage example + optional visualization. |
| .all-contributorsrc | Adds contributor entry for the PR author. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- keep fallback CRPS sample/output structure intact and reduce per sample
- normalize callable loss outputs to one score per sample, with clear shape errors for invalid returns
- make BaseOutlierDetector y normalization consistent with y_inner_mtype by using DataFrame internally in fit and decision_function
- clarify base outlier detector docs: supervised by default, y=None only for pre-fitted regressors with X-only scoring implementations
- add regression tests for CRPS fallback shapes, callable loss reduction and validation, and canonical DataFrame y normalization
- fix malformed JSON in .all-contributorsrc and remove an unused import from the outlier detection example
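The "one score per sample" normalization for callable losses could look roughly like the sketch below. The helper name `reduce_loss_to_samples` is hypothetical, and averaging over output dimensions is an assumption; the commit does not state which reduction is used.

```python
import numpy as np


def reduce_loss_to_samples(values, n_samples):
    """Return one score per sample, or raise if the shape cannot be aligned."""
    arr = np.asarray(values, dtype=float)
    if arr.ndim == 0:
        raise ValueError(
            "custom loss returned a scalar; expected one value per sample"
        )
    if arr.shape[0] != n_samples:
        raise ValueError(
            f"custom loss returned leading dimension {arr.shape[0]}, "
            f"expected {n_samples}"
        )
    if arr.ndim == 1:
        return arr
    # Assumed reduction: average over output (and any further) dimensions.
    return arr.reshape(n_samples, -1).mean(axis=1)


# A multi-output loss of shape (4, 2) collapses to shape (4,).
per_sample = reduce_loss_to_samples(np.array([[1, 3], [2, 2], [0, 4], [5, 5]]), 4)
```

Raising early on a scalar or misaligned return keeps a badly shaped custom loss from being silently broadcast against the threshold.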
Formats skpro/outlier/_quantile.py to match the repository black hook output so pre-commit and CI pass cleanly.
…com/arnavk23/skpro into fix/issue-390-outlier-detection
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (1)
docs/source/api_reference.rst:24
- The toctree lists `api_reference/tags` twice. This can lead to duplicate entries / Sphinx warnings and is likely unintended; remove the duplicate (keep a single `api_reference/tags` entry).

    .. toctree::
        :maxdepth: 1

        api_reference/tags
        api_reference/regression
        api_reference/survival
        api_reference/outlier
        api_reference/distributions
        api_reference/metrics
        api_reference/tags
        api_reference/base
Replaced brittle substring-based lower/upper extraction in `_compute_interval_score` with a dedicated helper `_extract_interval_bounds`.

`_extract_interval_bounds` now:
- Prefers MultiIndex selection via `.xs("lower", level=-1, axis=1)` and `.xs("upper", level=-1, axis=1)`.
- Falls back to old formats only when needed (substring columns, then split-half fallback).
- Supports array-like interval outputs for 2D/3D layouts.

Added explicit shape validation in `_compute_interval_score` to ensure lower/upper align with y_true sample/output dimensions; raises a clear ValueError on mismatch.
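A simplified sketch of the extraction strategy described above is shown here. It covers only the MultiIndex path and the split-half fallback for 2D input; the substring-column fallback and 3D layouts are omitted. The function names are hypothetical stand-ins, not the PR's private helpers.

```python
import numpy as np
import pandas as pd


def extract_interval_bounds(pred_int):
    """Return (lower, upper) as float arrays of shape (n_samples, n_outputs)."""
    if isinstance(pred_int, pd.DataFrame) and isinstance(
        pred_int.columns, pd.MultiIndex
    ):
        # Preferred path: select on the last column level.
        lower = pred_int.xs("lower", level=-1, axis=1).to_numpy(dtype=float)
        upper = pred_int.xs("upper", level=-1, axis=1).to_numpy(dtype=float)
        return lower, upper
    arr = np.asarray(pred_int, dtype=float)
    if arr.ndim == 2 and arr.shape[1] % 2 == 0:
        half = arr.shape[1] // 2  # split-half fallback: [lowers..., uppers...]
        return arr[:, :half], arr[:, half:]
    raise ValueError(f"unsupported interval layout with shape {arr.shape}")


def check_bounds_align(y_true, lower, upper):
    """Reshape y_true to 2D and verify the bounds match it."""
    y = np.asarray(y_true, dtype=float)
    y = y.reshape(len(y), -1)
    if lower.shape != y.shape or upper.shape != y.shape:
        raise ValueError(
            f"bounds {lower.shape}/{upper.shape} do not match y_true {y.shape}"
        )
    return y


# Columns laid out as (variable, coverage, lower/upper).
cols = pd.MultiIndex.from_product([["y"], [0.9], ["lower", "upper"]])
pred_int = pd.DataFrame([[0.0, 1.0], [2.0, 4.0]], columns=cols)

lower, upper = extract_interval_bounds(pred_int)
y_2d = check_bounds_align([0.5, 5.0], lower, upper)
```

Selecting by level rather than by substring avoids false matches when a variable name itself contains "lower" or "upper".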
Reference Issues/PRs
Fixes #390
What does this implement/fix? Explain your changes.
It introduces three reduction strategies from probabilistic regression to outlier/anomaly detection with a PyOD-compatible interface:
- QuantileOutlierDetector - Detects outliers based on predictive quantile extremity. Samples falling outside the expected quantile range (configurable via `alpha` parameter) are flagged as outliers. The outlier score is computed as the distance from the nearest quantile bound, normalized by the quantile range.
- DensityOutlierDetector - Detects outliers based on probability density (negative log-likelihood). Samples with low probability density under the predictive distribution are flagged as outliers. Supports both log-likelihood and raw likelihood scoring via the `use_log` parameter.
- LossOutlierDetector - Detects outliers based on predictive loss. Supports multiple loss functions:
  - `log_loss`: negative log-likelihood (equivalent to density-based)
  - `crps`: Continuous Ranked Probability Score
  - `interval_score`: interval score with configurable coverage

Key features:
- `fit()`, `predict()`, and `decision_function()` methods
- `contamination` parameter for automatic threshold determination

Implementation details:
- `BaseOutlierDetector` provides common functionality

Does your contribution introduce a new dependency? If yes, which one?
No, this implementation uses only existing dependencies (numpy, pandas, scipy - already required by skpro).
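For context, two of the losses named above can be computed with numpy and scipy alone. The sketch below uses the standard closed-form CRPS for a Normal distribution and the standard interval score; it is an illustration under a Normal assumption, with hypothetical function names, and not necessarily how `_loss.py` computes them.

```python
import numpy as np
from scipy.stats import norm


def crps_normal(y, mean, std):
    """Closed-form CRPS of N(mean, std^2) evaluated at observation y."""
    z = (np.asarray(y, dtype=float) - mean) / std
    return std * (
        z * (2.0 * norm.cdf(z) - 1.0) + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi)
    )


def interval_score(y, lower, upper, coverage=0.9):
    """Interval width plus a penalty of 2/alpha per unit outside the interval."""
    alpha = 1.0 - coverage
    y = np.asarray(y, dtype=float)
    below = np.maximum(lower - y, 0.0)
    above = np.maximum(y - upper, 0.0)
    return (upper - lower) + (2.0 / alpha) * (below + above)


crps_center = crps_normal(0.0, 0.0, 1.0)  # observation at the mean
crps_tail = crps_normal(4.0, 0.0, 1.0)  # observation far in the tail
inside = interval_score(0.5, 0.0, 1.0, coverage=0.9)
outside = interval_score(2.0, 0.0, 1.0, coverage=0.9)
```

Both losses grow as the observation moves away from the predictive distribution, so either can serve directly as a decision score.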
What should a reviewer concentrate their feedback on?
- `_quantile.py` and `_loss.py` where we handle array reshaping and multi-output cases
- `_loss.py` uses a Normal distribution approximation - is this adequate or should we use quantile-based approximation instead?

Did you add any tests for the change?
Yes, comprehensive test suite added in `skpro/outlier/tests/test_outliers.py`:
- `@run_test_for_class` decorator for proper test discovery

Example Output
The implementation includes a comprehensive example that demonstrates all three detector types on synthetic data with known outliers.
The visualization shows:
Any other comments?
- `BaseOutlierDetector` and implementing `_compute_decision_scores()`.

PR checklist
For all contributions
- [ENH] Implement outlier detection based on probabilistic regressors)

For new estimators
- `docs/source/api_reference/taskname.rst`, follow the pattern.
- `Examples` section.
- `python_dependencies` tag and ensured dependency isolation (N/A - no new soft dependencies)