
[ENH] Implement outlier detection based on probabilistic regressors#777

Open
arnavk23 wants to merge 23 commits into sktime:main from arnavk23:fix/issue-390-outlier-detection

Conversation

Contributor

@arnavk23 arnavk23 commented Feb 27, 2026

Reference Issues/PRs

Fixes #390

What does this implement/fix? Explain your changes.

This PR introduces three strategies that reduce outlier/anomaly detection to probabilistic regression, each with a PyOD-compatible interface:

  1. QuantileOutlierDetector - Detects outliers based on predictive quantile extremity. Samples falling outside the expected quantile range (configurable via alpha parameter) are flagged as outliers. The outlier score is computed as the distance from the nearest quantile bound, normalized by the quantile range.

  2. DensityOutlierDetector - Detects outliers based on probability density (negative log-likelihood). Samples with low probability density under the predictive distribution are flagged as outliers. Supports both log-likelihood and raw likelihood scoring via the use_log parameter.

  3. LossOutlierDetector - Detects outliers based on predictive loss. Supports multiple loss functions:

    • log_loss: negative log-likelihood (equivalent to density-based)
    • crps: Continuous Ranked Probability Score
    • interval_score: interval score with configurable coverage
    • Custom loss functions via callable
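
The first two scoring rules described above can be sketched as follows. This is a minimal illustration, not the PR's code: it assumes a Normal predictive distribution per sample, the helper names `quantile_extremity_score` and `nll_score` are hypothetical, and scoring points inside the interval as zero is only one possible reading of "distance from the nearest quantile bound".

```python
import numpy as np
from scipy.stats import norm


def quantile_extremity_score(y, lower, upper):
    """Distance outside [lower, upper], divided by the interval width."""
    y, lower, upper = (np.asarray(a, dtype=float) for a in (y, lower, upper))
    # Guard against a zero-width interval before dividing.
    width = np.maximum(upper - lower, np.finfo(float).eps)
    below = np.maximum(lower - y, 0.0)  # positive only when y < lower
    above = np.maximum(y - upper, 0.0)  # positive only when y > upper
    return (below + above) / width


def nll_score(y, mu, sigma):
    """Negative log-likelihood under N(mu, sigma^2); higher means more unusual."""
    return -norm.logpdf(y, loc=mu, scale=sigma)


alpha = 0.1  # assumed meaning: central (1 - alpha) predictive interval
mu = np.zeros(4)
sigma = np.ones(4)
y = np.array([0.0, 1.0, 3.0, -5.0])

lower = norm.ppf(alpha / 2, loc=mu, scale=sigma)
upper = norm.ppf(1 - alpha / 2, loc=mu, scale=sigma)

q_scores = quantile_extremity_score(y, lower, upper)
d_scores = nll_score(y, mu, sigma)
```

Both scores grow as an observation moves further into the tails; the quantile score is exactly zero inside the interval, while the density score stays informative there.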

Key features:

  • PyOD-compatible interface with fit(), predict(), and decision_function() methods
  • Works with any skpro probabilistic regressor
  • Configurable contamination parameter for automatic threshold determination
  • Can be used with both conditional and unconditional distribution estimates

Implementation details:

  • Base class BaseOutlierDetector provides common functionality
  • All detectors compute outlier scores during training and use percentile-based thresholds
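
A percentile-based threshold of the kind mentioned above can be sketched like this. It follows the usual PyOD convention (threshold at the `100 * (1 - contamination)` percentile of the training scores, label 1 for outliers); the function names are hypothetical and the PR's base class may differ in detail.

```python
import numpy as np


def fit_threshold(train_scores, contamination=0.1):
    """Return the score above which a sample is labelled an outlier."""
    if not 0.0 < contamination <= 0.5:
        raise ValueError("contamination must be in (0, 0.5]")
    return np.percentile(train_scores, 100.0 * (1.0 - contamination))


def predict_labels(scores, threshold):
    """PyOD-style labels: 0 for inliers, 1 for outliers."""
    return (np.asarray(scores) > threshold).astype(int)


rng = np.random.default_rng(0)
train_scores = rng.normal(size=1000)  # stand-in for real outlier scores
threshold = fit_threshold(train_scores, contamination=0.05)
labels = predict_labels(train_scores, threshold)
```

With 1000 distinct training scores and `contamination=0.05`, the 50 highest-scoring samples are flagged.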

Does your contribution introduce a new dependency? If yes, which one?

No, this implementation uses only existing dependencies (numpy, pandas, scipy - already required by skpro).

What should a reviewer concentrate their feedback on?

  • Does the PyOD-compatible interface make sense for skpro users?
  • Are the outlier scoring methods mathematically sound and properly implemented, particularly in _quantile.py and _loss.py, where we handle array reshaping and multi-output cases?
  • The CRPS implementation in _loss.py uses a Normal distribution approximation. Is this adequate, or should we use a quantile-based approximation instead?
  • Are the docstrings clear and comprehensive enough?
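
To make the CRPS question above concrete, the two options can be compared in a short sketch. This is not the code in _loss.py: `crps_normal` uses the standard closed form for a Normal distribution, and `crps_from_quantiles` uses the identity that CRPS equals twice the pinball loss integrated over all quantile levels, approximated here on a midpoint grid. Both function names and the grid size are assumptions.

```python
import numpy as np
from scipy.stats import norm


def crps_normal(y, mu, sigma):
    """Closed-form CRPS of N(mu, sigma^2) evaluated at y."""
    z = (np.asarray(y, dtype=float) - mu) / sigma
    return sigma * (
        z * (2.0 * norm.cdf(z) - 1.0) + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi)
    )


def crps_from_quantiles(y, quantile_fn, n_levels=999):
    """Approximate CRPS as 2 * mean pinball loss over a midpoint grid of levels."""
    taus = (np.arange(n_levels) + 0.5) / n_levels
    q = quantile_fn(taus)  # predictive quantiles, shape (n_levels,)
    u = np.asarray(y, dtype=float)[..., None] - q  # residuals per level
    pinball = u * (taus - (u < 0))  # tau * u if u >= 0, (tau - 1) * u otherwise
    return 2.0 * pinball.mean(axis=-1)


y = np.array([0.0, 1.5, 4.0])
exact = crps_normal(y, mu=0.0, sigma=1.0)
approx = crps_from_quantiles(y, lambda t: norm.ppf(t, loc=0.0, scale=1.0))
```

When the predictive distribution really is Normal the two agree closely; the quantile-based version only needs a quantile function, so it would also apply to non-Normal predictive distributions.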

Did you add any tests for the change?

Yes, a comprehensive test suite was added in skpro/outlier/tests/test_outliers.py:

  • Tests for fitting all three detector types
  • Tests for prediction and decision_function methods
  • Tests for different loss functions (log_loss, CRPS, interval_score)
  • Tests for custom loss functions
  • Tests for error handling (missing y values)
  • Integration test verifying all detectors work with the same interface
  • Tests use @run_test_for_class decorator for proper test discovery

Example Output

The implementation includes a comprehensive example that demonstrates all three detector types on synthetic data with known outliers.
The visualization shows:

  • Outlier scores from QuantileOutlierDetector and DensityOutlierDetector
  • Detected outliers in feature space
  • Performance comparison (precision/recall) across different methods
[Figure: outlier_detection_example]

Any other comments?

  • The implementation is designed to be extensible - users can easily create custom detectors by subclassing BaseOutlierDetector and implementing _compute_decision_scores().
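
The extension pattern described above can be sketched without the PR's code. `SketchBaseDetector` below is a hypothetical stand-in for BaseOutlierDetector, written only to show the template-method idea; the real class's signatures (for example whether `predict` takes `y`) may differ. The custom detector uses a plain least-squares fit in place of a skpro probabilistic regressor.

```python
import numpy as np


class SketchBaseDetector:
    """Hypothetical stand-in for the PR's BaseOutlierDetector (not its code)."""

    def __init__(self, contamination=0.1):
        self.contamination = contamination

    def fit(self, X, y):
        self._fit(X, y)
        # Score the training data and derive a percentile-based threshold.
        self.decision_scores_ = self._compute_decision_scores(X, y)
        self.threshold_ = np.percentile(
            self.decision_scores_, 100.0 * (1.0 - self.contamination)
        )
        return self

    def decision_function(self, X, y):
        return self._compute_decision_scores(X, y)

    def predict(self, X, y):
        return (self.decision_function(X, y) > self.threshold_).astype(int)

    def _fit(self, X, y):
        raise NotImplementedError

    def _compute_decision_scores(self, X, y):
        raise NotImplementedError


class ZScoreDetector(SketchBaseDetector):
    """Custom detector: absolute standardized residual of a linear fit."""

    def _design(self, X):
        return np.column_stack([np.ones(len(X)), X])

    def _fit(self, X, y):
        A = self._design(X)
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        self.sigma_ = np.std(y - A @ self.coef_)

    def _compute_decision_scores(self, X, y):
        return np.abs(y - self._design(X) @ self.coef_) / self.sigma_


rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
y[:5] += 3.0  # inject five obvious outliers

det = ZScoreDetector(contamination=0.05).fit(X, y)
labels = det.predict(X, y)
```

Only the scoring hook changes between detectors; fitting the threshold and turning scores into labels stay in the base class.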

PR checklist

For all contributions
  • I've added myself to the list of contributors with any new badges I've earned :-)
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. (Title: [ENH] Implement outlier detection based on probabilistic regressors)
For new estimators
  • I've added the estimator to the API reference - in docs/source/api_reference/taskname.rst, follow the pattern.
  • I've added one or more illustrative usage examples to the docstring, in a pydocstyle compliant Examples section.
  • If the estimator relies on a soft dependency, I've set the python_dependencies tag and ensured dependency isolation (N/A - no new soft dependencies)

Implements three types of outlier detectors as requested in issue sktime#390:

1. QuantileOutlierDetector - Detects outliers based on predictive quantile extremity.
   Samples falling outside the expected quantile range are flagged as outliers.

2. DensityOutlierDetector - Detects outliers based on probability density.
   Samples with low density (high negative log-likelihood) are flagged as outliers.

3. LossOutlierDetector - Detects outliers based on predictive loss.
   Supports multiple loss functions: log_loss, CRPS, interval_score, and custom losses.

Key features:
- PyOD-compatible interface with fit(), predict(), and decision_function() methods
- Works with any skpro probabilistic regressor
- Configurable contamination parameter for threshold determination
- Comprehensive test suite
- Example demonstrating usage with various regressors

Resolves sktime#390
@arnavk23 arnavk23 force-pushed the fix/issue-390-outlier-detection branch from 9870c21 to 9cf795d February 27, 2026 20:07
@arnavk23
Contributor Author

@fkiraly @marrov Could you please review this PR? I think it is complete.

@marrov
Member

marrov commented Mar 10, 2026

@arnavk23 - Pretty large PR, nothing jumps out at me after a quick scan. I'll trigger the LLM review and come back to it when I have some bandwidth (maybe next week).

@marrov marrov requested a review from Copilot March 10, 2026 13:50

Copilot AI left a comment


Pull request overview

Adds a new skpro.outlier module that implements outlier/anomaly detection by “reducing” the task to probabilistic regression, exposing a PyOD-like interface (fit, decision_function, predict).

Changes:

  • Introduces BaseOutlierDetector plus three detector implementations: quantile-, density-, and loss-based.
  • Adds documentation/API reference entries and a runnable example script.
  • Adds a new test suite covering detectors’ core interface and key options.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| skpro/outlier/base.py | Adds shared PyOD-like base class and thresholding logic. |
| skpro/outlier/_quantile.py | Implements quantile-interval based scoring logic. |
| skpro/outlier/_density.py | Implements likelihood / negative-log-likelihood based scoring. |
| skpro/outlier/_loss.py | Implements loss-based scoring incl. log-loss, CRPS, interval score, custom loss. |
| skpro/outlier/__init__.py | Exposes new detectors via package exports. |
| skpro/outlier/tests/test_outliers.py | Adds unit/integration tests for the new detectors. |
| skpro/outlier/tests/__init__.py | Initializes tests package. |
| docs/source/api_reference/outlier.rst | Documents new outlier module and classes in API reference. |
| docs/source/api_reference.rst | Links outlier API reference into master index. |
| examples/outlier_detection_example.py | Adds end-to-end usage example + optional visualization. |
| .all-contributorsrc | Adds contributor entry for the PR author. |


Comment thread .all-contributorsrc
Comment thread skpro/outlier/tests/test_outliers.py
Comment thread examples/outlier_detection_example.py Outdated
Comment thread skpro/outlier/_quantile.py Outdated
Comment thread skpro/outlier/_quantile.py Outdated
Comment thread skpro/outlier/base.py Outdated
Comment thread skpro/outlier/base.py Outdated
Comment thread skpro/outlier/base.py
Comment thread skpro/outlier/base.py Outdated
Comment thread skpro/outlier/_quantile.py Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@arnavk23 arnavk23 marked this pull request as draft March 10, 2026 17:20
- keep fallback CRPS sample/output structure intact and reduce per sample
- normalize callable loss outputs to one score per sample, with clear shape
  errors for invalid returns
- make BaseOutlierDetector y normalization consistent with y_inner_mtype by
  using DataFrame internally in fit and decision_function
- clarify base outlier detector docs: supervised by default, y=None only for
  pre-fitted regressors with X-only scoring implementations
- add regression tests for CRPS fallback shapes, callable loss reduction and
  validation, and canonical DataFrame y normalization
- fix malformed JSON in .all-contributorsrc and remove an unused import from
  the outlier detection example
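
The "one score per sample" normalization of callable loss outputs listed above might look roughly like this. It is a sketch under assumptions: the helper name `reduce_to_per_sample` is hypothetical, and averaging over output dimensions is one possible reduction (the PR may sum or use another rule).

```python
import numpy as np


def reduce_to_per_sample(loss_values, n_samples):
    """Collapse a custom loss's output to shape (n_samples,)."""
    arr = np.asarray(loss_values, dtype=float)
    if arr.ndim == 0 or arr.shape[0] != n_samples:
        raise ValueError(
            "custom loss must return one row per sample: expected leading "
            f"dimension {n_samples}, got shape {arr.shape}"
        )
    if arr.ndim == 1:
        return arr
    # Average over all trailing (output) axes.
    return arr.reshape(n_samples, -1).mean(axis=1)


per_output = np.array([[1.0, 3.0], [2.0, 2.0], [0.0, 4.0]])
scores = reduce_to_per_sample(per_output, n_samples=3)
```

A scalar return value or a mismatched leading dimension raises a `ValueError` that names the offending shape.
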
Formats skpro/outlier/_quantile.py to match the repository black hook output
so pre-commit and CI pass cleanly.
@arnavk23 arnavk23 marked this pull request as ready for review March 13, 2026 21:01
@arnavk23 arnavk23 requested a review from Copilot April 11, 2026 14:13

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.



Comment thread skpro/outlier/base.py
Comment thread skpro/outlier/base.py
Comment thread skpro/outlier/_quantile.py Outdated
Comment thread skpro/outlier/_loss.py
Comment thread examples/outlier_detection_example.py
Comment thread skpro/outlier/base.py Outdated
Comment thread skpro/outlier/base.py
arnavk23 and others added 4 commits April 12, 2026 03:19

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

docs/source/api_reference.rst:24

  • The toctree lists api_reference/tags twice. This can lead to duplicate entries / Sphinx warnings and is likely unintended; remove the duplicate (keep a single api_reference/tags entry).
.. toctree::
    :maxdepth: 1

    api_reference/tags
    api_reference/regression
    api_reference/survival
    api_reference/outlier
    api_reference/distributions
    api_reference/metrics
    api_reference/tags
    api_reference/base


Comment thread skpro/outlier/base.py Outdated
Comment thread skpro/outlier/_quantile.py Outdated
Comment thread skpro/outlier/_quantile.py Outdated
Comment thread skpro/outlier/_loss.py Outdated
Comment thread examples/outlier_detection_example.py Outdated
Comment thread .all-contributorsrc
arnavk23 and others added 3 commits April 12, 2026 04:31
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Replaced brittle substring-based lower/upper extraction in _compute_interval_score with a dedicated helper, _extract_interval_bounds.
_extract_interval_bounds now:
- Prefers MultiIndex selection via .xs("lower", level=-1, axis=1) and .xs("upper", level=-1, axis=1).
- Falls back to old formats only when needed (substring columns, then split-half fallback).
- Supports array-like interval outputs for 2D/3D layouts.
Added explicit shape validation in _compute_interval_score to ensure lower/upper align with y_true sample/output dimensions; raises a clear ValueError on mismatch.
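
The MultiIndex selection and shape check described in this commit can be illustrated with a small pandas sketch. It is not the PR's helper: the DataFrame below is hand-built to mimic an interval prediction with column levels (variable, coverage, lower/upper), and the interval score uses the standard formula of width plus a `2 / alpha` penalty for observations outside the interval.

```python
import numpy as np
import pandas as pd

coverage = 0.9
# Assumed column layout: (variable name, coverage, "lower"/"upper").
cols = pd.MultiIndex.from_product([["y"], [coverage], ["lower", "upper"]])
pred_int = pd.DataFrame([[-1.0, 1.0], [0.0, 2.0], [1.0, 3.0]], columns=cols)
y_true = np.array([[0.0], [3.0], [0.5]])

# Select bounds by the last column level instead of matching substrings.
lower = pred_int.xs("lower", level=-1, axis=1).to_numpy()
upper = pred_int.xs("upper", level=-1, axis=1).to_numpy()

# Fail early if the bounds do not align with the targets.
if lower.shape != y_true.shape or upper.shape != y_true.shape:
    raise ValueError(
        f"interval bounds {lower.shape}/{upper.shape} do not match "
        f"y_true {y_true.shape}"
    )

alpha = 1.0 - coverage
width = upper - lower
penalty_low = (2.0 / alpha) * np.maximum(lower - y_true, 0.0)
penalty_high = (2.0 / alpha) * np.maximum(y_true - upper, 0.0)
interval_score = (width + penalty_low + penalty_high).mean(axis=1)
```

The first observation lies inside its interval and is charged only the width; the other two fall outside and pick up the miss penalty.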

Development

Successfully merging this pull request may close these issues.

[ENH] outlier detection based on probabilistic regressors

3 participants