Conversation
himax12 commented on Apr 18, 2026
- fix: add coverage parameter to fit_predict and fit_predict_async
- feat: add compare_estimators tool for agentic model selection
- docs: update installation docs to recommend uvx
- fix: add cycle detection to sanitize_for_json
This fix addresses issue sktime#176, where the coverage parameter was silently dropped in async workflows, preventing prediction intervals from being returned.

Changes:
- Add a `predict_interval()` method to the Executor class that uses sktime's `predict_interval()` for probabilistic forecasting
- Add a coverage parameter to `Executor.fit_predict()`; when provided, `predict_interval()` is used instead of `predict()`
- Add a coverage parameter to `Executor.fit_predict_async()` with the same behavior
- Update `fit_predict_tool()` to accept and pass the coverage parameter
- Update `fit_predict_async_tool()` to accept and pass the coverage parameter
- Update the MCP server tool schemas for fit_predict and fit_predict_async to expose the coverage parameter in the API

The coverage parameter accepts:
- a single float (e.g., 0.9 for 90% prediction intervals)
- a list of floats for multiple interval levels (e.g., [0.5, 0.9, 0.95])

It only works with estimators that support prediction intervals (check the `capability:pred_int` tag).
This implements issue sktime#178: a new `compare_estimators` MCP tool that lets LLM agents automatically compare multiple models on the same dataset and select the best one.

Features:
- Compare multiple estimator handles using cross-validation
- Support for MAE, MAPE, MSE, RMSE, SMAPE, MASE, and MedAE metrics
- Works with both demo datasets and a custom data_handle
- Returns ranked results with the best estimator

The tool enables the full agentic loop:

```
extract_ts_metadata()        <- know what the data looks like
instantiate_estimator() x N  <- create candidate models
compare_estimators()         <- NEW: rank all candidates automatically
fit_predict(best_handle)     <- execute with the winner
```

Implementation:
- Added `_get_metric_instance()` helper to map metric names to sktime classes
- Added `_extract_metric_value()` to extract metrics from evaluate results
- Added `compare_estimators_tool()` function
- Registered the tool in server.py with the full MCP schema
This implements the feature request to use uvx as the recommended installation method for sktime-mcp.

Changes:
- README.md: lead with uvx as the recommended approach; update the Quick Start and Beginner Setup sections
- docs/source/intro.md: update Quick Start to lead with uvx
- docs/source/user-guide.md: update Getting Started to lead with uvx

uvx provides a no-install experience similar to npx for Node.js, automatically handling environment isolation and dependencies. This removes the manual venv-creation step from the user-facing docs.
RecursionError from circular references in sktime estimators

Problem: `sanitize_for_json` recursively walked objects but had no cycle detection. If an sktime estimator contains circular references in its attributes (parent/child pointers), this causes a RecursionError and crashes the MCP server.

Solution:
- Track visited objects by `id()` in a `_seen` set
- When a cycle is detected (id already seen), return `str(obj)` instead
- Also recurse into `__dict__` to catch cycles in object attributes (previously we just converted to string without traversing)

This handles:
- Self-referential dicts: `d['key'] = d`
- Nested cycles: `d['a']['b'] = d`
- Object parent pointers: `a.parent = b; b.parent = a`
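The three cycle cases above can be reproduced with a minimal, standalone sketch of the `_seen`-set idea (this is an illustration only, not the real `sanitize_for_json` from server.py, which handles many more types):

```python
def sanitize(obj, _seen=None):
    """Minimal sketch of cycle-aware sanitization using a set of visited ids."""
    if _seen is None:
        _seen = set()
    if isinstance(obj, (dict, list)):
        if id(obj) in _seen:
            return str(obj)  # cycle detected: fall back to a string representation
        _seen.add(id(obj))
        if isinstance(obj, dict):
            return {k: sanitize(v, _seen) for k, v in obj.items()}
        return [sanitize(v, _seen) for v in obj]
    return obj

# Self-referential dict: d['key'] = d
d = {"key": 1}
d["self"] = d

# Nested cycle: d['a']['b'] = d
nested = {"a": {"b": None}}
nested["a"]["b"] = nested
```

Running `sanitize(d)` keeps the plain values and replaces the cyclic reference with a string instead of recursing forever.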
Pull request overview
This PR expands the sktime-mcp tool surface to support prediction-interval workflows, introduces an estimator-comparison tool for agentic model selection, updates installation docs to recommend uvx, and hardens JSON sanitization by adding cycle detection.
Changes:
- Add `coverage` support to `fit_predict`/`fit_predict_async` to request prediction intervals.
- Add a `compare_estimators` tool that runs CV-based evaluation and ranks candidate estimators.
- Update docs/README to recommend running via `uvx` and bump the documented Python prerequisite to 3.10+.
- Add cycle detection to `sanitize_for_json` to avoid recursion errors on circular references.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| `src/sktime_mcp/tools/fit_predict.py` | Adds `coverage` parameter plumbing and updates tool docstrings/examples. |
| `src/sktime_mcp/tools/evaluate.py` | Adds metric helpers and the new `compare_estimators_tool` implementation. |
| `src/sktime_mcp/server.py` | Registers the new tool, threads `coverage` through schemas/calls, and adds cycle detection to JSON sanitization. |
| `src/sktime_mcp/runtime/executor.py` | Adds `predict_interval` plus `coverage` support in sync/async fit-predict flows. |
| `docs/source/user-guide.md` | Updates installation guidance to recommend uvx and Python 3.10+. |
| `docs/source/intro.md` | Updates "getting started" instructions to highlight uvx. |
| `README.md` | Reworks install/quickstart instructions around uvx with a pip fallback. |
```python
def compare_estimators_tool(
    estimator_handles: list[str],
    dataset: Optional[str] = None,
    data_handle: Optional[str] = None,
    metric: str = "MAPE",
    cv_folds: int = 3,
    horizon: int = 12,
) -> dict[str, Any]:
```
New compare_estimators_tool behavior (metric parsing, CV splitter construction, ranking, and error aggregation) isn’t covered by tests. Since evaluate_estimator_tool already has coverage in tests/test_evaluate.py, please add targeted tests for compare_estimators_tool (including default metric/horizon path and a small list of handles).
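The ranking and error-aggregation contract such a test would pin down can be captured in a few lines. A standalone sketch, with all handle names hypothetical and a lower-is-better metric assumed:

```python
def rank_candidates(scores: dict) -> dict:
    """Rank estimator handles by metric value (lower is better); aggregate failures.

    `scores` maps estimator handle -> metric value, or None when evaluation failed.
    Sketch of the behavior under test, not the real compare_estimators_tool.
    """
    ok = {h: s for h, s in scores.items() if s is not None}
    ranked = sorted(ok.items(), key=lambda kv: kv[1])
    return {
        "ranking": [{"handle": h, "score": s} for h, s in ranked],
        "best_estimator": ranked[0][0] if ranked else None,
        "errors": [h for h, s in scores.items() if s is None],
    }

result = rank_candidates({"est_naive": 0.12, "est_theta": 0.08, "est_broken": None})
```

A test can then assert the ordering, the chosen best handle, and that failed evaluations are reported rather than silently dropped.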
```python
return {
    "success": True,
    "predictions": pred_intervals_copy.to_dict(),
    "intervals": intervals_dict,
    "horizon": len(fh) if hasattr(fh, "__len__") else fh,
    "coverage": coverage,
}
```
predict_interval() returns the interval DataFrame (lower/upper) under the predictions key (pred_intervals_copy.to_dict()), but predict_interval in sktime does not return point forecasts. This breaks the existing API contract where predictions is the point forecast (and will also make fit_predict(..., coverage=...) return intervals instead of point predictions). Consider returning point forecasts from instance.predict(...) in predictions, and put interval data only under intervals (or rename the interval payload to avoid ambiguity).
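One way to restore the contract is to keep the two payloads disjoint. A sketch with placeholder values (in the real executor, `point_preds` would come from `instance.predict(...)` and `intervals_dict` from `predict_interval(...)`; the `Airline_0.9` key shape follows the `f"{var_name}_{cov_level}"` naming used elsewhere in this review):

```python
# Illustrative values only, to show the proposed payload shape.
point_preds = {"Airline": {"1961-01": 417.2, "1961-02": 391.5}}
intervals_dict = {
    "Airline_0.9": {"lower": {"1961-01": 380.1, "1961-02": 350.4},
                    "upper": {"1961-01": 455.3, "1961-02": 432.7}},
}

response = {
    "success": True,
    "predictions": point_preds,   # point forecasts keep their existing meaning
    "intervals": intervals_dict,  # interval bounds live only here
    "horizon": 2,
    "coverage": 0.9,
}
```

Callers that already consume `predictions` as point forecasts then keep working, with or without `coverage` set.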
```python
if _seen is None:
    _seen = set()

obj_id = id(obj)
if obj_id in _seen:
    return str(obj)
_seen.add(obj_id)
```
Cycle detection currently tracks all objects (including None, bool, small ints, and interned strings). Because these share stable identities, repeated values (e.g. [1, 1] or [None, None]) will be treated as cycles and coerced to str(obj), changing types/values (1 -> "1", None -> "None"). Only add objects to _seen for container / user-defined objects that can actually participate in reference cycles (dict/list/tuple/custom objects), and skip primitives/immutables when doing the id()/_seen check.
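A sketch of the suggested guard, assuming the surrounding function's signature (only containers and objects with a `__dict__` enter `_seen`; the real function handles more types):

```python
def _can_cycle(obj) -> bool:
    """Only containers and user-defined objects can sit on a reference cycle."""
    return isinstance(obj, (dict, list, tuple, set)) or hasattr(obj, "__dict__")

def sanitize_for_json(obj, _seen=None):
    # Sketch of the guard only; not the full implementation.
    if _seen is None:
        _seen = set()
    if _can_cycle(obj):
        if id(obj) in _seen:
            return str(obj)
        _seen.add(id(obj))
    if isinstance(obj, dict):
        return {str(k): sanitize_for_json(v, _seen) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        return [sanitize_for_json(v, _seen) for v in obj]
    return obj  # primitives pass through, even when the same object repeats
```

With this guard, `[1, 1]` and `[None, None]` survive untouched because the repeated primitives never enter the `id()` check.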
```diff
@@ -27,83 +27,90 @@ This MCP is **not** just documentation or static code analysis. It is a **semant
 ## 🛠️ Prerequisites

 - **Python 3.10+** (3.9 is listed in `pyproject.toml` but the `mcp` package requires 3.10+)
```
README says “3.9 is listed in pyproject.toml”, but pyproject.toml now sets requires-python = ">=3.10". Please update/remove the parenthetical so prerequisites stay accurate.
Suggested change:

```diff
-- **Python 3.10+** (3.9 is listed in `pyproject.toml` but the `mcp` package requires 3.10+)
+- **Python 3.10+**
```
```python
# Convert column names to find the matching metric
metric_col = None
for col in results.columns:
    if col == metric_name or col == f"test_{metric_name}":
        metric_col = col
        break
    # Check case-insensitive match
    col_lower = col.lower()
    metric_lower = metric_name.lower()
    if metric_lower in col_lower or col_lower == f"test_{metric_lower}":
        metric_col = col
        break
```
_extract_metric_value() tries to match acronyms like MAPE/MAE against evaluate() result columns, but the columns are typically named like test_MeanAbsolutePercentageError (see existing tests/test_evaluate.py). As a result, compare_estimators_tool() will fail to extract the default MAPE score and report errors for all estimators. Consider mapping acronyms to the actual scorer name (e.g., use scoring.name/type(scoring).__name__ or a dict of acronym -> full column name) and then look up test_<scorer_name> explicitly.
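The suggested acronym-to-scorer mapping could look like the sketch below. The column naming (`test_<ScorerClassName>`) follows what the existing tests in tests/test_evaluate.py observe; the mapping entries themselves are an assumption to verify against the metric classes the tool actually instantiates. Note that RMSE and SMAPE are parameterized variants (`MeanSquaredError(square_root=True)`, `MeanAbsolutePercentageError(symmetric=True)`), so they share class names with MSE/MAPE and need extra handling:

```python
# Hypothetical mapping: acronym -> sktime scorer class name (to be verified).
ACRONYM_TO_SCORER = {
    "MAE": "MeanAbsoluteError",
    "MAPE": "MeanAbsolutePercentageError",
    "MSE": "MeanSquaredError",
    "MASE": "MeanAbsoluteScaledError",
    "MEDAE": "MedianAbsoluteError",
}

def find_metric_column(columns, metric_name):
    """Resolve an acronym to evaluate()'s test_<ScorerName> column, or None."""
    scorer = ACRONYM_TO_SCORER.get(metric_name.upper(), metric_name)
    target = f"test_{scorer}"
    return target if target in columns else None

# Illustrative column list, shaped like an evaluate() result frame.
cols = ["fit_time", "pred_time", "test_MeanAbsolutePercentageError"]
```

An explicit lookup like this fails loudly (returns `None`) instead of silently matching the wrong column via substring checks.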
```python
>>> fit_predict_tool("est_abc123", "airline", horizon=12, coverage=0.9)
{
    "success": True,
    "predictions": {...},
    "intervals": {"Airline": {"coverage": 0.9, "lower": {...}, "upper": {...}}},
    "horizon": 12,
    "coverage": 0.9
}
```
The fit_predict_tool doc example for intervals doesn’t match the current Executor.predict_interval() output shape. predict_interval() builds keys like f"{var_name}_{cov_level}" (e.g., Airline_0.9), not just Airline. Please update the docstring example (and/or align the implementation to the documented structure) so users know what to expect.
```python
# Use predict_interval if coverage is provided, otherwise use predict
if coverage is not None:
    return self.predict_interval(handle_id, fh=fh, X=X, coverage=coverage)
return self.predict(handle_id, fh=fh, X=X)
```
The new coverage path (fit_predict -> predict_interval and async equivalent) isn’t covered by tests. There are already tests for fit_predict_async in tests/test_background_jobs.py; please add cases that pass coverage and assert the response includes point forecasts plus interval bounds (and that unsupported estimators return the expected error).
```diff
 import logging
-from typing import Any
+from typing import Any, Optional, Union
```
Union is imported but not used in this module, which will fail ruff check (F401). Remove the unused import or use it in a type annotation.
Suggested change:

```diff
-from typing import Any, Optional, Union
+from typing import Any, Optional
```
```python
if not dataset and not data_handle:
    return {
        "success": False,
        "error": "Either dataset or data_handle must be provided"
    }
```
compare_estimators_tool documents “use dataset or data_handle, not both”, but the validation only errors when neither is provided. Add an explicit check to reject requests where both dataset and data_handle are set to avoid ambiguous behavior.
```python
eval_results = evaluate(
    forecaster=instance,
    y=y,
    X=X,
    cv=cv,
    scoring=scoring,
)
```
evaluate() will fit the provided forecaster during CV, which mutates the estimator instance stored under the handle. That means running compare_estimators_tool can unexpectedly change/overwrite a user’s live estimator state. Prefer cloning the forecaster per evaluation (e.g., via sktime.base.clone/sklearn.base.clone) so comparisons are side‑effect free.
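To see why cloning matters, a dependency-free sketch using `copy.deepcopy` as a stand-in for `sklearn.base.clone` (which goes further by reconstructing an unfitted estimator from `get_params()`); the `FakeForecaster` class is purely illustrative:

```python
import copy

class FakeForecaster:
    """Stand-in estimator whose fit() mutates internal state, as CV fitting would."""
    def __init__(self):
        self.is_fitted = False

    def fit(self, y):
        self.is_fitted = True
        return self

stored = FakeForecaster()          # the instance kept under the user's handle
candidate = copy.deepcopy(stored)  # clone-like copy used only for evaluation
candidate.fit([1.0, 2.0, 3.0])
```

After evaluation, the copy is fitted while the stored instance is untouched, so repeated comparisons stay side-effect free.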