Conversation
himax12
commented
Apr 18, 2026
- fix: add coverage parameter to fit_predict and fit_predict_async
- feat: add compare_estimators tool for agentic model selection
- docs: update installation docs to recommend uvx
This fix addresses issue sktime#176, where the `coverage` parameter was silently dropped in async workflows, preventing prediction intervals from being returned.

Changes:
- Add a `predict_interval()` method to the `Executor` class that uses sktime's `predict_interval()` for probabilistic forecasting
- Add a `coverage` parameter to `Executor.fit_predict()`: when provided, `predict_interval()` is used instead of `predict()`
- Add a `coverage` parameter to `Executor.fit_predict_async()` (same behavior)
- Update `fit_predict_tool()` to accept and pass the `coverage` parameter
- Update `fit_predict_async_tool()` to accept and pass the `coverage` parameter
- Update the MCP server tool schemas for `fit_predict` and `fit_predict_async` to expose `coverage` in the API

The `coverage` parameter accepts:
- a single float (e.g., `0.9` for 90% prediction intervals)
- a list of floats for multiple interval levels (e.g., `[0.5, 0.9, 0.95]`)

This only works with estimators that support prediction intervals (check the `capability:pred_int` tag).
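Since `coverage` may be a single float or a list, downstream handling is simpler if it is normalized up front. A minimal sketch of such a helper (illustrative, not necessarily the PR's actual code):

```python
from typing import Union


def normalize_coverage(coverage: Union[float, list[float]]) -> list[float]:
    """Accept a single coverage level or a list of levels; validate range (0, 1)."""
    levels = [coverage] if isinstance(coverage, (int, float)) else list(coverage)
    for c in levels:
        if not 0.0 < c < 1.0:
            raise ValueError(f"coverage must be in (0, 1), got {c}")
    return levels


assert normalize_coverage(0.9) == [0.9]
assert normalize_coverage([0.5, 0.9, 0.95]) == [0.5, 0.9, 0.95]
```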
This implements issue sktime#178: a new `compare_estimators` MCP tool that allows LLM agents to automatically compare multiple models on the same dataset and select the best one.

Features:
- Compare multiple estimator handles using cross-validation
- Support for MAE, MAPE, MSE, RMSE, SMAPE, MASE, and MedAE metrics
- Works with both demo datasets and a custom `data_handle`
- Returns ranked results with the best estimator

The tool enables the full agentic loop:
1. `extract_ts_metadata()`: know what the data looks like
2. `instantiate_estimator()` x N: create candidate models
3. `compare_estimators()` (NEW): rank all candidates automatically
4. `fit_predict(best_handle)`: execute with the winner

Implementation:
- Added a `_get_metric_instance()` helper to map metric names to sktime classes
- Added `_extract_metric_value()` to extract metrics from `evaluate` results
- Added the `compare_estimators_tool()` function
- Registered the tool in `server.py` with a full MCP schema
This implements the feature request to use `uvx` as the recommended installation method for sktime-mcp.

Changes:
- `README.md`: lead with `uvx` as the recommended approach; update the Quick Start and Beginner Setup sections
- `docs/source/intro.md`: update Quick Start to lead with `uvx`
- `docs/source/user-guide.md`: update Getting Started to lead with `uvx`

`uvx` provides a no-install experience similar to `npx` for Node.js, automatically handling environment isolation and dependencies. This removes the manual venv creation step from the user-facing docs.
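For reference, an MCP client configuration using `uvx` might look like the following config fragment (the `sktime-mcp` command name is an assumption based on the repo name; check the README for the exact entry point):

```json
{
  "mcpServers": {
    "sktime": {
      "command": "uvx",
      "args": ["sktime-mcp"]
    }
  }
}
```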
Pull request overview
This PR extends the sktime-mcp server with prediction-interval support in fit/predict workflows, introduces a new tool to compare multiple estimators via cross-validation for agentic model selection, and updates documentation to recommend uvx for installation/running.
Changes:
- Add `coverage` support to `fit_predict`/`fit_predict_async` to request prediction intervals.
- Add `compare_estimators` tool for cross-validated ranking of multiple estimator handles.
- Update README + docs to recommend `uvx`-based usage and refresh install/run instructions.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| src/sktime_mcp/tools/fit_predict.py | Adds coverage parameter to tool APIs and updates docstring examples/return contract. |
| src/sktime_mcp/tools/evaluate.py | Adds metric mapping/helpers and new compare_estimators_tool implementation. |
| src/sktime_mcp/server.py | Registers compare_estimators tool and exposes coverage in tool schemas + dispatcher. |
| src/sktime_mcp/runtime/executor.py | Adds predict_interval and threads coverage through fit→predict sync/async paths. |
| docs/source/user-guide.md | Recommends uvx usage and updates client configuration examples. |
| docs/source/intro.md | Updates quickstart to recommend uvx and shows example client configs. |
| README.md | Reworks installation/running guidance around uvx and updates prerequisites text. |
```diff
 import logging
-from typing import Any
+from typing import Any, Optional, Union

 from sktime.forecasting.model_evaluation import evaluate
 from sktime.forecasting.model_selection import ExpandingWindowSplitter
 from sktime.performance_metrics.forecasting import (
     MeanAbsoluteError,
     MeanAbsolutePercentageError,
     MeanSquaredError,
     MedianAbsoluteError,
     GeometricMeanAbsolutePercentageError,
 )
```
Union is imported but unused in this module, which will be flagged by linting (pyflakes/ruff F401). Remove the unused import (or use it if intended).
```python
initial_window = max(int(n * 0.5), n - cv_folds * 2)
if initial_window < 1:
    initial_window = 1
cv = ExpandingWindowSplitter(
    initial_window=initial_window,
    step_length=1,
```
The CV splitter setup can easily produce 0 folds (or raise) for typical inputs because initial_window is computed without accounting for horizon/max(fh). For example, with small n or larger horizon, initial_window + max(fh) can exceed n, so evaluate(...) returns empty results or errors. Adjust initial_window/step_length so that the requested cv_folds is feasible given n and horizon (and fail fast with a clear error when it isn’t).
Suggested change:

```diff
-initial_window = max(int(n * 0.5), n - cv_folds * 2)
-if initial_window < 1:
-    initial_window = 1
-cv = ExpandingWindowSplitter(
-    initial_window=initial_window,
-    step_length=1,
+if horizon < 1:
+    return {
+        "success": False,
+        "error": "horizon must be at least 1"
+    }
+if cv_folds < 1:
+    return {
+        "success": False,
+        "error": "cv_folds must be at least 1"
+    }
+max_fh = horizon
+max_possible_folds = n - max_fh
+if max_possible_folds < 1:
+    return {
+        "success": False,
+        "error": (
+            f"Time series is too short for horizon={horizon}: "
+            f"need at least {horizon + 1} observations, got {n}"
+        ),
+    }
+if cv_folds > max_possible_folds:
+    return {
+        "success": False,
+        "error": (
+            f"Requested cv_folds={cv_folds} is not feasible for "
+            f"n={n} and horizon={horizon}; maximum is {max_possible_folds}"
+        ),
+    }
+target_initial_window = max(1, int(n * 0.5))
+if cv_folds == 1:
+    step_length = 1
+    initial_window = n - max_fh
+else:
+    step_length = max(
+        1,
+        (n - max_fh - target_initial_window) // (cv_folds - 1),
+    )
+    initial_window = n - max_fh - step_length * (cv_folds - 1)
+cv = ExpandingWindowSplitter(
+    initial_window=initial_window,
+    step_length=step_length,
```
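The suggested fold arithmetic can be sanity-checked without sktime. A standalone sketch (variable names mirror the suggestion; the fold count approximates `ExpandingWindowSplitter` semantics, where each training window must leave `horizon` points for testing):

```python
def plan_cv(n: int, horizon: int, cv_folds: int) -> tuple[int, int]:
    """Compute (initial_window, step_length) so that exactly cv_folds
    expanding-window folds fit, each leaving room for the horizon."""
    max_fh = horizon
    if n - max_fh < cv_folds:
        raise ValueError(f"cv_folds={cv_folds} infeasible for n={n}, horizon={horizon}")
    target_initial_window = max(1, int(n * 0.5))
    if cv_folds == 1:
        return n - max_fh, 1
    step_length = max(1, (n - max_fh - target_initial_window) // (cv_folds - 1))
    initial_window = n - max_fh - step_length * (cv_folds - 1)
    return initial_window, step_length


def n_folds(n: int, horizon: int, initial_window: int, step_length: int) -> int:
    """Count training windows of size initial_window, initial_window + step_length,
    ... that still leave horizon observations for the test window."""
    count, window = 0, initial_window
    while window + horizon <= n:
        count += 1
        window += step_length
    return count


iw, sl = plan_cv(n=36, horizon=12, cv_folds=3)
assert n_folds(36, 12, iw, sl) == 3
```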
```python
pred_intervals = instance.predict_interval(fh=fh, X=X, coverage=coverage)

# Convert to JSON-serializable format
# The result is a DataFrame with MultiIndex columns: (variable, coverage, lower/upper)
pred_intervals_copy = pred_intervals.copy()
pred_intervals_copy.index = pred_intervals_copy.index.astype(str)

# Extract intervals into a structured format
intervals_dict = {}
for col in pred_intervals_copy.columns:
    var_name = col[0] if isinstance(col, tuple) else "predictions"
    cov_level = col[1] if isinstance(col, tuple) else coverage
    bound_type = col[2] if isinstance(col, tuple) else "bound"

    key = f"{var_name}_{cov_level}"
    if key not in intervals_dict:
        intervals_dict[key] = {"coverage": cov_level}
    intervals_dict[key][bound_type] = pred_intervals_copy[col].to_dict()

return {
    "success": True,
    "predictions": pred_intervals_copy.to_dict(),
    "intervals": intervals_dict,
    "horizon": len(fh) if hasattr(fh, "__len__") else fh,
    "coverage": coverage,
}
```
predict_interval returns "predictions": pred_intervals_copy.to_dict(), but pred_intervals_copy contains lower/upper interval columns (often with MultiIndex/tuple column keys). This means callers expecting point forecasts will instead get interval data (and tuple keys will stringify during JSON sanitization). Consider returning point predictions separately (e.g., via instance.predict(...)) and keep interval bounds exclusively under intervals in a stable JSON-friendly structure.
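A sketch of the JSON-friendly split this comment asks for, using only pandas; the tuple columns mimic sktime's `(variable, coverage, lower/upper)` MultiIndex, and the variable name and values are illustrative:

```python
import pandas as pd

# Illustrative stand-in for sktime's predict_interval output:
# MultiIndex columns of (variable, coverage, "lower"/"upper")
cols = pd.MultiIndex.from_tuples([
    ("Airline", 0.9, "lower"),
    ("Airline", 0.9, "upper"),
])
pred_intervals = pd.DataFrame(
    [[410.0, 470.0], [405.0, 480.0]],
    index=["1961-01", "1961-02"],
    columns=cols,
)

# Keep interval bounds under a stable, string-keyed structure,
# leaving "predictions" free for point forecasts from instance.predict(...)
intervals = {}
for var, cov, bound in pred_intervals.columns:
    level = intervals.setdefault(var, {}).setdefault(str(cov), {})
    level[bound] = pred_intervals[(var, cov, bound)].to_dict()

assert intervals["Airline"]["0.9"]["lower"] == {"1961-01": 410.0, "1961-02": 405.0}
```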
```python
>>> fit_predict_tool("est_abc123", "airline", horizon=12, coverage=0.9)
{
    "success": True,
    "predictions": {...},
    "intervals": {"Airline": {"coverage": 0.9, "lower": {...}, "upper": {...}}},
    "horizon": 12,
    "coverage": 0.9
```
The fit_predict_tool doc example for intervals doesn’t match the structure returned by Executor.predict_interval (which currently keys intervals by something like <var>_<coverage> and returns interval columns under predictions). This mismatch will confuse tool users; update the example/Return contract (or adjust predict_interval output) so docs reflect the actual JSON shape.
```python
        "default": 12,
    },
},
"required": ["estimator_handles"],
```
compare_estimators tool schema doesn’t require either dataset or data_handle, but compare_estimators_tool fails unless one is provided. Tighten the schema (e.g., oneOf requiring exactly one of the two) so clients get validation errors instead of runtime errors.
Suggested change:

```diff
 "required": ["estimator_handles"],
+"oneOf": [
+    {"required": ["dataset"]},
+    {"required": ["data_handle"]},
+],
```
```python
def _extract_metric_value(results: Any, metric_name: str) -> Optional[float]:
    """Extract the metric value from evaluate results.

    The column name format is 'test_<MetricName>' (e.g., 'test_MeanAbsolutePercentageError').

    Args:
        results: DataFrame from sktime evaluate()
        metric_name: Name of the metric to extract

    Returns:
        The metric value as a float, or None if not found
    """
    # Convert column names to find the matching metric
    metric_col = None
    for col in results.columns:
        if col == metric_name or col == f"test_{metric_name}":
            metric_col = col
            break
        # Check case-insensitive match
        col_lower = col.lower()
        metric_lower = metric_name.lower()
        if metric_lower in col_lower or col_lower == f"test_{metric_lower}":
            metric_col = col
            break
```
_extract_metric_value is unlikely to work with the metric abbreviations passed by compare_estimators_tool (e.g., "MAPE"). sktime.evaluate() typically emits columns like test_MeanAbsolutePercentageError (see existing test expectations), so searching for "MAPE" as a substring won't match and score will often be None, making comparisons fail. Consider extracting based on the scoring object’s name/class used in evaluate(...) (and/or building an explicit abbrev→full-column mapping) rather than fuzzy substring matching on the abbreviation.
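One way to avoid the fuzzy matching is an explicit abbreviation-to-class-name map, from which the `test_<ClassName>` column follows directly. A sketch (the set of abbreviations shown is illustrative, not the tool's full list):

```python
# Map tool-facing abbreviations to sktime metric class names;
# evaluate() names its score columns "test_<ClassName>".
METRIC_CLASS_NAMES = {
    "MAE": "MeanAbsoluteError",
    "MAPE": "MeanAbsolutePercentageError",
    "MSE": "MeanSquaredError",
    "MEDAE": "MedianAbsoluteError",
}


def metric_column(abbrev: str) -> str:
    """Return the evaluate() result column name for a metric abbreviation."""
    try:
        return f"test_{METRIC_CLASS_NAMES[abbrev.upper()]}"
    except KeyError:
        raise ValueError(f"Unsupported metric: {abbrev!r}") from None


assert metric_column("MAPE") == "test_MeanAbsolutePercentageError"
```

Metrics such as RMSE and SMAPE map onto parameterized sktime classes (e.g. `MeanSquaredError(square_root=True)`), so deriving the column from the actual scoring object passed to `evaluate(...)` is likely the more robust variant.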
```python
if not dataset and not data_handle:
    return {
        "success": False,
        "error": "Either dataset or data_handle must be provided"
    }

# Load data
if data_handle:
    if data_handle not in executor._data_handles:
        return {
            "success": False,
            "error": f"Unknown data handle: {data_handle}",
            "available_handles": list(executor._data_handles.keys()),
        }
    data_info = executor._data_handles[data_handle]
    y = data_info["y"]
    X = data_info.get("X")
else:
    data_result = executor.load_dataset(dataset)
    if not data_result["success"]:
        return data_result
    y = data_result["data"]
    X = data_result.get("exog")
```
compare_estimators_tool documentation/schema says "Use this or data_handle, not both", but the implementation allows both and silently prioritizes data_handle. This makes the API ambiguous for callers; it should reject requests where both are provided (or clearly document precedence).
```python
        ),
    },
},
"required": ["estimator_handle"],
```
fit_predict tool schema marks only estimator_handle as required, but fit_predict_tool requires dataset unless data_handle is provided. Because call_tool currently defaults missing dataset to an empty string, clients can send an invalid request that will fail at runtime. Consider using a oneOf schema to require exactly one of {dataset, data_handle} (and avoid defaulting dataset to "" in the dispatcher).
Suggested change:

```diff
-"required": ["estimator_handle"],
+"oneOf": [
+    {"required": ["estimator_handle", "dataset"]},
+    {"required": ["estimator_handle", "data_handle"]},
+],
```
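Where a client does not enforce `oneOf`, the dispatcher can still fail fast with a dependency-free check along these lines (a sketch, not the repo's actual dispatcher code):

```python
from typing import Any, Optional


def require_exactly_one(
    dataset: Optional[str], data_handle: Optional[str]
) -> Optional[dict[str, Any]]:
    """Return an error dict unless exactly one data source is given."""
    if bool(dataset) == bool(data_handle):  # both set, or both missing/empty
        return {
            "success": False,
            "error": "Provide exactly one of 'dataset' or 'data_handle'",
        }
    return None


assert require_exactly_one("airline", None) is None
assert require_exactly_one(None, None)["success"] is False
assert require_exactly_one("airline", "dh_1")["success"] is False
```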
```diff
@@ -27,83 +27,90 @@ This MCP is **not** just documentation or static code analysis. It is a **semant
 ## 🛠️ Prerequisites

 - **Python 3.10+** (3.9 is listed in `pyproject.toml` but the `mcp` package requires 3.10+)
```
README says Python 3.9 is listed in pyproject.toml, but pyproject.toml currently has requires-python = ">=3.10". This note is now inaccurate; please update/remove it to avoid confusing users about supported Python versions.
Suggested change:

```diff
-- **Python 3.10+** (3.9 is listed in `pyproject.toml` but the `mcp` package requires 3.10+)
+- **Python 3.10+**
```
```python
def compare_estimators_tool(
    estimator_handles: list[str],
    dataset: Optional[str] = None,
    data_handle: Optional[str] = None,
    metric: str = "MAPE",
    cv_folds: int = 3,
    horizon: int = 12,
) -> dict[str, Any]:
```
New compare_estimators_tool behavior isn’t covered by tests (there are tests for evaluate_estimator_tool but none for this new tool). Add tests that cover: ranking order on a known dataset, handling of unknown handles, and metric extraction for the default metric (MAPE) to prevent regressions.
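A regression test for the ranking contract need not touch sktime at all; the ordering can be pinned against a stand-in result list (function name and result shape here are hypothetical):

```python
def rank_results(results: list[dict], lower_is_better: bool = True) -> list[dict]:
    """Order comparison results by score, dropping failed (score=None) entries."""
    scored = [r for r in results if r.get("score") is not None]
    return sorted(scored, key=lambda r: r["score"], reverse=not lower_is_better)


results = [
    {"handle": "est_naive", "score": 0.21},
    {"handle": "est_theta", "score": 0.12},
    {"handle": "est_broken", "score": None},  # e.g., fit failed for this handle
]
ranked = rank_results(results)
assert [r["handle"] for r in ranked] == ["est_theta", "est_naive"]
```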