
docs/uvx recommended setup #237

Open

himax12 wants to merge 3 commits into sktime:main from himax12:docs/uvx-recommended-setup

Conversation


himax12 commented Apr 18, 2026

  • fix: add coverage parameter to fit_predict and fit_predict_async
  • feat: add compare_estimators tool for agentic model selection
  • docs: update installation docs to recommend uvx

himax12 added 3 commits April 18, 2026 12:22
This fix addresses issue sktime#176 where the coverage parameter was silently
dropped in async workflows, preventing prediction intervals from being
returned.

Changes:
- Add predict_interval() method to Executor class that uses sktime's
  predict_interval() for probabilistic forecasting
- Add coverage parameter to Executor.fit_predict() - when provided,
  uses predict_interval() instead of predict()
- Add coverage parameter to Executor.fit_predict_async() - same behavior
- Update fit_predict_tool() to accept and pass coverage parameter
- Update fit_predict_async_tool() to accept and pass coverage parameter
- Update MCP server tool schemas for fit_predict and fit_predict_async
  to expose coverage parameter in the API

The coverage parameter accepts:
- A single float (e.g., 0.9 for 90% prediction intervals)
- A list of floats for multiple interval levels (e.g., [0.5, 0.9, 0.95])

Only works with estimators that support prediction intervals (check the
capability:pred_int tag).
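As a rough illustration of the float-or-list contract described above, a small helper (hypothetical; not part of the PR's actual code) could normalize either form to the list shape sktime's predict_interval expects:

```python
def normalize_coverage(coverage):
    """Normalize the coverage argument: a single float becomes a
    one-element list, and any iterable of floats is passed through
    as a plain list of floats."""
    if isinstance(coverage, (int, float)):
        return [float(coverage)]
    return [float(c) for c in coverage]

print(normalize_coverage(0.9))            # [0.9]
print(normalize_coverage([0.5, 0.9, 0.95]))  # [0.5, 0.9, 0.95]
```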

This implements issue sktime#178 - a new compare_estimators MCP tool that allows
LLM agents to automatically compare multiple models on the same dataset
and select the best one.

Features:
- Compare multiple estimator handles using cross-validation
- Support for MAE, MAPE, MSE, RMSE, SMAPE, MASE, MedAE metrics
- Works with both demo datasets and custom data_handle
- Returns ranked results with best estimator

The tool enables the full agentic loop:
  extract_ts_metadata()        <- know what the data looks like
  instantiate_estimator() x N  <- create candidate models
  compare_estimators()         <- NEW: rank all candidates automatically
  fit_predict(best_handle)     <- execute with the winner

Implementation:
- Added _get_metric_instance() helper to map metric names to sktime classes
- Added _extract_metric_value() to extract metrics from evaluate results
- Added compare_estimators_tool() function
- Registered tool in server.py with full MCP schema
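The "ranked results with best estimator" behavior can be sketched with a hypothetical helper (names and return shape are illustrative, not the PR's actual code); error metrics rank lower-is-better, and candidates whose score could not be computed sort last:

```python
def rank_estimators(scores):
    """Rank candidate estimators by error score (lower is better).

    `scores` maps estimator handle -> metric value, or None when the
    metric could not be computed; failed candidates sort last."""
    ranked = sorted(
        ({"handle": h, "score": s} for h, s in scores.items()),
        key=lambda r: (r["score"] is None,
                       r["score"] if r["score"] is not None else 0.0),
    )
    best = ranked[0]["handle"] if ranked and ranked[0]["score"] is not None else None
    return {"ranking": ranked, "best_estimator": best}

result = rank_estimators({"est_naive": 0.12, "est_theta": 0.08, "est_broken": None})
print(result["best_estimator"])  # est_theta
```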

This implements the feature request to use uvx as the recommended
installation method for sktime-mcp.

Changes:
- README.md: Lead with uvx as the recommended approach, update Quick Start
  and Beginner Setup sections
- docs/source/intro.md: Update Quick Start to lead with uvx
- docs/source/user-guide.md: Update Getting Started to lead with uvx

uvx provides a no-install experience similar to npx for Node.js,
automatically handling environment isolation and dependencies.

This removes the manual venv creation step from user-facing docs.
Copilot AI review requested due to automatic review settings April 18, 2026 07:21

Copilot AI left a comment


Pull request overview

This PR extends the sktime-mcp server with prediction-interval support in fit/predict workflows, introduces a new tool to compare multiple estimators via cross-validation for agentic model selection, and updates documentation to recommend uvx for installation/running.

Changes:

  • Add coverage support to fit_predict / fit_predict_async to request prediction intervals.
  • Add compare_estimators tool for cross-validated ranking of multiple estimator handles.
  • Update README + docs to recommend uvx-based usage and refresh install/run instructions.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 12 comments.

Summary per file:

  • src/sktime_mcp/tools/fit_predict.py: Adds coverage parameter to tool APIs and updates docstring examples/return contract.
  • src/sktime_mcp/tools/evaluate.py: Adds metric mapping/helpers and the new compare_estimators_tool implementation.
  • src/sktime_mcp/server.py: Registers the compare_estimators tool and exposes coverage in tool schemas + dispatcher.
  • src/sktime_mcp/runtime/executor.py: Adds predict_interval and threads coverage through fit→predict sync/async paths.
  • docs/source/user-guide.md: Recommends uvx usage and updates client configuration examples.
  • docs/source/intro.md: Updates quickstart to recommend uvx and shows example client configs.
  • README.md: Reworks installation/running guidance around uvx and updates prerequisites text.


Comment on lines 7 to +18
import logging
from typing import Any
from typing import Any, Optional, Union

from sktime.forecasting.model_evaluation import evaluate
from sktime.forecasting.model_selection import ExpandingWindowSplitter
from sktime.performance_metrics.forecasting import (
    MeanAbsoluteError,
    MeanAbsolutePercentageError,
    MeanSquaredError,
    MedianAbsoluteError,
    GeometricMeanAbsolutePercentageError,
)

Copilot AI Apr 18, 2026


Union is imported but unused in this module, which will be flagged by linting (pyflakes/ruff F401). Remove the unused import (or use it if intended).

Comment on lines +243 to +248
initial_window = max(int(n * 0.5), n - cv_folds * 2)
if initial_window < 1:
    initial_window = 1
cv = ExpandingWindowSplitter(
    initial_window=initial_window,
    step_length=1,

Copilot AI Apr 18, 2026


The CV splitter setup can easily produce 0 folds (or raise) for typical inputs because initial_window is computed without accounting for horizon/max(fh). For example, with small n or larger horizon, initial_window + max(fh) can exceed n, so evaluate(...) returns empty results or errors. Adjust initial_window/step_length so that the requested cv_folds is feasible given n and horizon (and fail fast with a clear error when it isn’t).

Suggested change
initial_window = max(int(n * 0.5), n - cv_folds * 2)
if initial_window < 1:
    initial_window = 1
cv = ExpandingWindowSplitter(
    initial_window=initial_window,
    step_length=1,

if horizon < 1:
    return {
        "success": False,
        "error": "horizon must be at least 1",
    }
if cv_folds < 1:
    return {
        "success": False,
        "error": "cv_folds must be at least 1",
    }
max_fh = horizon
max_possible_folds = n - max_fh
if max_possible_folds < 1:
    return {
        "success": False,
        "error": (
            f"Time series is too short for horizon={horizon}: "
            f"need at least {horizon + 1} observations, got {n}"
        ),
    }
if cv_folds > max_possible_folds:
    return {
        "success": False,
        "error": (
            f"Requested cv_folds={cv_folds} is not feasible for "
            f"n={n} and horizon={horizon}; maximum is {max_possible_folds}"
        ),
    }
target_initial_window = max(1, int(n * 0.5))
if cv_folds == 1:
    step_length = 1
    initial_window = n - max_fh
else:
    step_length = max(
        1,
        (n - max_fh - target_initial_window) // (cv_folds - 1),
    )
    initial_window = n - max_fh - step_length * (cv_folds - 1)
cv = ExpandingWindowSplitter(
    initial_window=initial_window,
    step_length=step_length,
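The feasibility arithmetic in the suggestion above can be checked in isolation. The sketch below (hypothetical names; a simplification of the suggested change that raises instead of returning error dicts) computes a feasible initial window and step length for an expanding-window split:

```python
def plan_cv(n, horizon, cv_folds):
    """Compute (initial_window, step_length) so that exactly cv_folds
    expanding-window splits fit in a series of length n with the given
    forecast horizon. Folds are capped at n - horizon, and the initial
    window is targeted near n // 2, mirroring the suggestion above."""
    max_possible_folds = n - horizon
    if horizon < 1 or cv_folds < 1 or max_possible_folds < 1:
        raise ValueError("series too short for the requested horizon")
    if cv_folds > max_possible_folds:
        raise ValueError(f"cv_folds={cv_folds} infeasible; max is {max_possible_folds}")
    target_initial = max(1, n // 2)
    if cv_folds == 1:
        return n - horizon, 1
    step = max(1, (n - horizon - target_initial) // (cv_folds - 1))
    initial = n - horizon - step * (cv_folds - 1)
    return initial, step

# e.g. n=36 monthly points, horizon=12, 3 folds -> trains end at 18, 21, 24
print(plan_cv(36, 12, 3))  # (18, 3)
```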

Comment on lines +227 to +252
pred_intervals = instance.predict_interval(fh=fh, X=X, coverage=coverage)

# Convert to JSON-serializable format
# The result is a DataFrame with MultiIndex columns: (variable, coverage, lower/upper)
pred_intervals_copy = pred_intervals.copy()
pred_intervals_copy.index = pred_intervals_copy.index.astype(str)

# Extract intervals into a structured format
intervals_dict = {}
for col in pred_intervals_copy.columns:
    var_name = col[0] if isinstance(col, tuple) else "predictions"
    cov_level = col[1] if isinstance(col, tuple) else coverage
    bound_type = col[2] if isinstance(col, tuple) else "bound"

    key = f"{var_name}_{cov_level}"
    if key not in intervals_dict:
        intervals_dict[key] = {"coverage": cov_level}
    intervals_dict[key][bound_type] = pred_intervals_copy[col].to_dict()

return {
    "success": True,
    "predictions": pred_intervals_copy.to_dict(),
    "intervals": intervals_dict,
    "horizon": len(fh) if hasattr(fh, "__len__") else fh,
    "coverage": coverage,
}

Copilot AI Apr 18, 2026


predict_interval returns "predictions": pred_intervals_copy.to_dict(), but pred_intervals_copy contains lower/upper interval columns (often with MultiIndex/tuple column keys). This means callers expecting point forecasts will instead get interval data (and tuple keys will stringify during JSON sanitization). Consider returning point predictions separately (e.g., via instance.predict(...)) and keep interval bounds exclusively under intervals in a stable JSON-friendly structure.
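One JSON-friendly shape along the lines the reviewer suggests is sketched below with a hypothetical restructure_intervals helper, using plain dicts to stand in for the DataFrame's (variable, coverage, lower/upper) columns:

```python
def restructure_intervals(columns, values):
    """Group (variable, coverage, bound) column tuples into a nested,
    JSON-friendly dict: {variable: {str(coverage): {"lower": ..., "upper": ...}}}.

    `columns` is a list of MultiIndex-style tuples; `values` maps each
    tuple to that column's data (here, dicts of index -> value)."""
    out = {}
    for var, cov, bound in columns:
        # Stringify the coverage level so keys survive JSON round-trips.
        out.setdefault(var, {}).setdefault(str(cov), {})[bound] = values[(var, cov, bound)]
    return out

cols = [("Airline", 0.9, "lower"), ("Airline", 0.9, "upper")]
vals = {cols[0]: {"1961-01": 410.0}, cols[1]: {"1961-01": 470.0}}
nested = restructure_intervals(cols, vals)
```

Point forecasts would then live in a separate "predictions" key, never mixed with the interval bounds.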

Comment on lines +51 to +57
>>> fit_predict_tool("est_abc123", "airline", horizon=12, coverage=0.9)
{
    "success": True,
    "predictions": {...},
    "intervals": {"Airline": {"coverage": 0.9, "lower": {...}, "upper": {...}}},
    "horizon": 12,
    "coverage": 0.9

Copilot AI Apr 18, 2026


The fit_predict_tool doc example for intervals doesn’t match the structure returned by Executor.predict_interval (which currently keys intervals by something like <var>_<coverage> and returns interval columns under predictions). This mismatch will confuse tool users; update the example/Return contract (or adjust predict_interval output) so docs reflect the actual JSON shape.

Comment thread src/sktime_mcp/server.py
        "default": 12,
    },
},
"required": ["estimator_handles"],

Copilot AI Apr 18, 2026


compare_estimators tool schema doesn’t require either dataset or data_handle, but compare_estimators_tool fails unless one is provided. Tighten the schema (e.g., oneOf requiring exactly one of the two) so clients get validation errors instead of runtime errors.

Suggested change
"required": ["estimator_handles"],

"required": ["estimator_handles"],
"oneOf": [
    {"required": ["dataset"]},
    {"required": ["data_handle"]},
],
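Beyond schema validation, the same exactly-one-of constraint could also be enforced defensively in the tool itself; a minimal sketch (hypothetical helper, not part of the PR):

```python
def validate_data_source(args):
    """Require exactly one of 'dataset' or 'data_handle' in the tool
    arguments, mirroring the suggested oneOf schema constraint."""
    provided = [k for k in ("dataset", "data_handle") if args.get(k)]
    if len(provided) != 1:
        return {"success": False,
                "error": "Provide exactly one of 'dataset' or 'data_handle'"}
    return {"success": True, "source": provided[0]}

print(validate_data_source({"dataset": "airline"}))
```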

Comment on lines +69 to +92
def _extract_metric_value(results: Any, metric_name: str) -> Optional[float]:
    """Extract the metric value from evaluate results.

    The column name format is 'test_<MetricName>' (e.g., 'test_MeanAbsolutePercentageError').

    Args:
        results: DataFrame from sktime evaluate()
        metric_name: Name of the metric to extract

    Returns:
        The metric value as a float, or None if not found
    """
    # Convert column names to find the matching metric
    metric_col = None
    for col in results.columns:
        if col == metric_name or col == f"test_{metric_name}":
            metric_col = col
            break
        # Check case-insensitive match
        col_lower = col.lower()
        metric_lower = metric_name.lower()
        if metric_lower in col_lower or col_lower == f"test_{metric_lower}":
            metric_col = col
            break

Copilot AI Apr 18, 2026


_extract_metric_value is unlikely to work with the metric abbreviations passed by compare_estimators_tool (e.g., "MAPE"). sktime.evaluate() typically emits columns like test_MeanAbsolutePercentageError (see existing test expectations), so searching for "MAPE" as a substring won't match and score will often be None, making comparisons fail. Consider extracting based on the scoring object’s name/class used in evaluate(...) (and/or building an explicit abbrev→full-column mapping) rather than fuzzy substring matching on the abbreviation.
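An explicit abbreviation-to-column mapping along the lines the reviewer suggests might look like this (a hypothetical sketch covering a subset of the metrics the PR lists; RMSE, SMAPE, and MASE would need entries of their own, since e.g. RMSE is MeanSquaredError with square_root=True):

```python
# Hypothetical mapping from the tool's metric abbreviations to the
# metric class names sktime's evaluate() uses in its result columns.
METRIC_COLUMNS = {
    "MAE": "MeanAbsoluteError",
    "MAPE": "MeanAbsolutePercentageError",
    "MSE": "MeanSquaredError",
    "MEDAE": "MedianAbsoluteError",
}

def metric_column(abbrev):
    """Resolve an abbreviation like 'MAPE' to the exact evaluate()
    result column, e.g. 'test_MeanAbsolutePercentageError'."""
    name = METRIC_COLUMNS.get(abbrev.upper())
    if name is None:
        raise ValueError(f"Unsupported metric: {abbrev}")
    return f"test_{name}"

print(metric_column("MAPE"))  # test_MeanAbsolutePercentageError
```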

Comment on lines +216 to +239
if not dataset and not data_handle:
    return {
        "success": False,
        "error": "Either dataset or data_handle must be provided",
    }

# Load data
if data_handle:
    if data_handle not in executor._data_handles:
        return {
            "success": False,
            "error": f"Unknown data handle: {data_handle}",
            "available_handles": list(executor._data_handles.keys()),
        }
    data_info = executor._data_handles[data_handle]
    y = data_info["y"]
    X = data_info.get("X")
else:
    data_result = executor.load_dataset(dataset)
    if not data_result["success"]:
        return data_result
    y = data_result["data"]
    X = data_result.get("exog")


Copilot AI Apr 18, 2026


compare_estimators_tool documentation/schema says "Use this or data_handle, not both", but the implementation allows both and silently prioritizes data_handle. This makes the API ambiguous for callers; it should reject requests where both are provided (or clearly document precedence).

Comment thread src/sktime_mcp/server.py
        ),
    },
},
"required": ["estimator_handle"],

Copilot AI Apr 18, 2026


fit_predict tool schema marks only estimator_handle as required, but fit_predict_tool requires dataset unless data_handle is provided. Because call_tool currently defaults missing dataset to an empty string, clients can send an invalid request that will fail at runtime. Consider using a oneOf schema to require exactly one of {dataset, data_handle} (and avoid defaulting dataset to "" in the dispatcher).

Suggested change
"required": ["estimator_handle"],

"oneOf": [
    {"required": ["estimator_handle", "dataset"]},
    {"required": ["estimator_handle", "data_handle"]},
],

Comment thread README.md
@@ -27,83 +27,90 @@ This MCP is **not** just documentation or static code analysis. It is a **semant
## 🛠️ Prerequisites

- **Python 3.10+** (3.9 is listed in `pyproject.toml` but the `mcp` package requires 3.10+)

Copilot AI Apr 18, 2026


README says Python 3.9 is listed in pyproject.toml, but pyproject.toml currently has requires-python = ">=3.10". This note is now inaccurate; please update/remove it to avoid confusing users about supported Python versions.

Suggested change
- **Python 3.10+** (3.9 is listed in `pyproject.toml` but the `mcp` package requires 3.10+)
- **Python 3.10+**

Comment on lines +163 to +170
def compare_estimators_tool(
    estimator_handles: list[str],
    dataset: Optional[str] = None,
    data_handle: Optional[str] = None,
    metric: str = "MAPE",
    cv_folds: int = 3,
    horizon: int = 12,
) -> dict[str, Any]:

Copilot AI Apr 18, 2026


New compare_estimators_tool behavior isn’t covered by tests (there are tests for evaluate_estimator_tool but none for this new tool). Add tests that cover: ranking order on a known dataset, handling of unknown handles, and metric extraction for the default metric (MAPE) to prevent regressions.
