TODO list:
- write narrative everywhere
- Need to define the difference between the Poisson and Gamma D2 scores.
- DONE write exercise instructions + code + placeholder + solution marker
- DONE maybe try to swap RandomForestRegressor for XGBoost
- finish uncertainty section: quantile regression as classification
- DONE extract plotting functions in a helper module
- DONE enable subsampling on the features and targets, or maybe on the prediction_time node.
- Use dataframes / skrub to fetch and align time-structured data source to build exogenous features that are available for the forecast horizon of choice at the time of prediction.
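The alignment step above could be sketched with a pandas backward as-of join (a hypothetical stand-in for the skrub-based version; the column names and toy data are made up):

```python
import pandas as pd

# Target series: hourly load measurements.
load = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=6, freq="h"),
    "load": [10, 12, 11, 13, 14, 12],
})

# Exogenous source on a coarser grid (every 2 hours).
weather = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=3, freq="2h"),
    "temp": [5.0, 6.0, 4.5],
})

# direction="backward" keeps only the most recent weather value that was
# already available at each prediction timestamp.
aligned = pd.merge_asof(load, weather, on="time", direction="backward")
```

The same idea extends to any time-stamped exogenous source, as long as the join respects what is actually known at prediction time.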
- Use skrub expressions to be able to do model selection on the pipeline steps:
- lag variables included or not and lag amount
  - Comment on "system-induced lag": at prediction time in a deployment setting, the most recent values might be missing in the system even if they show up in historical data. In practice this means we should create lag features with a minimum lag of a few hours, to leave time for recent measurements to reach the ML prediction system.
  - Iceberg / Delta Lake can explicitly record system-lag info in historical data.
- windowing aggregates included or not and window size
- weather features granularity
- calendar features
- holiday feature
- use a skrub choice tree + built-in random search
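The "system-induced lag" note above can be sketched as follows (a minimal pandas example with made-up column names; the 3-hour minimum lag is an assumed ingestion delay, not a measured one):

```python
import numpy as np
import pandas as pd

# Toy hourly series standing in for the real target.
index = pd.date_range("2024-01-01", periods=48, freq="h")
df = pd.DataFrame({"load": np.arange(48.0)}, index=index)

# Assumed delay for recent measurements to reach the prediction system:
# never create a lag feature shorter than this.
MIN_LAG_HOURS = 3

for lag in (MIN_LAG_HOURS, 24):
    df[f"load_lag_{lag}h"] = df["load"].shift(lag)
```

At deployment time, features built this way only reference values that had time to land in the system.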
- Train a t+1 prediction model and evaluate it:
- Time-aware cross-validation.
- MSE/R2, Tweedie deviance or MAPE.
- Lorenz curve
- reliability diagram
- Binned residual analysis
- models:
  - HistGradientBoostingRegressor
  - Exercise: pipeline with missing value support: SimpleImputer with indicator, Spline, Nystroem, RidgeCV or TableVectorizer
- hyperparameter tuning + analysis of the CV results of the best model.
- Train a family of t+h direct models and evaluate them:
- Plot predictions at different time points.
- Compute per-horizon metrics + metrics integrated over all horizons of interest.
- Show results as bar plots by h (one for R2, one for MAPE); compute the mean with min-max error bars.
- Consider models that are natively multioutput: RandomForestRegressor (with min_samples_leaf set to 30) or XGBoost with multioutput vector leaves.
- Alternatives to a family of t+h direct models:
- Recursive modeling: show limitations on synthetic data (show with mlforecast, darts or sktime)
- Use vector output models with concatenated future covariates.
- Pass h as an extra feature and generate expanded datasets for many h values with concatenated future covariates?
- Quantify uncertainty in predictions with quantile regressors and evaluate them:
- Study pinball loss, coverage / width of the uncertainty intervals + reliability diagrams + Lorenz curve
- Study if conformal predictions can improve upon this (optional)
- Show limitation of split conformal predictions:
- Show CQR.
- Non-exchangeable conformal prediction.
- Regression as probabilistic classification reduction.
- https://github.com/ogrisel/notebooks/blob/3a3d2321d4b81d0f089fd13aef96fd27745b505f/quantile_regression_as_classification.ipynb
- https://github.com/ogrisel/euroscipy-2022-time-series/blob/main/plot_time_series_feature_engineering.ipynb
- Auto-regressive sampling to sample from the joint future distribution.
- TabICL or TabPFN on calendar + exogenous features (without lag features).
- Dealing with drifts and trends via multiplicative preprocessing of the target.
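One way to read the "multiplicative preprocessing" item above: estimate a smooth trend, divide the target by it before fitting, and multiply predictions back at forecast time. A toy sketch with a low-degree polynomial trend estimate (the trend form and data are assumptions):

```python
import numpy as np

t = np.arange(365, dtype=float)
trend = 1.0 + 0.002 * t                          # slow multiplicative drift
y = trend * (10.0 + np.sin(2 * np.pi * t / 7))   # weekly pattern, scaled up over time

# Estimate the trend with a degree-1 polynomial fit...
coef = np.polyfit(t, y, deg=1)
trend_est = np.polyval(coef, t)

# ...train the forecaster on the detrended target...
y_detrended = y / trend_est

# ...and map predictions back to the original scale at prediction time:
# y_pred = model.predict(X_future) * trend_est_future
```

Dividing (rather than subtracting) keeps the seasonal amplitude roughly constant when the drift scales the whole series.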
- Making models with lagged features robust to random missing values by injecting missing data at training time (possibly by feature blocks).
- Using sample weights to deal with contiguous data quality problems.
- Adapt the main skrub pipeline to treat weather data as past covariates instead of future covariates.
- Exercise: show how to use subsampling.
- Exercise: custom splitter with metadata routing on datetime info: year-based splitting with the year passed as a feature.
  - A clean implementation would require making SkrubPipeline implement the get_metadata_routing method like sklearn.pipeline.Pipeline does.
- Exercise: use a sklearn.ensemble.RandomForestRegressor to handle multioutput horizon forecasts and show that it handles the multioutput problem out of the box, and is thus faster than using a sklearn.multioutput.MultiOutputRegressor.