* two DataSet (`__getitem__`):
  * simple: x_num: [length x channels]
*
class MyDataset(Dataset):
    def __init__(self,
                 data: dict,
                 t: np.ndarray,
                 groups: np.ndarray,
                 idx_target: Union[np.ndarray, None],
                 idx_target_future: Union[np.ndarray, None]) -> None:
        """
        Extension of the torch Dataset class. While training, the returned item is a batch containing the standard keys.

        Args:
            data (dict): a dictionary. Each key is a np.ndarray containing the data. The keys are:
                y: the target variable(s)
                x_num_past: the numerical past variables
                x_num_future: the numerical future variables
                x_cat_past: the categorical past variables
                x_cat_future: the categorical future variables
            t (np.ndarray): the time array related to the target variables
            groups (np.ndarray): the group identifier of each sample
            idx_target (Union[np.ndarray, None]): you can specify the indexes in the past data that represent the target features (for differential analysis or detrending strategies)
            idx_target_future (Union[np.ndarray, None]): you can specify the indexes in the future data that represent the target features (for differential analysis or detrending strategies)
        """
        self.data = data
        self.t = t
        self.groups = groups
        self.idx_target = idx_target
        self.idx_target_future = idx_target_future

    def __getitem__(self, idxs):
        sample = {}
        for k in self.data:
            sample[k] = self.data[k][idxs]
        if self.idx_target is not None:
            sample['idx_target'] = self.idx_target
        if self.idx_target_future is not None:
            sample['idx_target_future'] = self.idx_target_future
        return sample

    def __len__(self):
        # number of samples along the leading axis
        return len(self.t)
"""
Sampling via ``__getitem__`` returns a dictionary,
which always has the following str-keyed entries:
y : (n_timepoints_future, n_targets)
x_num_past : (n_timepoints_past, n_targets + n_past_covariates_numerical)
x_num_future : (n_timepoints_future, n_future_numerical)
x_cat_past : (n_timepoints_past, n_past_covariates_categorical)
x_cat_future: (n_timepoints_future, n_future_covariates_categorical)
idx_target : list containing the column indexes of x_num_past corresponding to y
DSIPTS neural networks currently do not use t, so it is not passed!
"""
""" The input to `__init__` expects a dictionary:
y : (n_samples, n_timepoints_future, n_targets)
x_num_past : (n_samples, n_timepoints_past, n_targets + n_past_covariates_numerical)
x_num_future : (n_samples, n_timepoints_future, n_future_numerical)
x_cat_past : (n_samples, n_timepoints_past, n_past_covariates_categorical)
x_cat_future: (n_samples, n_timepoints_future, n_future_covariates_categorical)
t :
idx_target : list containing the column indexes of x_num_past corresponding to y
"""

FK opinion: the resampling should be part of a pipeline to prepare a data loader.
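The two conventions differ only by the leading sample axis; a default-collate-style stack (sketch below, names hypothetical) rebuilds the batched `__init__`-shaped arrays from per-sample items:

```python
import numpy as np

def collate(samples):
    """Stack a list of per-sample dicts into one batched dict,
    reintroducing the leading n_samples axis of the __init__ input."""
    return {k: np.stack([s[k] for s in samples]) for k in samples[0]}

# 8 hypothetical items with per-sample shapes (no leading sample axis)
items = [{"y": np.zeros((4, 2)), "x_num_past": np.ones((16, 3))}
         for _ in range(8)]
batch = collate(items)
assert batch["y"].shape == (8, 4, 2)
assert batch["x_num_past"].shape == (8, 16, 3)
```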
class TimeSeries(Dataset):
"""PyTorch Dataset for storing raw time series from a pandas DataFrame.
This dataset follows the base raw time series dataset API in pytorch-forecasting.
A single sample corresponds to the i-th time series instance in the dataset.
Sampling via ``__getitem__`` returns a dictionary,
which always has the following str-keyed entries:
* t: tensor of shape (n_timepoints)
Time index for each time point in the past or present. Aligned with ``y``,
and ``x`` not ending in ``f``.
* y: tensor of shape (n_timepoints, n_targets)
Target values for each time point. Rows are time points, aligned with ``t``.
Columns are targets, aligned with ``col_t``.
* x: tensor of shape (n_timepoints, n_features)
Features for each time point. Rows are time points, aligned with ``t``.
* group: tensor of shape (n_groups)
Group ids for time series instance.
* st: tensor of shape (n_static_features)
Static features.
* y_cols: list of str of length (n_targets)
Names of columns of ``y``, in same order as columns in ``y``.
* x_cols: list of str of length (n_features)
Names of columns of ``x``, in same order as columns in ``x``.
* st_cols: list of str of length (n_static_features)
Names of entries of ``st``, in same order as entries in ``st``.
* y_types: list of str of length (n_targets)
Types of columns of ``y``, in same order as columns in ``y``.
Types can be "c" for categorical, "n" for numerical.
* x_types: list of str of length (n_features)
Types of columns of ``x``, in same order as columns in ``x``.
Types can be "c" for categorical, "n" for numerical.
* st_types: list of str of length (n_static_features)
Types of entries of ``st``, in same order as entries in ``st``.
* x_k: list of int of length (n_features)
Whether the feature is known in the future, encoded by 0 or 1,
in same order as columns in ``x``.
0 means the feature is not known in the future, 1 means it is known.
Optionally, the following str-keyed entries can be included:
* t_f: tensor of shape (n_timepoints_future)
Time index for each time point in the future.
Aligned with ``x_f``.
* x_f: tensor of shape (n_timepoints_future, n_features)
Known features for each time point in the future.
Rows are time points, aligned with ``t_f``.
* weight: tensor of shape (n_timepoints), only if weight is not None
* weight_f: tensor of shape (n_timepoints_future), only if weight is not None
Parameters
----------
data : pd.DataFrame
data frame with sequence data.
Column names must all be str, and contain str as referred to below.
data_future : pd.DataFrame, optional, default=None
data frame with future data.
Column names must all be str, and contain str as referred to below.
May contain only columns that are in time, group, weight, known, or static.
time : str, optional, default = first col not in group_ids, weight, target, static.
integer typed column denoting the time index within ``data``.
This column is used to determine the sequence of samples.
If there are no missing observations,
the time index should increase by ``+1`` for each subsequent sample.
The first time_idx for each series does not necessarily
have to be ``0``; any value is allowed.
target : str or List[str], optional, default = last column (at iloc -1)
column(s) in ``data`` denoting the forecasting target.
Can be categorical or numerical dtype.
group : List[str], optional, default = None
list of column names identifying a time series instance within ``data``.
This means that the ``group`` columns together uniquely identify an instance,
and ``group`` together with ``time`` uniquely identify a single observation
within a time series instance.
If ``None``, the dataset is assumed to be a single time series.
weight : str, optional, default=None
column name for weights.
If ``None``, it is assumed that there is no weight column.
num : list of str, optional, default = all columns with dtype in "fi"
list of numerical variables in ``data``,
list may also contain list of str, which are then grouped together.
cat : list of str, optional, default = all columns with dtype in "Obc"
list of categorical variables in ``data``,
list may also contain list of str, which are then grouped together
(e.g. useful for product categories).
known : list of str, optional, default = all variables
list of variables that change over time and are known in the future,
list may also contain list of str, which are then grouped together
(e.g. useful for special days or promotion categories).
unknown : list of str, optional, default = no variables
list of variables that are not known in the future,
list may also contain list of str, which are then grouped together
(e.g. useful for weather categories).
static : list of str, optional, default = all variables not in known, unknown
list of variables that do not change over time,
list may also contain list of str, which are then grouped together.
"""
Discussion thread for the API re-design for
pytorch-forecasting next 1.X and towards 2.0. Comments appreciated from everyone!

Link to enhancement proposal: sktime/enhancement-proposals#39
Context and goals
High-level directions:
* pytorch-forecasting 2.0. We will need to homogenize interfaces, consolidate design ideas, and ensure downwards compatibility.
* thuml project, also see [ENH] neural network libraries in thuml time-series-library sktime#7243.
* sktime.

High-level features for 2.0 with MoSCoW analysis:
* sktime and DSIPTS, but as closely to the pytorch level as possible. The API need not cover forecasters in general, only torch-based forecasters.
* skbase can be used to curate the forecasters as records, with tags, etc.
* thuml

Meeting notes
Summary of discussion on Dec 20, 2024 and prior
FYI @agobbifbk, @thawn, @sktime/core-developers.
High-level directions: same as listed under "Context and goals" above.

Todos:
0. update documentation on dsipts to signpost the above. README etc.
Roadmap planning Jan 15, 2025
Attendees

Prioritization
* data layer - dataset, dataloader
* model layer - base classes, configs, unified API
* foundation models, model hubs
* documentation
* benchmarking
* mlops and scaling (distributed, cluster etc)
* more learning tasks supported
Tech meeting Jan 20, 2025
Attendees:
Agenda
* `__getitem__` output convention
* `__init__` input convention(s)
* References
Umbrella issue design
#1736
Notes
number of classes, dataset, dataloader, "bottleneck" idea
AG: should be making it as modular as possible
* (not one `TimeSeriesDataSet` that does everything)
* option 1: at `__init__` - more memory intensive, clear distinction between train and inference; naive implementation needs to load everything in memory
* option 2: at `__getitem__` time (dataset or dataloader) - feels this might be compute intensive, if we are recomputing and not caching etc
* `__getitem__` should be as general as possible

FHN:
* `__getitem__` time, option 2.

PB:
S:
T:
* `__getitem__` protocol

FK:
idea of "bottleneck" or "least common denominator" did not come up, surprised (came up before)
think we need at least one class, likely a `DataSet` for "raw time series" (collection of, with all metadata)

Benedikt (not here today) also suggested this idea, and that `DataSet`-s could depend on each other

current best guess for a structure:
* `DataSet`-s, these inherit from a common base and handle pandas as well as hard drive data
* `DataSet`-s, these could add re-sampling on top, normalization etc
* `DataLoader`-s, these are specific to data sets and classes of neural networks

alternative structure:
* `DataSet`-s only have minimal representation of "time series"
* `DataLoader`-s that adapt data sets to neural networks

T: one of the "final layer" classes - or middle layer classes - could be an adapter to the 1.X API of pytorch-forecasting (current), ensuring downwards compatibility.
FK: big question for me is how many "layers" to have, e.g., two dataset layers and one data loader layer, or single dataset layer and one data loader layer (where data loaders do more).
T: had assumed we will use standard pytorch dataloader - if that is the case, we will need two datasets for downwards compatibility.
FHN: if we keep using vanilla torch dataloader, we need two data set layers
* is this a contradiction to the dataset being "minimal"?
* FK: thinks not a contradiction, since there are two layers of datasets
* lower layer is "minimal" as discussed
* 2nd layer is specialized and specific to neural network(s)
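A toy sketch of this two-layer idea (all class names hypothetical): a minimal lower-layer dataset of whole series, and a specialized windowing layer on top whose items a vanilla dataloader could batch:

```python
class RawTimeSeriesDataset:
    """Lower layer: one item = one whole series, minimal representation."""
    def __init__(self, series):          # series: list of lists of floats
        self.series = series
    def __len__(self):
        return len(self.series)
    def __getitem__(self, i):
        return self.series[i]

class WindowedDataset:
    """Upper layer: specialized for a network, slices past/future windows."""
    def __init__(self, raw, past, future):
        self.raw, self.past, self.future = raw, past, future
        # index every (series, start) pair once; slices are taken lazily
        self.index = [(i, s) for i in range(len(raw))
                      for s in range(len(raw[i]) - past - future + 1)]
    def __len__(self):
        return len(self.index)
    def __getitem__(self, j):
        i, s = self.index[j]
        x = self.raw[i]
        return {"x_past": x[s:s + self.past],
                "y": x[s + self.past:s + self.past + self.future]}

raw = RawTimeSeriesDataset([[0., 1., 2., 3., 4.], [10., 11., 12., 13.]])
win = WindowedDataset(raw, past=2, future=1)
assert len(win) == 5
assert win[0] == {"x_past": [0., 1.], "y": [2.]}
```

With this split, the lower layer stays "minimal" while all network-specific sampling lives in the upper layer, which is what makes a standard dataloader sufficient.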
FK: feels there is convergence but with two open questions:
(`__getitem__` format to be handled in next agenda point)

AG: there is one more complication - "stacked models", which are composites that use other models and their outputs to generate improved outputs
FK - we could have both options with a flag or two classes, this is really about internals of the class and does not impact
T: commenting about "stacked models"
strong opinions on using vanilla dataloader vs two dataset layers, vs custom dataloader and one dataset
`__getitem__` output convention

FHN: unsure
T: "as simple as possible"
* `dict` and arrays (tensors etc) inside

S: do we have a clear picture of what should be there?
T: would prefer pure tensors
Tech meeting Jan 24, 2025
Attendees:

Notes
Recap
need to define dataset/dataloader layers
`__init__`, and "output API", `__getitem__`

FK: suggest to focus on output first
`__getitem__` designs based on last time's

AG design suggestion
Current DSIPTS str
FK comments:
this looks like the top layer. It is closer to the "raw" or "bottleneck" layer, but it already has the data resampled.
The "sample" index is the first index in the input to `__init__`.

FK opinion: the resampling should be part of a pipeline to prepare a data loader.
So we have different artefacts:
* `DataSet`. Obtained from raw data via resampling/normalization utility
* `DataLoader` using the output of `__getitem__`

observation:
pytorch-forecasting covers A-C in single DataSet
DSIPTS covers B-C in single DataSet, and A-B in utilities
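A sketch of that A-to-B step as a standalone utility (function name hypothetical): a resampling/windowing function that turns a raw series into the sample-indexed arrays a DataSet `__init__` would receive:

```python
import numpy as np

def make_windows(y, past, future):
    """Stack sliding windows so axis 0 becomes the sample index,
    matching the (n_samples, n_timepoints, ...) __init__ convention."""
    n = len(y) - past - future + 1
    x_num_past = np.stack([y[s:s + past] for s in range(n)])
    y_future = np.stack([y[s + past:s + past + future] for s in range(n)])
    return {"x_num_past": x_num_past, "y": y_future}

data = make_windows(np.arange(10.0), past=4, future=2)
assert data["x_num_past"].shape == (5, 4)
assert data["y"].shape == (5, 2)
```

Keeping this step outside the dataset keeps the dataset itself close to the "bottleneck" layer, while the utility can be composed into a pipeline that prepares a data loader.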
FK: last time, agreed we should have two layers DataSet
but none of current solutions has the "bottleneck" layer
ptf should take DataSet instead of DataFrame
DSIPTS has A-B outside torch idiomatic structures
alternatively, we could have a custom class handle conversions up to the dataloader format, or the input required for the dataset closest to the model
FK design suggestion