TimeSeriesCV

TimeSeriesCV(self, frequency, train_size, forecast_horizon, gap, stride=0, window='rolling', mode='backward', split_limit=None, **kwargs)

TimeSeriesCV is a subclass of TimeBasedSplit that sets the default mode to ‘backward’ and accepts an optional split_limit, which caps the output at the first n time series cross-validation splits.

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| frequency | str | The frequency (or time unit) of the time series. Must be one of “days”, “seconds”, “microseconds”, “milliseconds”, “minutes”, “hours”, “weeks”. These are the only valid values for the unit argument of timedelta from the Python datetime standard library. | required |
| train_size | int | The minimum number of time units required in the train set. | required |
| forecast_horizon | int | The number of time units to forecast. | required |
| gap | int | The number of time units to skip between the end of the train set and the start of the forecast set. | required |
| stride | int | How many time units to move between consecutive splits. If None (or 0), the stride equals forecast_horizon. | 0 |
| window | str | The type of window to use, either “rolling” or “expanding”. | 'rolling' |
| mode | ModeType | The mode to use for cross-validation, either “forward” or “backward”. | 'backward' |
| split_limit | int | The maximum number of splits to return. If not provided, all splits are returned. | None |
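
The two rules in the table that are easiest to misread are the mapping from frequency to a timedelta unit and the stride fallback (the example below reports stride_ = 5 for stride=0 and forecast_horizon=5). A minimal standard-library sketch of both follows; `to_timedelta` and `effective_stride` are hypothetical helper names used for illustration and are not part of pytimetk.

```python
from datetime import timedelta

# Frequency names listed in the parameter table above.
VALID_FREQUENCIES = (
    "days", "seconds", "microseconds", "milliseconds", "minutes", "hours", "weeks",
)


def to_timedelta(frequency, n_units):
    """Hypothetical helper: turn (frequency, n_units) into a timedelta."""
    if frequency not in VALID_FREQUENCIES:
        raise ValueError(f"frequency must be one of {VALID_FREQUENCIES}, got {frequency!r}")
    # Each valid frequency name is also a keyword argument of timedelta.
    return timedelta(**{frequency: n_units})


def effective_stride(stride, forecast_horizon):
    """Hypothetical helper: a stride of None or 0 falls back to forecast_horizon."""
    return stride or forecast_horizon


train_delta = to_timedelta("days", 10)   # length of a 10-day train window
stride_units = effective_stride(0, 5)    # stride=0 with forecast_horizon=5
print(train_delta, stride_units)
```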

Raises:

ValueError:

  • If frequency is not one of “days”, “seconds”, “microseconds”, “milliseconds”, “minutes”, “hours”, “weeks”.
  • If window is not one of “rolling” or “expanding”.
  • If mode is not one of “forward” or “backward”.
  • If train_size or forecast_horizon is not strictly positive, or if gap or stride is negative (gap=0 and stride=0 are accepted, as in the example below).

TypeError:

  • If train_size, forecast_horizon, gap or stride are not of type int.

Examples:

import pandas as pd
import numpy as np
from pytimetk import TimeSeriesCV

RNG = np.random.default_rng(seed=42)

dates = pd.Series(pd.date_range("2023-01-01", "2023-01-31", freq="D"))
size = len(dates)

df = (
    pd.concat(
        [
            pd.DataFrame(
                {
                    "time": pd.date_range(start, end, periods=_size, inclusive="left"),
                    "a": RNG.normal(size=_size - 1),
                    "b": RNG.normal(size=_size - 1),
                }
            )
            for start, end, _size in zip(dates[:-1], dates[1:], RNG.integers(2, 24, size - 1))
        ]
    )
    .reset_index(drop=True)
    .assign(y=lambda t: t[["a", "b"]].sum(axis=1) + RNG.normal(size=t.shape[0]) / 25)
)

# Set index
df.set_index("time", inplace=True)

# Create an X DataFrame and a y Series
X, y = df.loc[:, ["a", "b"]], df["y"]

# Initialize TimeSeriesCV with desired parameters
tscv = TimeSeriesCV(
    frequency="days",
    train_size=10,
    forecast_horizon=5,
    gap=0,
    stride=0,
    split_limit=3  # Limiting to 3 splits
)

tscv
TimeSeriesCV(
    frequency_ = days
    train_size_ = 10
    forecast_horizon_ = 5
    gap_ = 0
    stride_ = 5
    window_ = rolling
)
# Creates a split generator
splits = tscv.split(X, y)

for X_train, X_forecast, y_train, y_forecast in splits:
    print(X_train)
    print(X_forecast)
                            a         b
time                                   
2023-01-15 22:30:00 -0.743588 -1.602260
2023-01-16 00:00:00 -1.251647  0.196776
2023-01-16 01:20:00 -1.601278  0.820528
2023-01-16 02:40:00 -0.794136 -0.393741
2023-01-16 04:00:00  0.439637  0.521167
...                       ...       ...
2023-01-25 16:00:00  0.446322  2.549328
2023-01-25 17:20:00 -0.806599 -0.405269
2023-01-25 18:40:00 -1.282635 -1.936838
2023-01-25 20:00:00  0.713820 -0.310484
2023-01-25 21:20:00  0.241645 -0.286223

[127 rows x 2 columns]
                            a         b
time                                   
2023-01-25 22:40:00 -0.613977 -0.189924
2023-01-26 00:00:00 -1.113388  1.667888
2023-01-26 01:36:00  0.579561 -1.103741
2023-01-26 03:12:00  0.524507  0.587259
2023-01-26 04:48:00 -1.494406  0.319400
...                       ...       ...
2023-01-30 09:36:00 -0.299265  0.635371
2023-01-30 12:00:00 -1.015068  0.740014
2023-01-30 14:24:00  2.048756  0.636906
2023-01-30 16:48:00  1.785168  0.340791
2023-01-30 19:12:00  1.136049 -1.783611

[65 rows x 2 columns]
                            a         b
time                                   
2023-01-11 00:00:00  1.002758  1.876845
2023-01-11 02:00:00  0.538115 -0.853243
2023-01-11 04:00:00  1.337398 -0.287383
2023-01-11 06:00:00 -0.154506 -1.463442
2023-01-11 08:00:00 -0.695943 -0.590707
...                       ...       ...
2023-01-20 09:36:00  0.366531  0.895185
2023-01-20 12:00:00 -0.286249 -0.719480
2023-01-20 14:24:00  0.453966 -1.502503
2023-01-20 16:48:00 -0.308673 -2.964529
2023-01-20 19:12:00  0.935547 -0.543496

[145 rows x 2 columns]
                            a         b
time                                   
2023-01-20 21:36:00 -1.831406  2.420415
2023-01-21 00:00:00  0.434884  1.628937
2023-01-21 02:00:00 -0.559572 -0.970150
2023-01-21 04:00:00  0.465080 -0.887696
2023-01-21 06:00:00 -1.560958  1.335784
...                       ...       ...
2023-01-25 16:00:00  0.446322  2.549328
2023-01-25 17:20:00 -0.806599 -0.405269
2023-01-25 18:40:00 -1.282635 -1.936838
2023-01-25 20:00:00  0.713820 -0.310484
2023-01-25 21:20:00  0.241645 -0.286223

[65 rows x 2 columns]
                                      a         b
time                                             
2023-01-05 21:36:00.000000000  0.072130  0.835111
2023-01-06 00:00:00.000000000  0.356871 -0.812941
2023-01-06 01:15:47.368421052  1.463303 -0.415357
2023-01-06 02:31:34.736842105 -1.188763 -0.612097
2023-01-06 03:47:22.105263157 -0.639752 -0.140791
...                                 ...       ...
2023-01-15 15:00:00.000000000  0.383394 -2.003522
2023-01-15 16:30:00.000000000  0.999824  1.604254
2023-01-15 18:00:00.000000000 -1.058536 -0.457699
2023-01-15 19:30:00.000000000 -0.125009  0.107880
2023-01-15 21:00:00.000000000  1.481456  1.309551

[129 rows x 2 columns]
                            a         b
time                                   
2023-01-15 22:30:00 -0.743588 -1.602260
2023-01-16 00:00:00 -1.251647  0.196776
2023-01-16 01:20:00 -1.601278  0.820528
2023-01-16 02:40:00 -0.794136 -0.393741
2023-01-16 04:00:00  0.439637  0.521167
...                       ...       ...
2023-01-20 09:36:00  0.366531  0.895185
2023-01-20 12:00:00 -0.286249 -0.719480
2023-01-20 14:24:00  0.453966 -1.502503
2023-01-20 16:48:00 -0.308673 -2.964529
2023-01-20 19:12:00  0.935547 -0.543496

[62 rows x 2 columns]
# Also, you can use `glimpse()` to print summary information about the splits

tscv.glimpse(y)
Split Number: 1
Train Shape: (127,), Forecast Shape: (65,)
Train Period: 2023-01-15 22:30:00 to 2023-01-25 21:20:00
Forecast Period: 2023-01-25 22:40:00 to 2023-01-30 19:12:00

Split Number: 2
Train Shape: (145,), Forecast Shape: (65,)
Train Period: 2023-01-11 00:00:00 to 2023-01-20 19:12:00
Forecast Period: 2023-01-20 21:36:00 to 2023-01-25 21:20:00

Split Number: 3
Train Shape: (129,), Forecast Shape: (62,)
Train Period: 2023-01-05 21:36:00 to 2023-01-15 21:00:00
Forecast Period: 2023-01-15 22:30:00 to 2023-01-20 19:12:00
# You can also plot the splits by calling `plot()` on the `TimeSeriesCV` instance with the `y` Pandas series

tscv.plot(y)
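
The periods printed by `glimpse()` above are consistent with windows that are stepped backward from the end of the series, forecast_horizon days at a time. The sketch below reproduces that arithmetic with the standard library only. It is not pytimetk's implementation: the function name, the half-open [start, end) convention, the stopping rule, and the end timestamp 2023-01-30 21:36 are assumptions chosen so that the windows bracket the printed periods, and it covers only window='rolling' with mode='backward'.

```python
from datetime import datetime, timedelta


def backward_rolling_windows(start, end, frequency, train_size,
                             forecast_horizon, gap=0, stride=0, split_limit=None):
    """Yield (train_start, train_end, forecast_start, forecast_end), newest first.

    Every window is treated as half-open: [window_start, window_end).
    """
    def units(n):
        return timedelta(**{frequency: n})

    step = units(stride or forecast_horizon)  # stride 0/None -> forecast_horizon
    forecast_end = end
    produced = 0
    while split_limit is None or produced < split_limit:
        forecast_start = forecast_end - units(forecast_horizon)
        train_end = forecast_start - units(gap)
        train_start = train_end - units(train_size)
        if train_start < start:
            # Assumed stopping rule: no room left for a full train window.
            break
        yield train_start, train_end, forecast_start, forecast_end
        forecast_end -= step
        produced += 1


windows = list(backward_rolling_windows(
    start=datetime(2023, 1, 1),
    end=datetime(2023, 1, 30, 21, 36),  # assumed end of the series
    frequency="days", train_size=10, forecast_horizon=5,
    gap=0, stride=0, split_limit=3,
))
for train_start, train_end, forecast_start, forecast_end in windows:
    print(train_start, train_end, forecast_start, forecast_end)
```

With these inputs the first train window is [2023-01-15 21:36, 2023-01-25 21:36), which contains the Train Period printed for split 1.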

Methods

| Name | Description |
|---|---|
| glimpse | Prints summary information about the splits, focusing on the first two arrays. |
| plot | Plots the cross-validation folds on a single plot with folds on the y-axis and dates on the x-axis using filled Scatter traces. |
| split | Returns a generator of split arrays. |

glimpse

TimeSeriesCV.glimpse(*arrays, time_series=None)

Prints summary information about the splits, focusing on the first two arrays.

Arguments:

  • *arrays: The arrays to split. Only the first one is used for the summary information.
  • time_series: The time series used for splitting. If not provided, the index of the first array is used. Default is None.
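
As a rough illustration of what the printed summary contains, the sketch below formats one split from two lists of timestamps. `describe_split` is a hypothetical helper, not a pytimetk function; it only mimics the layout of the example output shown earlier.

```python
from datetime import datetime, timedelta


def describe_split(number, train_index, forecast_index):
    """Hypothetical helper: summarize one split in the layout of the glimpse() output."""
    return "\n".join([
        f"Split Number: {number}",
        f"Train Shape: ({len(train_index)},), Forecast Shape: ({len(forecast_index)},)",
        f"Train Period: {min(train_index)} to {max(train_index)}",
        f"Forecast Period: {min(forecast_index)} to {max(forecast_index)}",
    ])


# Toy data: 8 daily timestamps, the first 6 for training and the last 2 for forecasting.
days = [datetime(2023, 1, 1) + timedelta(days=i) for i in range(8)]
summary = describe_split(1, days[:6], days[6:])
print(summary)
```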

plot

TimeSeriesCV.plot(y, time_series=None, color_palette=None, bar_height=0.3, title='Time Series Cross-Validation Plot', x_lab='', y_lab='Fold', x_axis_date_labels=None, base_size=11, width=None, height=None, engine='plotly')

Plots the cross-validation folds on a single plot with folds on the y-axis and dates on the x-axis using filled Scatter traces.

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| y | pd.Series | The target time series as a pandas Series. | required |
| time_series | pd.Series | The time series used for splitting. If not provided, the index of y is used. | None |
| color_palette | Optional[Union[dict, list, str]] | The color palette to use for the train and forecast sets. If not provided, the default colors are used. | None |
| bar_height | float | The height of each bar in the plot. | 0.3 |
| title | str | The title of the plot. | 'Time Series Cross-Validation Plot' |
| x_lab | str | The label for the x-axis. Default is an empty string. | '' |
| y_lab | str | The label for the y-axis. | 'Fold' |
| x_axis_date_labels | str | The format of the date labels on the x-axis. | None |
| base_size | float | The base font size for the plot. | 11 |
| width | Optional[int] | The width of the plot in pixels. | None |
| height | Optional[int] | The height of the plot in pixels. | None |
| engine | str | The plotting engine to use. | 'plotly' |

split

TimeSeriesCV.split(*arrays, time_series=None, start_dt=None, end_dt=None, return_splitstate=False)

Returns a generator of split arrays.

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| *arrays | TL | The arrays to split. Must have the same length as time_series. | () |
| time_series | SeriesLike[DateTimeLike] | The time series used to create boolean masks for splits. If not provided, the method will try to use the index of the first array (if it is a DataFrame or Series) as the time series. | None |
| start_dt | NullableDatetime | The start of the time period. If provided, it is used in place of time_series.min(). | None |
| end_dt | NullableDatetime | The end of the time period. If provided, it is used in place of time_series.max(). | None |
| return_splitstate | bool | Whether to return the SplitState instance for each split. | False |

Returns:

A generator of tuples of arrays containing the training and forecast data. If split_limit is set, yields only up to split_limit splits.
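
The two ideas in this section, boolean masks over time_series and a cap on the number of splits, can be sketched with the standard library. This is an illustration under assumptions, not the library's code: `mask_between`, `apply_mask`, and `toy_splits` are hypothetical names, windows are assumed to be half-open, and `itertools.islice` is just one way to stop a generator after split_limit items.

```python
from datetime import datetime, timedelta
from itertools import islice


def mask_between(time_series, start, end):
    """Boolean mask selecting timestamps in the half-open window [start, end)."""
    return [start <= t < end for t in time_series]


def apply_mask(array, mask):
    """Keep the items of array whose mask entry is True (lengths must match)."""
    if len(array) != len(mask):
        raise ValueError("array and time_series must have the same length")
    return [value for value, keep in zip(array, mask) if keep]


def toy_splits(values, time_series, train_days, forecast_days):
    """Yield (train, forecast) pairs, stepping back from just after the last timestamp."""
    forecast_end = max(time_series) + timedelta(days=1)
    while True:
        forecast_start = forecast_end - timedelta(days=forecast_days)
        train_start = forecast_start - timedelta(days=train_days)
        if train_start < min(time_series):
            return  # no room left for a full train window
        train = apply_mask(values, mask_between(time_series, train_start, forecast_start))
        forecast = apply_mask(values, mask_between(time_series, forecast_start, forecast_end))
        yield train, forecast
        forecast_end = forecast_start


times = [datetime(2023, 1, 1) + timedelta(days=i) for i in range(12)]
values = list(range(12))
split_limit = 2
# islice stops consuming the generator after split_limit splits.
limited = list(islice(toy_splits(values, times, train_days=4, forecast_days=2), split_limit))
print(limited)
```

Because the generator is lazy, capping it this way avoids computing the masks for splits that are never requested.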