TimeSeriesCV

TimeSeriesCV(
    frequency,
    train_size,
    forecast_horizon,
    gap,
    stride=0,
    window='rolling',
    mode='backward',
    split_limit=None,
    **kwargs,
)

TimeSeriesCV is a subclass of TimeBasedSplit whose mode defaults to ‘backward’. It also accepts an optional split_limit that restricts the output to the first n cross-validation splits.

Parameters

Name	Type	Description	Default
frequency	str	The frequency (time unit) of the time series. Must be one of “days”, “seconds”, “microseconds”, “milliseconds”, “minutes”, “hours”, “weeks”, “months” or “years”. These are the valid values for the `unit` argument of `relativedelta` from the Python `dateutil` library.	required
train_size	int	Defines the minimum number of time units required to be in the train set.	required
forecast_horizon	int	Specifies the number of time units to forecast.	required
gap	int	Sets the number of time units to skip between the end of the train set and the start of the forecast set.	required
stride	int	How many time units to move forward after each split. If `None` or 0, the stride equals `forecast_horizon`.	`0`
window	str	The type of window to use, either “rolling” or “expanding”.	`'rolling'`
mode	ModeType	The mode to use for cross-validation, either “forward” or “backward”.	`'backward'`
split_limit	int	The maximum number of splits to return. If not provided, all splits are returned.	`None`
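
The way these parameters interact can be sketched in a few lines. This is a minimal illustration, not pytimetk's implementation: `backward_windows` is a hypothetical helper, the unit is fixed to days, and every window is assumed to be half-open and anchored at the end of the series. The library's internals may differ.

```python
from datetime import datetime, timedelta
from itertools import islice


def backward_windows(start, end, train_size, forecast_horizon, gap=0,
                     stride=0, window="rolling", split_limit=None):
    """Yield (train_start, train_end, forecast_start, forecast_end), newest first.

    Hypothetical helper: units are days and each interval is half-open,
    i.e. [start, end).
    """
    horizon = timedelta(days=forecast_horizon)
    min_train = timedelta(days=train_size)
    # A stride of 0 (or None) falls back to the forecast horizon.
    step = timedelta(days=stride or forecast_horizon)

    def generate():
        forecast_end = end
        while True:
            forecast_start = forecast_end - horizon
            train_end = forecast_start - timedelta(days=gap)
            rolling_start = train_end - min_train
            if rolling_start < start:
                return  # not enough history left for a full train window
            # Assumption: an expanding window always starts at the first date.
            train_start = rolling_start if window == "rolling" else start
            yield train_start, train_end, forecast_start, forecast_end
            forecast_end -= step

    # islice(..., None) yields everything, mirroring split_limit=None.
    return islice(generate(), split_limit)


splits = list(backward_windows(
    start=datetime(2023, 1, 1), end=datetime(2023, 1, 31),
    train_size=10, forecast_horizon=5, split_limit=3,
))
for train_start, train_end, forecast_start, forecast_end in splits:
    print(f"train [{train_start:%m-%d}, {train_end:%m-%d})  "
          f"forecast [{forecast_start:%m-%d}, {forecast_end:%m-%d})")
```

In this sketch, leaving `split_limit` at `None` yields every window that still fits a full train period, while `window="expanding"` pins each train window to the first date.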

Raises:

ValueError:

If frequency is not one of “days”, “seconds”, “microseconds”, “milliseconds”, “minutes”, “hours”, “weeks”, “months” or “years”.
If window is not one of “rolling” or “expanding”.
If mode is not one of “forward” or “backward”.
If train_size or forecast_horizon are not strictly positive, or if gap or stride are negative.

TypeError:

If train_size, forecast_horizon, gap or stride are not of type int.

Examples:

import pandas as pd
import numpy as np
from pytimetk import TimeSeriesCV

RNG = np.random.default_rng(seed=42)

dates = pd.Series(pd.date_range("2023-01-01", "2023-01-31", freq="D"))
size = len(dates)

df = (
    pd.concat(
        [
            pd.DataFrame(
                {
                    "time": pd.date_range(start, end, periods=_size, inclusive="left"),
                    "a": RNG.normal(size=_size - 1),
                    "b": RNG.normal(size=_size - 1),
                }
            )
            for start, end, _size in zip(dates[:-1], dates[1:], RNG.integers(2, 24, size - 1))
        ]
    )
    .reset_index(drop=True)
    .assign(y=lambda t: t[["a", "b"]].sum(axis=1) + RNG.normal(size=t.shape[0]) / 25)
)

# Set index
df.set_index("time", inplace=True)

# Create an X DataFrame and a y Series
X, y = df.loc[:, ["a", "b"]], df["y"]

# Initialize TimeSeriesCV with desired parameters
tscv = TimeSeriesCV(
    frequency="days",
    train_size=10,
    forecast_horizon=5,
    gap=0,
    stride=0,
    split_limit=3  # Limiting to 3 splits
)

tscv

TimeSeriesCV(
    frequency_ = days
    train_size_ = 10
    forecast_horizon_ = 5
    gap_ = 0
    stride_ = 5
    window_ = rolling
)

# Creates a split generator
splits = tscv.split(X, y)

for X_train, X_forecast, y_train, y_forecast in splits:
    print(X_train)
    print(X_forecast)

                            a         b
time                                   
2023-01-15 22:30:00 -0.743588 -1.602260
2023-01-16 00:00:00 -1.251647  0.196776
2023-01-16 01:20:00 -1.601278  0.820528
2023-01-16 02:40:00 -0.794136 -0.393741
2023-01-16 04:00:00  0.439637  0.521167
...                       ...       ...
2023-01-25 16:00:00  0.446322  2.549328
2023-01-25 17:20:00 -0.806599 -0.405269
2023-01-25 18:40:00 -1.282635 -1.936838
2023-01-25 20:00:00  0.713820 -0.310484
2023-01-25 21:20:00  0.241645 -0.286223

[127 rows x 2 columns]
                            a         b
time                                   
2023-01-25 22:40:00 -0.613977 -0.189924
2023-01-26 00:00:00 -1.113388  1.667888
2023-01-26 01:36:00  0.579561 -1.103741
2023-01-26 03:12:00  0.524507  0.587259
2023-01-26 04:48:00 -1.494406  0.319400
...                       ...       ...
2023-01-30 09:36:00 -0.299265  0.635371
2023-01-30 12:00:00 -1.015068  0.740014
2023-01-30 14:24:00  2.048756  0.636906
2023-01-30 16:48:00  1.785168  0.340791
2023-01-30 19:12:00  1.136049 -1.783611

[65 rows x 2 columns]
                            a         b
time                                   
2023-01-11 00:00:00  1.002758  1.876845
2023-01-11 02:00:00  0.538115 -0.853243
2023-01-11 04:00:00  1.337398 -0.287383
2023-01-11 06:00:00 -0.154506 -1.463442
2023-01-11 08:00:00 -0.695943 -0.590707
...                       ...       ...
2023-01-20 09:36:00  0.366531  0.895185
2023-01-20 12:00:00 -0.286249 -0.719480
2023-01-20 14:24:00  0.453966 -1.502503
2023-01-20 16:48:00 -0.308673 -2.964529
2023-01-20 19:12:00  0.935547 -0.543496

[145 rows x 2 columns]
                            a         b
time                                   
2023-01-20 21:36:00 -1.831406  2.420415
2023-01-21 00:00:00  0.434884  1.628937
2023-01-21 02:00:00 -0.559572 -0.970150
2023-01-21 04:00:00  0.465080 -0.887696
2023-01-21 06:00:00 -1.560958  1.335784
...                       ...       ...
2023-01-25 16:00:00  0.446322  2.549328
2023-01-25 17:20:00 -0.806599 -0.405269
2023-01-25 18:40:00 -1.282635 -1.936838
2023-01-25 20:00:00  0.713820 -0.310484
2023-01-25 21:20:00  0.241645 -0.286223

[65 rows x 2 columns]
                                      a         b
time                                             
2023-01-05 21:36:00.000000000  0.072130  0.835111
2023-01-06 00:00:00.000000000  0.356871 -0.812941
2023-01-06 01:15:47.368421052  1.463303 -0.415357
2023-01-06 02:31:34.736842105 -1.188763 -0.612097
2023-01-06 03:47:22.105263157 -0.639752 -0.140791
...                                 ...       ...
2023-01-15 15:00:00.000000000  0.383394 -2.003522
2023-01-15 16:30:00.000000000  0.999824  1.604254
2023-01-15 18:00:00.000000000 -1.058536 -0.457699
2023-01-15 19:30:00.000000000 -0.125009  0.107880
2023-01-15 21:00:00.000000000  1.481456  1.309551

[129 rows x 2 columns]
                            a         b
time                                   
2023-01-15 22:30:00 -0.743588 -1.602260
2023-01-16 00:00:00 -1.251647  0.196776
2023-01-16 01:20:00 -1.601278  0.820528
2023-01-16 02:40:00 -0.794136 -0.393741
2023-01-16 04:00:00  0.439637  0.521167
...                       ...       ...
2023-01-20 09:36:00  0.366531  0.895185
2023-01-20 12:00:00 -0.286249 -0.719480
2023-01-20 14:24:00  0.453966 -1.502503
2023-01-20 16:48:00 -0.308673 -2.964529
2023-01-20 19:12:00  0.935547 -0.543496

[62 rows x 2 columns]

# Also, you can use `glimpse()` to print summary information about the splits

tscv.glimpse(y)

Split Number: 1
Train Shape: (127,), Forecast Shape: (65,)
Train Period: 2023-01-15 22:30:00 to 2023-01-25 21:20:00
Forecast Period: 2023-01-25 22:40:00 to 2023-01-30 19:12:00

Split Number: 2
Train Shape: (145,), Forecast Shape: (65,)
Train Period: 2023-01-11 00:00:00 to 2023-01-20 19:12:00
Forecast Period: 2023-01-20 21:36:00 to 2023-01-25 21:20:00

Split Number: 3
Train Shape: (129,), Forecast Shape: (62,)
Train Period: 2023-01-05 21:36:00 to 2023-01-15 21:00:00
Forecast Period: 2023-01-15 22:30:00 to 2023-01-20 19:12:00

# You can also plot the splits by calling `plot()` on the `TimeSeriesCV` instance with the `y` Pandas series

tscv.plot(y)
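
The periods printed by `glimpse()` above can be checked against the configuration (`train_size=10`, `forecast_horizon=5`, `gap=0`, stride falling back to 5). The sketch below copies the printed timestamps by hand and uses only the standard library; it is a sanity check on the example output, not a library feature.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# (train_start, train_end, forecast_start, forecast_end) for splits 1 to 3,
# copied from the glimpse() output.
printed = [
    ("2023-01-15 22:30:00", "2023-01-25 21:20:00", "2023-01-25 22:40:00", "2023-01-30 19:12:00"),
    ("2023-01-11 00:00:00", "2023-01-20 19:12:00", "2023-01-20 21:36:00", "2023-01-25 21:20:00"),
    ("2023-01-05 21:36:00", "2023-01-15 21:00:00", "2023-01-15 22:30:00", "2023-01-20 19:12:00"),
]
periods = [tuple(datetime.strptime(s, FMT) for s in row) for row in printed]

# Observed timestamps sit strictly inside each window, so spans are
# slightly shorter than the configured 10 and 5 days.
train_ok = all(tr_end - tr_start < timedelta(days=10) for tr_start, tr_end, _, _ in periods)
forecast_ok = all(fc_end - fc_start < timedelta(days=5) for _, _, fc_start, fc_end in periods)

# Backward mode: split 1 is the most recent, later splits step back in time.
forecast_starts = [p[2] for p in periods]
newest_first = forecast_starts == sorted(forecast_starts, reverse=True)

# With gap=0 and stride equal to the horizon, an older split's forecast
# period ends on the same observation as the next newer split's train period.
chained = all(older[3] == newer[1] for newer, older in zip(periods, periods[1:]))

print(train_ok, forecast_ok, newest_first, chained)
```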

Methods

Name	Description
glimpse	Prints summary information about the splits, focusing on the first two arrays.
plot	Plots the cross-validation folds on a single plot with folds on the y-axis and dates on the x-axis using filled Scatter traces.
split	Returns a generator of split arrays.

glimpse

TimeSeriesCV.glimpse(*arrays, time_series=None)

Prints summary information about the splits, focusing on the first two arrays.

Arguments:

*arrays: The arrays to split. Only the first one will be used for summary information.
time_series: The time series used for splitting. If not provided, the index of the first array is used. Default is None.

plot

TimeSeriesCV.plot(
    y,
    time_series=None,
    color_palette=None,
    bar_height=0.3,
    title='Time Series Cross-Validation Plot',
    x_lab='',
    y_lab='Fold',
    x_axis_date_labels=None,
    base_size=11,
    width=None,
    height=None,
    engine='plotly',
)

Plots the cross-validation folds on a single plot with folds on the y-axis and dates on the x-axis using filled Scatter traces.

Parameters

Name	Type	Description	Default
y	pd.Series	The target time series as a pandas Series.	required
time_series	pd.Series	The time series used for splitting. If not provided, the index of `y` is used. Default is None.	`None`
color_palette	Optional[Union[dict, list, str]]	The color palette to use for the train and forecast. If not provided, the default colors are used.	`None`
bar_height	float	The height of each bar in the plot. Default is 0.3.	`0.3`
title	str	The title of the plot. Default is “Time Series Cross-Validation Plot”.	`'Time Series Cross-Validation Plot'`
x_lab	str	The label for the x-axis. Default is an empty string.	`''`
y_lab	str	The label for the y-axis. Default is “Fold”.	`'Fold'`
x_axis_date_labels	str	The format of the date labels on the x-axis. Default is None.	`None`
base_size	float	The base font size for the plot. Default is 11.	`11`
width	Optional[int]	The width of the plot in pixels. Default is None.	`None`
height	Optional[int]	The height of the plot in pixels. Default is None.	`None`
engine	str	The plotting engine to use. Default is “plotly”.	`'plotly'`

split

TimeSeriesCV.split(
    *arrays,
    time_series=None,
    start_dt=None,
    end_dt=None,
    return_splitstate=False,
)

Returns a generator of split arrays.

Parameters

Name	Type	Description	Default
*arrays	TL	The arrays to split. Must have the same length as `time_series`.	`()`
time_series	SeriesLike[DateTimeLike]	The time series used to create boolean masks for splits. If not provided, the method will try to use the index of the first array (if it is a DataFrame or Series) as the time series.	`None`
start_dt	NullableDatetime	The start of the time period. If provided, it is used in place of `time_series.min()`.	`None`
end_dt	NullableDatetime	The end of the time period. If provided, it is used in place of `time_series.max()`.	`None`
return_splitstate	bool	Whether to return the `SplitState` instance for each split.	`False`

Returns:

A generator of tuples of arrays containing the training and forecast data. If split_limit is set, yields only up to split_limit splits.
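
The shape of each yielded tuple follows from the boolean masks described for `time_series`: every input array contributes a train piece followed by a forecast piece, which is why the loop in the Examples unpacks `X_train, X_forecast, y_train, y_forecast`. The sketch below illustrates that ordering with plain lists. `split_arrays` is a hypothetical helper, and the half-open windows are an assumption rather than a documented guarantee.

```python
from datetime import datetime, timedelta


def split_arrays(arrays, time_series, train_window, forecast_window):
    """Return (a1_train, a1_forecast, a2_train, a2_forecast, ...) for one split.

    Hypothetical helper: each window is a (start, end) pair treated as
    half-open, [start, end).
    """
    train_mask = [train_window[0] <= t < train_window[1] for t in time_series]
    forecast_mask = [forecast_window[0] <= t < forecast_window[1] for t in time_series]
    pieces = []
    for array in arrays:
        # The same masks are applied to every array, so rows stay aligned.
        pieces.append([v for v, keep in zip(array, train_mask) if keep])
        pieces.append([v for v, keep in zip(array, forecast_mask) if keep])
    return tuple(pieces)


# Ten daily observations with a two-column X and a scalar y.
times = [datetime(2023, 1, 1) + timedelta(days=i) for i in range(10)]
X = [[float(i), float(-i)] for i in range(10)]
y = list(range(10))

X_train, X_forecast, y_train, y_forecast = split_arrays(
    (X, y), times,
    train_window=(datetime(2023, 1, 2), datetime(2023, 1, 8)),
    forecast_window=(datetime(2023, 1, 8), datetime(2023, 1, 11)),
)
print(y_train, y_forecast)
```

Because every array is filtered with the same masks, each array must have the same length as `time_series`, as the parameter table requires.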