TimeSeriesCV is a subclass of TimeBasedSplit with default mode set to ‘backward’ and an optional split_limit to return the first n slices of time series cross-validation sets.
Parameters
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| frequency | str | The frequency (or time unit) of the time series. Must be one of “days”, “seconds”, “microseconds”, “milliseconds”, “minutes”, “hours”, “weeks”. These are the only valid values for the unit argument of timedelta from Python's datetime standard library module. | required |
| train_size | int | Defines the minimum number of time units required to be in the train set. | required |
| forecast_horizon | int | Specifies the number of time units to forecast. | required |
| gap | int | Sets the number of time units to skip between the end of the train set and the start of the forecast set. | required |
| stride | int | How many time units to move forward after each split. If None (or set to 0), the stride is equal to forecast_horizon. | 0 |
| window | str | The type of window to use, either “rolling” or “expanding”. | 'rolling' |
| mode | ModeType | The mode to use for cross-validation. Default is ‘backward’. | 'backward' |
| split_limit | int | The maximum number of splits to return. If not provided, all splits are returned. | None |
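Because frequency must be a valid timedelta unit, the integer parameters can be read as multiples of that unit. The sketch below uses only the standard library to illustrate that reading. `to_timedeltas` is a hypothetical helper, not part of pytimetk, and the fallback of stride to forecast_horizon simply follows the stride description above.

```python
from datetime import timedelta

VALID_FREQUENCIES = (
    "days", "seconds", "microseconds", "milliseconds", "minutes", "hours", "weeks",
)


def to_timedeltas(frequency, train_size, forecast_horizon, gap, stride=0):
    """Hypothetical helper: express each size as a timedelta in the given unit."""
    if frequency not in VALID_FREQUENCIES:
        raise ValueError(f"frequency must be one of {VALID_FREQUENCIES}")
    # A stride of None or 0 falls back to the forecast horizon, as documented.
    effective_stride = stride or forecast_horizon
    sizes = {
        "train": train_size,
        "forecast": forecast_horizon,
        "gap": gap,
        "stride": effective_stride,
    }
    # timedelta accepts the unit name as a keyword, e.g. timedelta(days=10).
    return {name: timedelta(**{frequency: n}) for name, n in sizes.items()}


deltas = to_timedeltas("days", train_size=10, forecast_horizon=5, gap=0, stride=0)
print(deltas["train"])   # 10 days, 0:00:00
print(deltas["stride"])  # 5 days, 0:00:00
```

With these inputs the effective stride is 5 days, which matches the stride_ = 5 shown in the object representation in the Examples section.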
Raises:
ValueError:
If frequency is not one of “days”, “seconds”, “microseconds”, “milliseconds”, “minutes”, “hours”, “weeks”.
If window is not one of “rolling” or “expanding”.
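If mode is not one of “forward” or “backward”.
If train_size or forecast_horizon are not strictly positive, or if gap or stride are negative (gap=0 and stride=0 are accepted, as in the example below; a stride of 0 falls back to forecast_horizon).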
TypeError:
If train_size, forecast_horizon, gap or stride are not of type int.
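The checks listed above can be mirrored in a few lines of plain Python. This is a sketch of the documented rules, not pytimetk's own code. `check_params` is a hypothetical name, the order of the checks is an assumption, and the sketch deliberately leaves out the bounds on gap and stride.

```python
VALID_FREQUENCIES = {
    "days", "seconds", "microseconds", "milliseconds", "minutes", "hours", "weeks",
}
VALID_WINDOWS = {"rolling", "expanding"}
VALID_MODES = {"forward", "backward"}


def check_params(frequency, train_size, forecast_horizon, gap, stride,
                 window="rolling", mode="backward"):
    """Hypothetical validator mirroring the documented ValueError/TypeError cases."""
    if frequency not in VALID_FREQUENCIES:
        raise ValueError(f"frequency must be one of {sorted(VALID_FREQUENCIES)}")
    if window not in VALID_WINDOWS:
        raise ValueError(f"window must be one of {sorted(VALID_WINDOWS)}")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")

    int_params = {
        "train_size": train_size,
        "forecast_horizon": forecast_horizon,
        "gap": gap,
        "stride": stride,
    }
    for name, value in int_params.items():
        # Compare the exact type so that floats and bools are rejected.
        if type(value) is not int:
            raise TypeError(f"{name} must be an int, got {type(value).__name__}")

    for name in ("train_size", "forecast_horizon"):
        if int_params[name] <= 0:
            raise ValueError(f"{name} must be strictly positive")


check_params("days", 10, 5, 0, 0)  # the values used in the Examples section

try:
    check_params("months", 10, 5, 0, 0)
except ValueError as exc:
    print(f"ValueError: {exc}")
```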
Examples:
import pandas as pd
import numpy as np
from pytimetk import TimeSeriesCV

RNG = np.random.default_rng(seed=42)

dates = pd.Series(pd.date_range("2023-01-01", "2023-01-31", freq="D"))
size = len(dates)

df = (
    pd.concat(
        [
            pd.DataFrame(
                {
                    "time": pd.date_range(start, end, periods=_size, inclusive="left"),
                    "a": RNG.normal(size=_size - 1),
                    "b": RNG.normal(size=_size - 1),
                }
            )
            for start, end, _size in zip(dates[:-1], dates[1:], RNG.integers(2, 24, size - 1))
        ]
    )
    .reset_index(drop=True)
    .assign(y=lambda t: t[["a", "b"]].sum(axis=1) + RNG.normal(size=t.shape[0]) / 25)
)

# Set index
df.set_index("time", inplace=True)

# Create an X dataframe and y series
X, y = df.loc[:, ["a", "b"]], df["y"]

# Initialize TimeSeriesCV with desired parameters
tscv = TimeSeriesCV(
    frequency="days",
    train_size=10,
    forecast_horizon=5,
    gap=0,
    stride=0,
    split_limit=3,  # Limiting to 3 splits
)

tscv
TimeSeriesCV(
frequency_ = days
train_size_ = 10
forecast_horizon_ = 5
gap_ = 0
stride_ = 5
window_ = rolling
)
# Create a split generator
splits = tscv.split(X, y)

for X_train, X_forecast, y_train, y_forecast in splits:
    print(X_train)
    print(X_forecast)
Prints summary information about the splits, focusing on the first two arrays.
Arguments:
*arrays: The arrays to split. Only the first one will be used for summary information.
time_series: The time series used for splitting. If not provided, the index of the first array is used. Default is None.
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| *arrays | | The arrays to split. Must have the same length as time_series. | () |
| time_series | SeriesLike[DateTimeLike] | The time series used to create boolean masks for splits. If not provided, the method will try to use the index of the first array (if it is a DataFrame or Series) as the time series. | None |
| start_dt | NullableDatetime | The start of the time period. If provided, it is used in place of time_series.min(). | None |
| end_dt | NullableDatetime | The end of the time period. If provided, it is used in place of time_series.max(). | None |
| return_splitstate | bool | Whether to return the SplitState instance for each split. | False |
Returns:
A generator of tuples of arrays containing the training and forecast data. If split_limit is set, yields only up to split_limit splits.
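To make the generator behaviour concrete, here is a standard-library sketch of how backward rolling or expanding windows could be laid out and then capped at a fixed number of splits. It is an illustration under stated assumptions, not the library's implementation: intervals are half-open [start, end), windows are anchored at the end of the period, and iteration stops once a full train window no longer fits. `backward_windows` is a hypothetical name, and the exact boundaries pytimetk produces may differ.

```python
from datetime import datetime, timedelta
from itertools import islice


def backward_windows(start_dt, end_dt, train_size, forecast_horizon,
                     gap=0, stride=0, window="rolling", frequency="days"):
    """Yield (train_start, train_end, forecast_start, forecast_end), newest first."""

    def unit(n):
        # Convert a count of time units into a timedelta.
        return timedelta(**{frequency: n})

    step = unit(stride or forecast_horizon)  # stride of 0 falls back to the horizon
    forecast_end = end_dt
    while True:
        forecast_start = forecast_end - unit(forecast_horizon)
        train_end = forecast_start - unit(gap)
        train_start = train_end - unit(train_size)
        if train_start < start_dt:
            break  # not enough history left for a full train window
        if window == "expanding":
            train_start = start_dt  # expanding windows always start at the beginning
        yield train_start, train_end, forecast_start, forecast_end
        forecast_end -= step


# Cap the generator at three splits, in the spirit of split_limit=3.
limited = list(
    islice(
        backward_windows(datetime(2023, 1, 1), datetime(2023, 1, 31),
                         train_size=10, forecast_horizon=5),
        3,
    )
)
for train_start, train_end, fc_start, fc_end in limited:
    print(train_start.date(), train_end.date(), fc_start.date(), fc_end.date())
```

Under these assumptions the January 2023 period admits four backward windows, and the cap keeps the three most recent ones.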