1 Time-Based Cross-Validation Using TimeSeriesCV and TimeSeriesCVSplitter
In this tutorial, you’ll learn how to use the TimeSeriesCV and TimeSeriesCVSplitter classes from pytimetk for time series cross-validation. As a working example, we’ll use the walmart_sales_df dataset, which contains 7 time series groups.
In Part 1, we’ll start by exploring the data, then move on to creating and visualizing time-based cross-validation splits. This prepares you for Part 2, where we integrate with Scikit-Learn.
In Part 2, we’ll implement time series cross-validation with Scikit-Learn, engineer features, train a random forest model, and visualize the results in Python. By following this process, you can ensure a robust evaluation of your time series models and gain insights into their predictive performance.
2 Part 1: Getting Started with TimeSeriesCV
TimeSeriesCV is used to generate many time series splits (or folds) for use in modeling and resampling with one or more time series groups contained in the data.
Using with Scikit Learn
If you want a drop-in replacement for Scikit-Learn’s TimeSeriesSplit, use TimeSeriesCVSplitter(), discussed next. The splitter uses TimeSeriesCV under the hood.
2.1 Step 1: Load and Explore the Data
First, let’s load the Walmart sales dataset and explore its structure:
# libraries
import pytimetk as tk
import pandas as pd
import numpy as np

# Import Data
walmart_sales_df = tk.load_dataset('walmart_sales_weekly')

walmart_sales_df['Date'] = pd.to_datetime(walmart_sales_df['Date'])

walmart_sales_df = walmart_sales_df[['id', 'Date', 'Weekly_Sales']]

walmart_sales_df.glimpse()
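2.2 Step 2: Visualize the Time Series Data
Next, visualize the weekly sales for each store. The snippet below is a minimal sketch using pytimetk’s plot_timeseries accessor; the plotly_dropdown argument is assumed to be available in your pytimetk version (it adds a dropdown for switching between series).
# Visualize weekly sales by store (sketch; assumes plot_timeseries
# supports the plotly_dropdown option in your pytimetk version)
walmart_sales_df \
    .groupby('id') \
    .plot_timeseries(
        date_column='Date',
        value_column='Weekly_Sales',
        plotly_dropdown=True,   # dropdown to switch between the 7 series
    )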
This will generate an interactive time series plot, allowing you to explore sales data for different stores using a dropdown.
2.3 Step 3: Set Up TimeSeriesCV for Cross-Validation
Now, let’s set up a time-based cross-validation scheme using TimeSeriesCV:
from pytimetk.crossvalidation import TimeSeriesCV

# Define parameters for TimeSeriesCV
tscv = TimeSeriesCV(
    frequency="weeks",
    train_size=52,          # Use 52 weeks for training
    forecast_horizon=12,    # Forecast 12 weeks ahead
    gap=0,                  # No gap between training and forecast sets
    stride=4,               # Move forward by 4 weeks after each split
    window="rolling",       # Use a rolling window
    mode="backward"         # Generate splits from end to start
)

# Glimpse the cross-validation splits
tscv.glimpse(
    walmart_sales_df['Weekly_Sales'],
    time_series=walmart_sales_df['Date']
)
The glimpse method provides a summary of each cross-validation fold, including the start and end dates of the training and forecast periods.
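You can also iterate over the folds directly. Below is a minimal sketch, assuming tscv.split() accepts the same array and time_series arguments as glimpse() and plot() and yields the training and forecast slices for each fold:
# Iterate over the folds (sketch; assumes split() mirrors the
# glimpse()/plot() call signature and yields train/forecast slices)
for train_values, forecast_values in tscv.split(
    walmart_sales_df['Weekly_Sales'],
    time_series=walmart_sales_df['Date']
):
    print(f"Train weeks: {len(train_values)}, Forecast weeks: {len(forecast_values)}")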
2.4 Step 4: Plot the Cross-Validation Splits
You can visualize how the data is split for training and testing:
# Plot the cross-validation splits
tscv.plot(
    walmart_sales_df['Weekly_Sales'],
    time_series=walmart_sales_df['Date']
)
This plot will show each fold, illustrating which weeks are used for training and which weeks are used for forecasting.
3 Part 2: Using TimeSeriesCVSplitter for Model Evaluation with Scikit Learn
When evaluating a model’s predictive performance on time series data, we need to split the data in a way that respects the order of time within the Scikit Learn framework. We use a custom splitter, TimeSeriesCVSplitter, from the pytimetk library to handle this.
3.1 Step 1: Setting Up the TimeSeriesCVSplitter
The TimeSeriesCVSplitter helps us divide our dataset into training and forecast sets in a rolling window fashion. Here’s how we configure it:
from pytimetk.crossvalidation import TimeSeriesCVSplitter
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Set up TimeSeriesCVSplitter
cv_splitter = TimeSeriesCVSplitter(
    time_series=walmart_sales_df['Date'],
    frequency="weeks",
    train_size=52*2,
    forecast_horizon=12,
    gap=0,
    stride=4,
    window="rolling",
    mode="backward",
    split_limit=5
)

# Visualize the TSCV strategy
cv_splitter.splitter.plot(
    walmart_sales_df['Weekly_Sales'],
    walmart_sales_df['Date']
)
The TimeSeriesCVSplitter creates multiple splits of the time series data, allowing us to validate the model across different periods. By visualizing the cross-validation strategy, we can see how the training and forecast sets are structured.
3.2 Step 2: Feature Engineering for Time Series Data
Effective feature engineering can significantly impact the performance of a time series model. Using pytimetk, we extract a variety of features from the Date column.
Generating Time Series Features
We use get_timeseries_signature to generate useful features, such as year, quarter, month, and day-of-week indicators.
# Prepare data for modeling

# Extract time series features from the 'Date' column
X_time_features = tk.get_timeseries_signature(walmart_sales_df['Date'])

# Select features to dummy encode
features_to_dummy = ['Date_quarteryear', 'Date_month_lbl', 'Date_wday_lbl', 'Date_am_pm']

# Dummy encode the selected features
X_time_dummies = pd.get_dummies(X_time_features[features_to_dummy], drop_first=True)

# Dummy encode the 'id' column
X_id_dummies = pd.get_dummies(walmart_sales_df['id'], prefix='store')

# Combine the time series features, dummy-encoded features, and the 'id' dummies
X = pd.concat([X_time_features, X_time_dummies, X_id_dummies], axis=1)

# Drop the original categorical columns that were dummy encoded
X = X.drop(columns=features_to_dummy).drop('Date', axis=1)

# Set the target variable
y = walmart_sales_df['Weekly_Sales'].values
3.3 Step 3: Model Training and Evaluation with Random Forest
For this example, we use RandomForestRegressor from scikit-learn to model the time series data. A random forest is a robust, ensemble-based model that can handle a wide range of regression tasks.
# Initialize the RandomForestRegressor model
model = RandomForestRegressor(
    n_estimators=100,   # Number of trees in the forest
    max_depth=None,     # Maximum depth of the trees (None means nodes are expanded until all leaves are pure)
    random_state=42     # Set a random state for reproducibility
)

# Evaluate the model using cross-validation scores
scores = cross_val_score(model, X, y, cv=cv_splitter, scoring='neg_mean_squared_error')

# Print cross-validation scores
print("Cross-Validation Scores (Negative MSE):", scores)
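Because scoring='neg_mean_squared_error' returns negative values, the results are often easier to read as RMSE. A quick conversion, reusing the numpy import from Step 1:
# Convert negative MSE to RMSE for easier interpretation
rmse_per_fold = np.sqrt(-scores)
print("RMSE per fold:", rmse_per_fold)
print("Mean RMSE:", rmse_per_fold.mean())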
Visualization is crucial for understanding how well the model predicts future values. We collect the actual and predicted values for each fold and combine them for easy plotting.
# Lists to store the combined data
combined_data = []

# Iterate through each fold and collect the data
for i, (train_index, test_index) in enumerate(cv_splitter.split(X, y), start=1):

    # Get the training and forecast data from the original DataFrame
    train_df = walmart_sales_df.iloc[train_index].copy()
    test_df = walmart_sales_df.iloc[test_index].copy()

    # Fit the model on the training data
    model.fit(X.iloc[train_index], y[train_index])

    # Predict on the test set
    y_pred = model.predict(X.iloc[test_index])

    # Add the actual and predicted values
    train_df['Actual'] = y[train_index]
    train_df['Predicted'] = None    # No predictions for training data
    train_df['Fold'] = i            # Indicate the current fold

    test_df['Actual'] = y[test_index]
    test_df['Predicted'] = y_pred   # Predictions for the test data
    test_df['Fold'] = i             # Indicate the current fold

    # Append both the training and forecast DataFrames to the combined data list
    combined_data.extend([train_df, test_df])

# Combine all the data into a single DataFrame
full_forecast_df = pd.concat(combined_data, ignore_index=True)

full_forecast_df = full_forecast_df[['id', 'Date', 'Actual', 'Predicted', 'Fold']]

full_forecast_df.glimpse()
To make the data easier to plot, we use pd.melt() to transform the Actual and Predicted columns into a long format.
# Melt the Actual and Predicted columns
melted_df = pd.melt(
    full_forecast_df,
    id_vars=['id', 'Date', 'Fold'],         # Columns to keep
    value_vars=['Actual', 'Predicted'],     # Columns to melt
    var_name='Type',                        # Name for the new column indicating 'Actual' or 'Predicted'
    value_name='Value'                      # Name for the new column with the values
)

melted_df["unique_id"] = "ID_" + melted_df['id'] + "-Fold_" + melted_df["Fold"].astype(str)

melted_df.glimpse()
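With the data in long format and a unique_id for each series/fold combination, you can compare actual and predicted values visually. Below is a minimal sketch using pytimetk’s plot_timeseries; the color_column and smooth arguments are assumed to be supported by your pytimetk version.
# Plot actual vs. predicted values for each series/fold combination
# (sketch; assumes plot_timeseries supports color_column and smooth)
melted_df \
    .groupby('unique_id') \
    .plot_timeseries(
        date_column='Date',
        value_column='Value',
        color_column='Type',   # separate colors for 'Actual' and 'Predicted'
        smooth=False,
    )
4 Conclusions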
This guide demonstrated how to implement time series cross-validation, engineer features, train a random forest model, and visualize the results in Python. By following this process, you can ensure a robust evaluation of your time series models and gain insights into their predictive performance. Happy modeling!
5 More Coming Soon…
We are in the early stages of development, but the potential for pytimetk in Python is already clear. 🐍
Please ⭐ us on GitHub (it takes 2 seconds and means a lot).