Forecasting with modeltime.h2o
made easy! This short tutorial shows how you can use:
H2O AutoML for forecasting, implemented via automl_reg(). This function trains and cross-validates multiple machine learning and deep learning models (XGBoost GBM, GLMs, Random Forest, GBMs, …) and then trains two Stacked Ensemble models: one of all the models, and one of only the best model of each kind. Finally, the best model is selected based on a stopping metric. And we take care of all this for you!
Save & Load Models functionality to ensure the persistence of your models.
Next, we load the walmart_sales_weekly data, which contains 7 time series, and visualize them using the timetk::plot_time_series() function.
data_tbl <- walmart_sales_weekly %>%
select(id, Date, Weekly_Sales)
data_tbl %>%
group_by(id) %>%
plot_time_series(
.date_var = Date,
.value = Weekly_Sales,
.facet_ncol = 2,
.smooth = FALSE,
.interactive = FALSE
)
Then, we split the data with the time_series_split() function to generate a training dataset and a testing one.
splits <- time_series_split(data_tbl, assess = "3 month", cumulative = TRUE)
Next, we create a preprocessing recipe that adds a time series signature (calendar features derived from the Date column) and bake it to produce the final training and testing datasets.
recipe_spec <- recipe(Weekly_Sales ~ ., data = training(splits)) %>%
step_timeseries_signature(Date)
train_tbl <- training(splits) %>% bake(prep(recipe_spec), .)
test_tbl <- testing(splits) %>% bake(prep(recipe_spec), .)
To use modeltime.h2o correctly, you must first connect to an H2O cluster through the h2o.init() function. You can find more information on how to set up the cluster by typing ?h2o.init or by visiting the official site.
# Initialize H2O
h2o.init(
nthreads = -1,
ip = 'localhost',
port = 54321
)
#>
#> H2O is not running yet, starting it now...
#>
#> Note: In case of errors look at the following log files:
#> /var/folders/st/s5vwv9pd27g7z2sffwlqmtv80000gn/T//RtmpOAzLqf/file60373f953c09/h2o_mdancho_started_from_r.out
#> /var/folders/st/s5vwv9pd27g7z2sffwlqmtv80000gn/T//RtmpOAzLqf/file60375e98937b/h2o_mdancho_started_from_r.err
#>
#>
#> Starting H2O JVM and connecting: ... Connection successful!
#>
#> R is connected to the H2O cluster:
#> H2O cluster uptime: 2 seconds 616 milliseconds
#> H2O cluster timezone: America/New_York
#> H2O data parsing timezone: UTC
#> H2O cluster version: 3.32.0.1
#> H2O cluster version age: 5 months and 27 days !!!
#> H2O cluster name: H2O_started_from_R_mdancho_ueg906
#> H2O cluster total nodes: 1
#> H2O cluster total memory: 8.00 GB
#> H2O cluster total cores: 12
#> H2O cluster allowed cores: 12
#> H2O cluster healthy: TRUE
#> H2O Connection ip: localhost
#> H2O Connection port: 54321
#> H2O Connection proxy: NA
#> H2O Internal Security: FALSE
#> H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
#> R Version: R version 4.0.2 (2020-06-22)
# Optional - Set H2O No Progress to remove progress bars
h2o.no_progress()
Now comes the fun part! We define our model specification with the automl_reg() function and pass the engine-specific arguments through set_engine():
model_spec <- automl_reg(mode = 'regression') %>%
set_engine(
engine = 'h2o',
max_runtime_secs = 5,
max_runtime_secs_per_model = 3,
max_models = 3,
nfolds = 5,
exclude_algos = c("DeepLearning"),
verbosity = NULL,
seed = 786
)
model_spec
#> H2O AutoML Model Specification (regression)
#>
#> Engine-Specific Arguments:
#> max_runtime_secs = 5
#> max_runtime_secs_per_model = 3
#> max_models = 3
#> nfolds = 5
#> exclude_algos = c("DeepLearning")
#> verbosity = NULL
#> seed = 786
#>
#> Computational engine: h2o
Next, let’s train the model with fit()!
model_fitted <- model_spec %>%
fit(Weekly_Sales ~ ., data = train_tbl)
#> model_id mean_residual_deviance
#> 1 StackedEnsemble_AllModels_AutoML_20210405_115904 38501320
#> 2 XGBoost_3_AutoML_20210405_115904 42212631
#> 3 XGBoost_2_AutoML_20210405_115904 58816361
#> 4 XGBoost_1_AutoML_20210405_115904 2369268925
#> rmse mse mae rmsle
#> 1 6204.943 38501320 3835.035 0.1444437
#> 2 6497.125 42212631 4096.153 0.1501578
#> 3 7669.183 58816361 4875.643 0.1673720
#> 4 48675.137 2369268925 40066.038 1.2850293
#>
#> [4 rows x 6 columns]
model_fitted
#> parsnip model object
#>
#> Fit time: 9s
#>
#> H2O AutoML - Stackedensemble
#> --------
#> Model: Model Details:
#> ==============
#>
#> H2ORegressionModel: stackedensemble
#> Model ID: StackedEnsemble_AllModels_AutoML_20210405_115904
#> Number of Base Models: 3
#>
#> Base Models (count by algorithm type):
#>
#> xgboost
#> 3
#>
#> Metalearner:
#>
#> Metalearner algorithm: glm
#> Metalearner cross-validation fold assignment:
#> Fold assignment scheme: AUTO
#> Number of folds: 5
#> Fold column: NULL
#> Metalearner hyperparameters:
#>
#>
#> H2ORegressionMetrics: stackedensemble
#> ** Reported on training data. **
#>
#> MSE: 22728954
#> RMSE: 4767.489
#> MAE: 3027.272
#> RMSLE: 0.1026567
#> Mean Residual Deviance : 22728954
#>
#>
#>
#> H2ORegressionMetrics: stackedensemble
#> ** Reported on cross-validation data. **
#> ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
#>
#> MSE: 38501320
#> RMSE: 6204.943
#> MAE: 3835.035
#> RMSLE: 0.1444437
#> Mean Residual Deviance : 38501320
The best models are stored in the leaderboard, and by default the model with the best value of the sorting metric is selected (this behavior can be controlled with the sort_metric parameter passed through set_engine(); for more information see ?h2o.automl. By default, the leaderboard is sorted by mean_residual_deviance). To list the models created during training and stored in the leaderboard, you can use the automl_leaderboard() function as follows:
automl_leaderboard(model_fitted)
#> # A tibble: 4 x 6
#> model_id mean_residual_devi… rmse mse mae rmsle
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 StackedEnsemble_AllModels_Au… 38501320. 6205. 3.85e7 3835. 0.144
#> 2 XGBoost_3_AutoML_20210405_11… 42212631. 6497. 4.22e7 4096. 0.150
#> 3 XGBoost_2_AutoML_20210405_11… 58816361. 7669. 5.88e7 4876. 0.167
#> 4 XGBoost_1_AutoML_20210405_11… 2369268925. 48675. 2.37e9 40066. 1.29
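As a sketch, if you wanted the leaderboard (and therefore the default model selection) ordered by RMSE instead of the default mean residual deviance, you could pass sort_metric through set_engine(). The "RMSE" value below follows ?h2o.automl; treat the exact spelling as an assumption to verify against your H2O version.

```r
# Hypothetical spec (not run in this tutorial): sort the AutoML
# leaderboard by RMSE instead of the default mean residual deviance.
model_spec_rmse <- automl_reg(mode = 'regression') %>%
  set_engine(
    engine      = 'h2o',
    max_models  = 3,
    nfolds      = 5,
    sort_metric = "RMSE",   # leaderboard sort metric (see ?h2o.automl)
    seed        = 786
  )
```

With this spec, the first row of automl_leaderboard() would be the model with the lowest RMSE, and that model would be selected by default.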
To change the selected model (remember: by default it is the first one in the leaderboard, sorted according to the chosen metric), you can use the automl_update_model() function as follows (do not run this example, as the model ID will differ on your machine due to randomness in the training process):
automl_update_model(model_fitted, model_id = "StackedEnsemble_AllModels_AutoML_20210319_204825")
Finally, we predict()
on the test dataset:
predict(model_fitted, test_tbl)
#> # A tibble: 84 x 1
#> .pred
#> <dbl>
#> 1 19438.
#> 2 31548.
#> 3 37335.
#> 4 40876.
#> 5 77931.
#> 6 79811.
#> 7 132721.
#> 8 19065.
#> 9 36439.
#> 10 36364.
#> # … with 74 more rows
Once we have our fitted model, we can follow the Modeltime Workflow:
1. Add fitted models to a Model Table.
2. Calibrate the models to a testing set.
3. Perform Testing Set Forecast & Accuracy Evaluation.
4. Refit the models to the Full Dataset & Forecast Forward.
First, we create the model table:
modeltime_tbl <- modeltime_table(
model_fitted
)
modeltime_tbl
#> # Modeltime Table
#> # A tibble: 1 x 3
#> .model_id .model .model_desc
#> <int> <list> <chr>
#> 1 1 <fit[+]> H2O AUTOML - STACKEDENSEMBLE
Next, we calibrate to the testing set and visualize the forecasts:
modeltime_tbl %>%
modeltime_calibrate(test_tbl) %>%
modeltime_forecast(
new_data = test_tbl,
actual_data = data_tbl,
keep_data = TRUE
) %>%
group_by(id) %>%
plot_modeltime_forecast(
.facet_ncol = 2,
.interactive = FALSE
)
Before refitting on the full dataset, let’s prepare our data. We create data_prepared_tbl, which represents the complete dataset (the union of train and test) with the variables created by the recipe named recipe_spec. Then, we create future_prepared_tbl, which extends the dataset one year into the future and contains the same required variables.
data_prepared_tbl <- bind_rows(train_tbl, test_tbl)
future_tbl <- data_prepared_tbl %>%
group_by(id) %>%
future_frame(.length_out = "1 year") %>%
ungroup()
future_prepared_tbl <- bake(prep(recipe_spec), future_tbl)
Finally, once we have refitted, we forecast on the future dataset and visualize the results.
refit_tbl <- modeltime_tbl %>%
modeltime_refit(data_prepared_tbl)
#> model_id mean_residual_deviance
#> 1 StackedEnsemble_AllModels_AutoML_20210405_115918 45119504
#> 2 XGBoost_2_AutoML_20210405_115918 128393878
#> 3 XGBoost_1_AutoML_20210405_115918 156236498
#> 4 XGBoost_3_AutoML_20210405_115918 265914814
#> rmse mse mae rmsle
#> 1 6717.105 45119504 4409.966 0.1535976
#> 2 11331.102 128393878 7725.048 0.1921754
#> 3 12499.460 156236498 8734.104 0.2435574
#> 4 16306.895 265914814 11719.747 0.2776191
#>
#> [4 rows x 6 columns]
refit_tbl %>%
modeltime_forecast(
new_data = future_prepared_tbl,
actual_data = data_prepared_tbl,
keep_data = TRUE
) %>%
group_by(id) %>%
plot_modeltime_forecast(
.facet_ncol = 2,
.interactive = FALSE
)
We could likely do better than this if we trained longer, but this is really good for a quick example!
H2O models need to be “serialized” (a fancy word for saved to a directory that contains the recipe for recreating the models). To save a model, use save_h2o_model().
model_fitted %>%
save_h2o_model(path = "../model_fitted", overwrite = TRUE)
You can reload the model into R using load_h2o_model().
model_h2o <- load_h2o_model(path = "../model_fitted/")
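As a quick sanity check (assuming test_tbl is still in your session and the H2O cluster is still running), the reloaded model should produce the same predictions as the original fitted object:

```r
# Predictions from the reloaded model should match those from the
# original fitted model on the same data.
predict(model_h2o, test_tbl)
```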
Finally, once we have saved the specific models that we want to keep, we shut down the H2O cluster.
h2o.shutdown(prompt = FALSE)
Need to learn high-performance time series forecasting?
Become the forecasting expert for your organization
High-Performance Time Series Course
Time series is changing. Businesses now need 10,000+ time series forecasts every day. This is what I call a High-Performance Time Series Forecasting System (HPTSF) - Accurate, Robust, and Scalable Forecasting.
High-Performance Forecasting Systems will save companies by improving accuracy and scalability. Imagine what will happen to your career if you can provide your organization a “High-Performance Time Series Forecasting System” (HPTSF System).
I teach how to build a HPTFS System in my High-Performance Time Series Forecasting Course. You will learn:
Modeltime - 30+ Models (Prophet, ARIMA, XGBoost, Random Forest, & many more)
GluonTS (Competition Winners)
Become the Time Series Expert for your organization.