vignettes/parallel-processing.Rmd
Train modeltime models at scale with parallel processing
Fitting many time series models can be an expensive process. To speed up computation, modeltime now includes parallel processing support, which spreads the model fitting steps across multiple CPUs or clusters.
In this example, we go through a common hyperparameter tuning workflow that shows off the modeltime parallel processing integration and its support for workflowsets from the tidymodels ecosystem.
The modeltime package (>= 0.6.1) comes with parallel processing functionality:
- parallel_start() and parallel_stop() simplify the parallel processing setup.
- create_model_grid() helps generate parsnip model specs from dials parameter grids.
- modeltime_fit_workflowset() performs the initial fitting of many models in parallel using workflowsets from the tidymodels ecosystem.
- modeltime_refit() refits models in parallel.
- control_fit_workflowset() and control_refit() control the fitting and refitting of many models.
Let’s go through a common hyperparameter tuning workflow that shows off the modeltime parallel processing integration and support for workflowsets from the tidymodels ecosystem.
Load the following libraries.
# Machine Learning
library(modeltime)
library(tidymodels)
library(workflowsets)
# Core
library(tidyverse)
library(timetk)
I’ll set up this tutorial to use two (2) cores. modeltime includes parallel_start(), so we can simply supply the number of cores we’d like to use. To check how many physical cores are available, run parallel::detectCores(logical = FALSE).
We’ll use the walmart_sales_weekly dataset from timetk. It has seven (7) time series that represent weekly sales demand by department.
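Before moving on to the data, here is a minimal sketch of the two-core backend setup described above. The `n_cores` variable and the cap of two workers are assumptions made for this tutorial, not requirements of modeltime; the fallback to one worker covers platforms where `parallel::detectCores()` returns `NA`.

```r
library(modeltime)

# Count physical cores; detectCores() can return NA on some platforms
n_physical <- parallel::detectCores(logical = FALSE)

# Assumption: use at most 2 workers, and fall back to 1 if the count is unknown
n_cores <- if (is.na(n_physical)) 1L else min(2L, n_physical)

# Start the cluster and register it as the parallel backend
parallel_start(n_cores)

# Run this only after the last fitting step of the session:
# parallel_stop()
```

Leaving the backend running between steps means later fitting calls can reuse it instead of paying the cluster start-up cost again.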
dataset_tbl <- walmart_sales_weekly %>%
    select(id, Date, Weekly_Sales)
dataset_tbl %>%
    group_by(id) %>%
    plot_time_series(
        .date_var    = Date,
        .value       = Weekly_Sales,
        .facet_ncol  = 2,
        .interactive = FALSE
    )
Use time_series_split() to make a temporal split for all seven time series.
splits <- time_series_split(
    dataset_tbl,
    assess     = "6 months",
    cumulative = TRUE
)
splits %>%
    tk_time_series_cv_plan() %>%
    plot_time_series_cv_plan(Date, Weekly_Sales, .interactive = FALSE)
Make a preprocessing recipe that generates time series features.
recipe_spec_1 <- recipe(Weekly_Sales ~ ., data = training(splits)) %>%
    step_timeseries_signature(Date) %>%
    step_rm(Date) %>%
    step_normalize(Date_index.num) %>%
    step_zv(all_predictors()) %>%
    step_dummy(all_nominal_predictors(), one_hot = TRUE)
We’ll make 6 xgboost model specifications using boost_tree() and the “xgboost” engine. These will be combined with the recipe from the previous step using a workflow_set() in the next section.
We can vary the learn_rate parameter to see its effect on forecast error.
# XGBOOST MODELS
model_spec_xgb_1 <- boost_tree(learn_rate = 0.001) %>%
    set_engine("xgboost")
model_spec_xgb_2 <- boost_tree(learn_rate = 0.010) %>%
    set_engine("xgboost")
model_spec_xgb_3 <- boost_tree(learn_rate = 0.100) %>%
    set_engine("xgboost")
model_spec_xgb_4 <- boost_tree(learn_rate = 0.350) %>%
    set_engine("xgboost")
model_spec_xgb_5 <- boost_tree(learn_rate = 0.500) %>%
    set_engine("xgboost")
model_spec_xgb_6 <- boost_tree(learn_rate = 0.650) %>%
    set_engine("xgboost")
You may notice that this is a lot of repeated code just to adjust learn_rate. To simplify this process, we can use create_model_grid().
model_tbl <- tibble(
    learn_rate = c(0.001, 0.010, 0.100, 0.350, 0.500, 0.650)
) %>%
    create_model_grid(
        f_model_spec = boost_tree,
        engine_name  = "xgboost",
        mode         = "regression"
    )
model_tbl
#> # A tibble: 6 x 2
#> learn_rate .models
#> <dbl> <list>
#> 1 0.001 <spec[+]>
#> 2 0.01 <spec[+]>
#> 3 0.1 <spec[+]>
#> 4 0.35 <spec[+]>
#> 5 0.5 <spec[+]>
#> 6 0.65 <spec[+]>
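The grid does not have to be typed by hand. As a sketch of the dials route mentioned at the start, the following builds a regular grid with `grid_regular()` and `learn_rate()` from dials and passes it to `create_model_grid()`. The three levels are an arbitrary choice for illustration, and the resulting learn_rate values come from the dials default range rather than the six values used in this tutorial.

```r
library(modeltime)
library(tidymodels)

# Assumption: 3 levels of learn_rate, spaced over the dials default range
grid_tbl <- grid_regular(learn_rate(), levels = 3)

# One parsnip spec per grid row; `mode` is passed through to boost_tree()
model_grid_tbl <- grid_tbl %>%
    create_model_grid(
        f_model_spec = boost_tree,
        engine_name  = "xgboost",
        mode         = "regression"
    )

model_grid_tbl
```

The result has the same shape as model_tbl above: the grid columns plus a .models list column.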
We can extract the model list for use with our workflowset next. This gives the same result as placing the six manually generated model specs into a list().
model_list <- model_tbl$.models
model_list
#> [[1]]
#> Boosted Tree Model Specification (regression)
#>
#> Main Arguments:
#> learn_rate = 0.001
#>
#> Computational engine: xgboost
#>
#>
#> [[2]]
#> Boosted Tree Model Specification (regression)
#>
#> Main Arguments:
#> learn_rate = 0.01
#>
#> Computational engine: xgboost
#>
#>
#> [[3]]
#> Boosted Tree Model Specification (regression)
#>
#> Main Arguments:
#> learn_rate = 0.1
#>
#> Computational engine: xgboost
#>
#>
#> [[4]]
#> Boosted Tree Model Specification (regression)
#>
#> Main Arguments:
#> learn_rate = 0.35
#>
#> Computational engine: xgboost
#>
#>
#> [[5]]
#> Boosted Tree Model Specification (regression)
#>
#> Main Arguments:
#> learn_rate = 0.5
#>
#> Computational engine: xgboost
#>
#>
#> [[6]]
#> Boosted Tree Model Specification (regression)
#>
#> Main Arguments:
#> learn_rate = 0.65
#>
#> Computational engine: xgboost
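For comparison, a hand-built list could be assembled roughly as follows. This is a sketch, not the code used in this tutorial: `learn_rates` and `manual_model_list` are illustrative names, the regression mode is set explicitly to match the printed specs above, and `!!` injects each value so the spec stores the number rather than the symbol `lr`.

```r
library(parsnip)

learn_rates <- c(0.001, 0.010, 0.100, 0.350, 0.500, 0.650)

# One boost_tree() spec per learning rate; `!!lr` stores the value itself
manual_model_list <- lapply(learn_rates, function(lr) {
    spec <- boost_tree(mode = "regression", learn_rate = !!lr)
    set_engine(spec, "xgboost")
})

length(manual_model_list)
```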
With the workflow_set() function, we can combine the 6 xgboost models with the 1 recipe to return six (6) combinations of recipe and model specifications. These are currently untrained (unfitted).
model_wfset <- workflow_set(
    preproc = list(
        recipe_spec_1
    ),
    models = model_list,
    cross  = TRUE
)
model_wfset
#> # A workflow set/tibble: 6 x 4
#> wflow_id info option result
#> <chr> <list> <list> <list>
#> 1 recipe_boost_tree_1 <tibble [1 × 4]> <wrkflw__ > <list [0]>
#> 2 recipe_boost_tree_2 <tibble [1 × 4]> <wrkflw__ > <list [0]>
#> 3 recipe_boost_tree_3 <tibble [1 × 4]> <wrkflw__ > <list [0]>
#> 4 recipe_boost_tree_4 <tibble [1 × 4]> <wrkflw__ > <list [0]>
#> 5 recipe_boost_tree_5 <tibble [1 × 4]> <wrkflw__ > <list [0]>
#> 6 recipe_boost_tree_6 <tibble [1 × 4]> <wrkflw__ > <list [0]>
We can train each of the combinations in parallel.
Each fitting function in modeltime has a “control” function:
- modeltime_fit_workflowset() is paired with control_fit_workflowset()
- modeltime_refit() is paired with control_refit()
The control functions let the user control the verbosity (messages printed while training) and set up parallel processing. We can see the output when verbose = TRUE and allow_par = TRUE.
- allow_par: Whether the user has indicated that parallel processing should be used.
  - If the user has set up parallel processing externally, the clusters will be reused.
  - If the user has not set up parallel processing, the fitting (training) process will set up parallel processing internally and shut it down afterwards. Note that this is more expensive and usually costs around 10-15 seconds of setup time.
- verbose: Returns important messages showing the progress of the fitting operation.
- cores: The number of cores that the user has set up. Since we’ve already set up doParallel to use 2 cores, the control recognizes this.
- packages: The packages that will be sent to each of the workers.
control_fit_workflowset(
    verbose   = TRUE,
    allow_par = TRUE
)
#> workflowset control object
#> --------------------------
#> allow_par : TRUE
#> cores : 2
#> verbose : TRUE
#> packages : modeltime parsnip dplyr stats lubridate tidymodels timetk forcats stringr readr tidyverse yardstick workflowsets workflows tune tidyr tibble rsample recipes purrr modeldata infer ggplot2 dials scales broom graphics grDevices utils datasets methods base
We use modeltime_fit_workflowset() and control_fit_workflowset() together to train the unfitted workflowset in parallel.
model_parallel_tbl <- model_wfset %>%
    modeltime_fit_workflowset(
        data    = training(splits),
        control = control_fit_workflowset(
            verbose   = TRUE,
            allow_par = TRUE
        )
    )
#> Using existing parallel backend with 2 clusters (cores)...
#> Beginning Parallel Loop | 0.007 seconds
#> Finishing parallel backend. Clusters are remaining open. | 13.31 seconds
#> Close clusters by running: `parallel_stop()`.
#> Total time | 13.31 seconds
This returns a modeltime table.
model_parallel_tbl
#> # Modeltime Table
#> # A tibble: 6 x 3
#> .model_id .model .model_desc
#> <int> <list> <chr>
#> 1 1 <workflow> XGBOOST
#> 2 2 <workflow> XGBOOST
#> 3 3 <workflow> XGBOOST
#> 4 4 <workflow> XGBOOST
#> 5 5 <workflow> XGBOOST
#> 6 6 <workflow> XGBOOST
We can compare to a sequential backend. The parallel run gives a slight performance boost: 13.3 seconds in total versus 16.5 seconds sequentially. Note that this performance benefit increases with the size of the training task.
model_sequential_tbl <- model_wfset %>%
    modeltime_fit_workflowset(
        data    = training(splits),
        control = control_fit_workflowset(
            verbose   = TRUE,
            allow_par = FALSE
        )
    )
#> ℹ Fitting Model: 1
#> ✓ Model Successfully Fitted: 1
#> ℹ Fitting Model: 2
#> ✓ Model Successfully Fitted: 2
#> ℹ Fitting Model: 3
#> ✓ Model Successfully Fitted: 3
#> ℹ Fitting Model: 4
#> ✓ Model Successfully Fitted: 4
#> ℹ Fitting Model: 5
#> ✓ Model Successfully Fitted: 5
#> ℹ Fitting Model: 6
#> ✓ Model Successfully Fitted: 6
#> Total time | 16.547 seconds
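To put the two totals side by side, the arithmetic can be written out as below. The timings are copied from the printed output of this particular run (13.31 seconds in parallel, 16.547 seconds sequentially); the variable names are only for illustration, and the numbers will differ on other machines.

```r
# Total times reported by the two runs above (seconds)
parallel_secs   <- 13.31
sequential_secs <- 16.547

# Speedup factor and share of wall-clock time saved by the parallel run
speedup        <- sequential_secs / parallel_secs
time_saved_pct <- 100 * (1 - parallel_secs / sequential_secs)

round(speedup, 2)         # 1.24
round(time_saved_pct, 1)  # 19.6
```

With only six small models on two cores, the overhead of distributing work takes up much of the gain, which is consistent with the benefit growing as the training task gets larger.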
We can review the forecast accuracy. We can see that Model 5 has the lowest MAE.
model_parallel_tbl %>%
    modeltime_calibrate(testing(splits)) %>%
    modeltime_accuracy() %>%
    table_modeltime_accuracy(.interactive = FALSE)
Accuracy Table

.model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
---|---|---|---|---|---|---|---|---|
1 | XGBOOST | Test | 55572.50 | 98.52 | 1.63 | 194.17 | 66953.92 | 0.96 |
2 | XGBOOST | Test | 48819.23 | 86.15 | 1.43 | 151.49 | 58992.30 | 0.96 |
3 | XGBOOST | Test | 13426.89 | 21.69 | 0.39 | 25.06 | 17376.53 | 0.98 |
4 | XGBOOST | Test | 3699.94 | 8.94 | 0.11 | 8.68 | 5163.37 | 0.98 |
5 | XGBOOST | Test | 3296.74 | 7.30 | 0.10 | 7.37 | 5166.48 | 0.98 |
6 | XGBOOST | Test | 3612.70 | 8.15 | 0.11 | 8.24 | 5308.19 | 0.98 |
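The ranking can also be checked programmatically. The sketch below re-enters the MAE and RMSE columns from the table above into a plain data frame and looks up the best row for each metric. The `learn_rate` column is an assumption: it presumes the model ids follow the order of the learning-rate grid.

```r
# Values copied from the accuracy table above
accuracy_tbl <- data.frame(
    .model_id  = 1:6,
    learn_rate = c(0.001, 0.010, 0.100, 0.350, 0.500, 0.650),  # assumed grid order
    mae        = c(55572.50, 48819.23, 13426.89, 3699.94, 3296.74, 3612.70),
    rmse       = c(66953.92, 58992.30, 17376.53, 5163.37, 5166.48, 5308.19)
)

# Row with the lowest error for each metric
best_mae  <- accuracy_tbl[which.min(accuracy_tbl$mae), ]
best_rmse <- accuracy_tbl[which.min(accuracy_tbl$rmse), ]

best_mae$.model_id   # 5
best_rmse$.model_id  # 4
```

Model 5 wins on MAE, while the same lookup on RMSE points to Model 4 by a narrow margin (5163.37 versus 5166.48), so the choice of metric matters when the top models are this close.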
We can visualize the forecast.
model_parallel_tbl %>%
    modeltime_forecast(
        new_data    = testing(splits),
        actual_data = dataset_tbl,
        keep_data   = TRUE
    ) %>%
    group_by(id) %>%
    plot_modeltime_forecast(
        .facet_ncol  = 3,
        .interactive = FALSE
    )
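Refitting in parallel, listed among the features at the top, follows the same control pattern as the initial fit. A minimal sketch, assuming a modeltime table and a full dataset are available: `refit_in_parallel()` is a hypothetical helper written for this example, not a modeltime function.

```r
library(modeltime)
library(dplyr)

# Hypothetical helper: refit every model in a modeltime table on `full_data`,
# reusing the registered parallel backend when one is running
refit_in_parallel <- function(model_tbl, full_data) {
    model_tbl %>%
        modeltime_refit(
            data    = full_data,
            control = control_refit(verbose = TRUE, allow_par = TRUE)
        )
}

# Usage, assuming the objects from this tutorial exist in the session:
# refit_tbl <- refit_in_parallel(model_parallel_tbl, dataset_tbl)
# parallel_stop()  # close the clusters once all fitting is done
```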
We just showcased a simple hyperparameter tuning example using parallel processing. But this is a simple problem, and there’s a lot more to learn about time series.
You’re probably thinking, how am I ever going to learn time series forecasting? Here’s the solution that will save you years of struggling.
Become the forecasting expert for your organization
High-Performance Time Series Course
Time series is changing. Businesses now need 10,000+ time series forecasts every day. This is what I call a High-Performance Time Series Forecasting System (HPTSF) - Accurate, Robust, and Scalable Forecasting.
High-Performance Forecasting Systems will save companies by improving accuracy and scalability. Imagine what will happen to your career if you can provide your organization a “High-Performance Time Series Forecasting System” (HPTSF System).
I teach how to build an HPTSF System in my High-Performance Time Series Forecasting Course. You will learn:
- Modeltime: 30+ Models (Prophet, ARIMA, XGBoost, Random Forest, & many more)
- GluonTS (Competition Winners)
Become the Time Series Expert for your organization.