Simple Training/Test Set Splitting for Time Series
Source: R/rsample-time_series_split.R
time_series_split.Rd
time_series_split creates resample splits using time_series_cv() but returns only a single split. This is useful when creating a single train/test split.
Usage
time_series_split(
  data,
  date_var = NULL,
  initial = 5,
  assess = 1,
  skip = 1,
  lag = 0,
  cumulative = FALSE,
  slice = 1,
  point_forecast = FALSE,
  ...
)
Arguments
- data
A data frame.
- date_var
A date or date-time variable.
- initial
The number of samples used for analysis/modeling in the initial resample.
- assess
The number of samples used for each assessment resample.
- skip
An integer indicating how many (if any) additional resamples to skip to thin the total amount of data points in the analysis resample. See the example below.
- lag
A value to include a lag between the assessment and analysis set. This is useful if lagged predictors will be used during training and testing.
- cumulative
A logical. Should the analysis resample grow beyond the size specified by initial at each resample?
- slice
Returns a single slice from time_series_cv(). With slice = 1, the most recent slice is returned.
- point_forecast
Whether the testing set should be a single point forecast or a forecast horizon. The default, FALSE, returns a forecast horizon.
- ...
These dots are for future extensions and must be empty.
Value
An rsplit object that can be used with the training and testing functions to extract the data in each split.
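As a minimal sketch of how the returned object might be used, the snippet below assumes the rsample package is installed and calls its training() and testing() accessors on the m4_monthly data used in the Examples section. The object names splits, train_tbl, and test_tbl are illustrative only.

```r
library(dplyr)
library(timetk)
library(rsample)

# Monthly series used in the Examples section
m750 <- m4_monthly %>% filter(id == "M750")

# Single train/test split: last 3 years for testing, previous 10 years for training
splits <- time_series_split(
  m750,
  date_var = date,
  initial  = "10 years",
  assess   = "3 years"
)

train_tbl <- training(splits)  # analysis (training) rows
test_tbl  <- testing(splits)   # assessment (testing) rows

# Row counts of each piece
nrow(train_tbl)
nrow(test_tbl)
```

The printed summary for this call in the Examples section is <120/36/306>, so the two row counts should be 120 and 36.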
Details
Time-Based Specification
The initial, assess, skip, and lag variables can be specified as:
- Numeric: initial = 24
- Time-Based Phrases: initial = "2 years", if the data contains a date_var (date or datetime)
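The two specification styles can be compared side by side. The sketch below assumes the monthly m4_monthly series from the Examples section, where 10 years corresponds to 120 rows and 3 years to 36 rows; the object names split_numeric and split_phrase are illustrative, and rsample is assumed to be installed for training() and testing().

```r
library(dplyr)
library(timetk)
library(rsample)

m750 <- m4_monthly %>% filter(id == "M750")

# Numeric specification: sizes given as row counts (months for this series)
split_numeric <- time_series_split(
  m750,
  date_var = date,
  initial  = 120,
  assess   = 36
)

# Time-based phrases: sizes given as durations, resolved against the date column
split_phrase <- time_series_split(
  m750,
  date_var = date,
  initial  = "10 years",
  assess   = "3 years"
)

# Training and testing sizes for each specification
c(nrow(training(split_numeric)), nrow(testing(split_numeric)))
c(nrow(training(split_phrase)), nrow(testing(split_phrase)))
```

For a regularly spaced monthly series such as this one, both calls are expected to produce splits of the same size. On irregular data the two styles can differ, because a phrase spans a period of time rather than a fixed number of rows.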
Initial (Training Set) and Assess (Testing Set)
The main options, initial and assess, control the number of data points from the original data that are in the analysis (training set) and the assessment (testing set), respectively.
Skip
skip enables the function to not use every data point in the resamples. When skip = 1, the resampling data sets will increment by one position.
Example: Suppose that the rows of a data set are consecutive days. Using skip = 7 will make the analysis data set operate on weeks instead of days. The assessment set size is not affected by this option.
Lag
The lag parameter creates an overlap between the training and testing sets. This is needed when lagged predictors are used.
Cumulative vs Sliding Window
When cumulative = TRUE, the initial parameter is ignored and the analysis (training) set will grow as resampling continues while the assessment (testing) set size will always remain static.
When cumulative = FALSE, the initial parameter fixes the analysis (training) set and resampling occurs over a fixed window.
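The difference between the two modes can be checked by comparing set sizes. This is a sketch under the assumption that the m4_monthly series from the Examples section is used and that rsample is installed; the names sliding and expanding are illustrative.

```r
library(dplyr)
library(timetk)
library(rsample)

m750 <- m4_monthly %>% filter(id == "M750")

# Sliding window: the analysis set is fixed at `initial`
sliding <- time_series_split(
  m750,
  date_var   = date,
  initial    = "10 years",
  assess     = "3 years",
  cumulative = FALSE
)

# Cumulative: `initial` is ignored and the analysis set is allowed to grow
expanding <- time_series_split(
  m750,
  date_var   = date,
  initial    = "10 years",
  assess     = "3 years",
  cumulative = TRUE
)

# Analysis (training) sizes under each mode
nrow(training(sliding))
nrow(training(expanding))

# Assessment (testing) sizes under each mode
nrow(testing(sliding))
nrow(testing(expanding))
```

Per the description above, the training set from the cumulative split should be at least as large as the sliding-window one, while the testing sets should be the same size.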
Slice
This controls which slice is returned. If slice = 1, only the most recent slice will be returned.
See also
time_series_cv() and rsample::rolling_origin() - Functions used to create time series resample specifications.
Examples
library(dplyr)
library(timetk) # provides time_series_split() and the m4_monthly data
# DATA ----
m750 <- m4_monthly %>% dplyr::filter(id == "M750")
# Get the most recent 3 years as testing, and previous 10 years as training
m750 %>%
  time_series_split(initial = "10 years", assess = "3 years")
#> Using date_var: date
#> <Analysis/Assess/Total>
#> <120/36/306>
# Skip the most recent 3 years
m750 %>%
  time_series_split(
    initial = "10 years",
    assess = "3 years",
    skip = "3 years",
    slice = 2 # <- Returns 2nd slice, 3-years back
  )
#> Using date_var: date
#> <Analysis/Assess/Total>
#> <120/36/306>
# Add 1 year lag for testing overlap
m750 %>%
  time_series_split(
    initial = "10 years",
    assess = "3 years",
    skip = "3 years",
    slice = 2,
    lag = "1 year" # <- Overlaps training/testing by 1 year
  )
#> Using date_var: date
#> <Analysis/Assess/Total>
#> <120/48/306>