step_ts_clean() creates a specification of a recipe step that will clean outliers and impute time series data.
Arguments
- recipe
  A recipe object. The step will be added to the sequence of operations for this recipe.
- ...
  One or more selector functions to choose which variables are affected by the step. See selections() for more details. For the tidy method, these are not currently used.
- period
  A seasonal period to use during the transformation. If period = 1, linear interpolation is performed. If period > 1, a robust STL decomposition is first performed and a linear interpolation is applied to the seasonally adjusted data.
- lambda
  A Box Cox transformation parameter. If set to "auto", performs automated lambda selection.
- role
  Not used by this step since no new variables are created.
- trained
  A logical to indicate if the quantities for preprocessing have been estimated.
- lambdas_trained
  A named numeric vector of lambdas. This is NULL until computed by recipes::prep(). Note that, if the original data are integers, the mean will be converted to an integer to maintain the same data type.
- skip
  A logical. Should the step be skipped when the recipe is baked by bake.recipe()? While all operations are baked when prep.recipe() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.
- id
  A character string that is unique to this step to identify it.
- x
  A step_ts_clean object.
Value
An updated version of recipe with the new step added to the sequence of existing steps (if any). For the tidy method, a tibble with columns terms (the selectors or variables selected), lambda (the lambda estimate), and id.
Details
The step_ts_clean() function is designed specifically to handle time series using seasonal outlier detection methods implemented in the forecast R package.
Cleaning Outliers
Outliers are replaced with missing values using the following methods:
- Non-Seasonal (period = 1): Uses stats::supsmu()
- Seasonal (period > 1): Uses forecast::mstl() with robust = TRUE (robust STL decomposition) for seasonal series.
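The non-seasonal case can be illustrated with a minimal base R sketch. It is a simplified illustration, not the code used by timetk or forecast: the simulated series, the injected spike, and the interquartile-range rule with a multiplier of 3 are all assumptions made for this example.

```r
# Simplified illustration of non-seasonal outlier flagging.
# The data, the spike, and the 3 * IQR rule are assumptions for this sketch.
set.seed(123)
n <- 100
x <- 10 + 0.05 * seq_len(n) + rnorm(n, sd = 0.5)
x[50] <- 40  # inject an obvious outlier

# Estimate a smooth trend with Friedman's super smoother
fit <- stats::supsmu(seq_len(n), x)
resid <- x - fit$y

# Flag residuals that fall far outside the interquartile range
q <- stats::quantile(resid, probs = c(0.25, 0.75))
limits <- q + 3 * diff(q) * c(-1, 1)
is_outlier <- resid < limits[1] | resid > limits[2]

# Replace flagged points with NA so they can be imputed afterwards
x_clean <- x
x_clean[is_outlier] <- NA
which(is_outlier)
```

Once the flagged points are set to NA, they are handled like any other missing value by the imputation described in the following sections.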
Imputation using Linear Interpolation
Three circumstances cause strictly linear interpolation:
- Period is 1: With period = 1, seasonality cannot be interpreted, so linear interpolation is used.
- Number of Non-Missing Values is less than 2 Periods: Insufficient values exist to detect seasonality.
- Number of Total Values is less than 3 Periods: Insufficient values exist to detect seasonality.
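The three rules above can be sketched as a small check, followed by a plain linear interpolation in base R. The helper `use_linear_only()` is hypothetical and written only for this illustration; it is not a timetk or forecast function, and the exact thresholds used internally may differ from this reading of the rules.

```r
# Hypothetical helper (not part of timetk or forecast): returns TRUE when,
# following the three rules above, strictly linear interpolation applies.
use_linear_only <- function(x, period) {
  n_total   <- length(x)
  n_present <- sum(!is.na(x))
  period == 1 || n_present < 2 * period || n_total < 3 * period
}

# Plain linear interpolation of interior gaps with base R
x   <- c(1, 2, NA, NA, 5, 6, NA, 8)
idx <- seq_along(x)
ok  <- !is.na(x)

x_filled <- x
x_filled[!ok] <- stats::approx(idx[ok], x[ok], xout = idx[!ok])$y

use_linear_only(x, period = 1)
x_filled
```

Each gap is filled along the straight line joining its nearest observed neighbours, so no seasonal structure is used.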
Seasonal Imputation using Linear Interpolation
For seasonal series with period > 1, a robust Seasonal Trend Loess (STL) decomposition is first computed. Then a linear interpolation is applied to the seasonally adjusted data, and the seasonal component is added back.
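The sequence of decomposing, interpolating, and re-seasonalizing can be sketched in base R. This is a rough illustration under stated assumptions: it uses stats::stl() as a stand-in for forecast::mstl(), the series is simulated, and the preliminary linear fill (needed so that stl() can run on complete data) is a simplification chosen for this example. The helper `linear_fill()` is hypothetical.

```r
# Simulated monthly series with a trend and a seasonal cycle (assumed data)
period <- 12
t <- seq_len(6 * period)
x <- ts(10 + 0.1 * t + 3 * sin(2 * pi * t / period), frequency = period)
x[c(20, 21, 45)] <- NA
miss <- is.na(as.numeric(x))

# Hypothetical helper: linear interpolation at the positions flagged in `miss`
linear_fill <- function(v, miss) {
  idx <- seq_along(v)
  v[miss] <- stats::approx(idx[!miss], v[!miss], xout = idx[miss], rule = 2)$y
  v
}

# 1. Preliminary fill so the decomposition can be computed (an assumption)
x_init <- linear_fill(x, miss)

# 2. Robust STL decomposition (stand-in for forecast::mstl(robust = TRUE))
fit <- stats::stl(x_init, s.window = "periodic", robust = TRUE)
seasonal <- fit$time.series[, "seasonal"]

# 3. Interpolate the seasonally adjusted series at the missing positions
seas_adj <- linear_fill(x_init - seasonal, miss)

# 4. Add the seasonal component back
x_filled <- seas_adj + seasonal
x_filled[c(20, 21, 45)]
```

Interpolating on the seasonally adjusted series means the filled values follow the seasonal pattern instead of cutting straight across a peak or trough.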
Box Cox Transformation
In many circumstances, a Box Cox transformation can help, especially if the series is multiplicative, meaning the variance grows exponentially. A Box Cox transformation can be automated by setting lambda = "auto", or it can be specified by setting lambda to a numeric value.
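For reference, the standard one-parameter Box Cox transformation and its inverse can be written in a few lines of base R. The functions `box_cox()` and `inv_box_cox()` are hypothetical names used only for this sketch; they assume strictly positive data and are not the implementation used by the step, which presumably applies the transformation before cleaning and inverts it afterwards.

```r
# Textbook Box Cox transform for positive data (hypothetical helper names)
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Inverse transform, returning values to the original scale
inv_box_cox <- function(y, lambda) {
  if (lambda == 0) exp(y) else (lambda * y + 1)^(1 / lambda)
}

x <- c(1, 10, 100, 1000)

# lambda = 0 corresponds to a log transform
y <- box_cox(x, lambda = 0)

# Transforming and then inverting recovers the original values
x_back <- inv_box_cox(box_cox(x, lambda = 0.5), lambda = 0.5)
```

A lambda near 0 compresses large values strongly, which is why the transformation helps when the variance grows with the level of the series; lambda = 1 leaves the shape of the series unchanged.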
See also
Time Series Analysis:
- Engineered Features: step_timeseries_signature(), step_holiday_signature(), step_fourier()
- Diffs & Lags: step_diff(), recipes::step_lag()
- Smoothing: step_slidify(), step_smooth()
- Variance Reduction: step_box_cox()
- Imputation: step_ts_impute(), step_ts_clean()
- Padding: step_ts_pad()
Examples
library(dplyr)
library(tidyr)
library(recipes)
library(timetk)
# Get missing values
FANG_wide <- FANG %>%
    select(symbol, date, adjusted) %>%
    pivot_wider(names_from = symbol, values_from = adjusted) %>%
    pad_by_time()
#> .date_var is missing. Using: date
#> pad applied on the interval: day
FANG_wide
#> # A tibble: 1,459 × 5
#> date FB AMZN NFLX GOOG
#> <date> <dbl> <dbl> <dbl> <dbl>
#> 1 2013-01-02 28 257. 13.1 361.
#> 2 2013-01-03 27.8 258. 13.8 361.
#> 3 2013-01-04 28.8 259. 13.7 369.
#> 4 2013-01-05 NA NA NA NA
#> 5 2013-01-06 NA NA NA NA
#> 6 2013-01-07 29.4 268. 14.2 367.
#> 7 2013-01-08 29.1 266. 13.9 366.
#> 8 2013-01-09 30.6 266. 13.7 369.
#> 9 2013-01-10 31.3 265. 14 370.
#> 10 2013-01-11 31.7 268. 14.5 370.
#> # ℹ 1,449 more rows
# Clean outliers and impute missing values
recipe_box_cox <- recipe(~ ., data = FANG_wide) %>%
    step_ts_clean(FB, AMZN, NFLX, GOOG, period = 252) %>%
    prep()
recipe_box_cox %>% bake(FANG_wide)
#> # A tibble: 1,459 × 5
#> date FB AMZN NFLX GOOG
#> <date> <dbl> <dbl> <dbl> <dbl>
#> 1 2013-01-02 28 257. 13.1 361.
#> 2 2013-01-03 27.8 258. 13.8 361.
#> 3 2013-01-04 28.8 259. 13.7 369.
#> 4 2013-01-05 28.2 262. 14.1 365.
#> 5 2013-01-06 28.4 264. 14.6 366.
#> 6 2013-01-07 29.4 268. 14.2 367.
#> 7 2013-01-08 29.1 266. 13.9 366.
#> 8 2013-01-09 30.6 266. 13.7 369.
#> 9 2013-01-10 31.3 265. 14 370.
#> 10 2013-01-11 31.7 268. 14.5 370.
#> # ℹ 1,449 more rows
# Lambda parameter used during imputation process
recipe_box_cox %>% tidy(1)
#> # A tibble: 4 × 3
#> terms lambda id
#> <chr> <dbl> <chr>
#> 1 FB 0.912 ts_clean_RbUY2
#> 2 AMZN 0.557 ts_clean_RbUY2
#> 3 NFLX 0.532 ts_clean_RbUY2
#> 4 GOOG -1.00 ts_clean_RbUY2