Clean Outliers and Missing Data for Time Series

step_ts_clean creates a specification of a recipe step that will clean outliers and impute missing values in time series data.

Usage

step_ts_clean(
  recipe,
  ...,
  period = 1,
  lambda = "auto",
  role = NA,
  trained = FALSE,
  lambdas_trained = NULL,
  skip = FALSE,
  id = rand_id("ts_clean")
)

# S3 method for step_ts_clean
tidy(x, ...)

Arguments

recipe: A recipe object. The step will be added to the sequence of operations for this recipe.
...: One or more selector functions to choose which variables are affected by the step. See selections() for more details. For the tidy method, these are not currently used.
period: A seasonal period to use during the transformation. If period = 1, linear interpolation is performed. If period > 1, a robust STL decomposition is first performed and a linear interpolation is applied to the seasonally adjusted data.
lambda: A Box Cox transformation parameter. If set to "auto", lambda is selected automatically.
role: Not used by this step since no new variables are created.
trained: A logical to indicate if the quantities for preprocessing have been estimated.
lambdas_trained: A named numeric vector of lambdas. This is NULL until computed by recipes::prep(). Note that, if the original data are integers, the mean will be converted to an integer to maintain the same data type.
skip: A logical. Should the step be skipped when the recipe is baked by bake.recipe()? While all operations are baked when prep.recipe() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.
id: A character string that is unique to this step to identify it.
x: A step_ts_clean object.

Value

An updated version of recipe with the new step added to the sequence of existing steps (if any). For the tidy method, a tibble with columns terms (the selectors or variables selected), lambda (the lambda estimate), and id (the step identifier).

Details

The step_ts_clean() function is designed specifically to handle time series using the seasonal outlier detection methods implemented in the forecast R package.

Cleaning Outliers

Outliers are replaced with missing values using the following methods:

Non-Seasonal (period = 1): Uses stats::supsmu()
Seasonal (period > 1): Uses forecast::mstl() with robust = TRUE (robust STL decomposition)
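
The non-seasonal case can be sketched in base R as follows. This is an illustration only, not the package's implementation: the series is synthetic, and the rule of flagging residuals more than three interquartile ranges beyond the quartiles is an assumption about how outliers are identified, not something stated on this page.

```r
# Synthetic non-seasonal series with one injected spike (illustrative data)
idx <- 1:100
y <- 10 + 0.1 * idx + sin(idx / 5)
y[40] <- y[40] + 25

# Smooth the series with Friedman's super smoother and compute residuals
fit <- stats::supsmu(idx, y)
resid <- y - fit$y

# ASSUMPTION: flag residuals beyond 3 * IQR outside the quartiles
q <- stats::quantile(resid, probs = c(0.25, 0.75))
limits <- q + 3 * diff(q) * c(-1, 1)
outlier_idx <- which(resid < limits[1] | resid > limits[2])

# Replace flagged points with NA so that they can be imputed later
y_clean <- y
y_clean[outlier_idx] <- NA
```

The flagged positions become missing values, so the same imputation logic handles both outliers and gaps that were already in the data.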

Imputation using Linear Interpolation

Three circumstances cause strictly linear interpolation:

Period is 1: With period = 1, no seasonality can be estimated, so linear interpolation is used.
Number of non-missing values is less than 2 periods: Insufficient values exist to detect seasonality.
Number of total values is less than 3 periods: Insufficient values exist to detect seasonality.
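
A minimal base R sketch of these rules and of the linear fill itself is given below. The helper use_linear_only() is hypothetical (it is not part of timetk or forecast) and simply encodes the three conditions as written above; the data values are made up for illustration.

```r
# Hypothetical helper: TRUE when only linear interpolation would apply,
# following the three conditions listed in this section
use_linear_only <- function(y, period) {
  period <= 1 ||
    sum(!is.na(y)) < 2 * period ||
    length(y) < 3 * period
}

# Illustrative series with a two-point gap
y <- c(10, 12, NA, NA, 18, 20)
idx <- seq_along(y)
missing <- is.na(y)

# Linear interpolation between the nearest observed neighbours
y_filled <- y
y_filled[missing] <- stats::approx(
  x = idx[!missing],
  y = y[!missing],
  xout = idx[missing]
)$y
```

With these values, the gap between 12 and 18 is filled on a straight line, and use_linear_only(y, period = 1) is TRUE because the period is 1.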

Seasonal Imputation using Linear Interpolation

For seasonal series with period > 1, a robust Seasonal Trend Loess (STL) decomposition is first computed. Then a linear interpolation is applied to the seasonally adjusted data, and the seasonal component is added back.
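
The seasonal procedure can be approximated in base R as sketched below. This is a simplified stand-in rather than the package's code: it uses stats::stl() instead of forecast::mstl(), and because stl() does not accept missing values, it assumes a preliminary linear fill before the decomposition. The series and period are synthetic.

```r
set.seed(1)
period <- 12
n <- 72
idx <- seq_len(n)

# Synthetic monthly-style series: trend + seasonality + noise, with gaps
y <- 50 + 0.5 * idx + 5 * sin(2 * pi * idx / period) + rnorm(n, sd = 0.5)
y[c(20, 21, 45)] <- NA
missing <- is.na(y)

# ASSUMPTION: preliminary linear fill so the decomposition sees no NAs
y_prelim <- y
y_prelim[missing] <- stats::approx(idx[!missing], y[!missing], xout = idx[missing])$y

# Robust STL decomposition to estimate the seasonal component
fit <- stats::stl(stats::ts(y_prelim, frequency = period),
                  s.window = "periodic", robust = TRUE)
seasonal <- as.numeric(fit$time.series[, "seasonal"])

# Interpolate the seasonally adjusted series, then add seasonality back
adjusted <- y - seasonal
adjusted[missing] <- stats::approx(idx[!missing], adjusted[!missing],
                                   xout = idx[missing])$y
y_imputed <- adjusted + seasonal
```

Interpolating on the seasonally adjusted scale lets the filled values follow the seasonal pattern instead of cutting straight across it.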

Box Cox Transformation

In many circumstances, a Box Cox transformation can help, especially if the series is multiplicative, meaning the variance grows exponentially. The transformation can be automated by setting lambda = "auto", or a specific value can be supplied by setting lambda to a numeric value.
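
For reference, the standard Box Cox transformation and its inverse can be written in a few lines of base R. The function names box_cox() and inv_box_cox() are hypothetical helpers for illustration, not functions exported by timetk, and the sketch assumes strictly positive input.

```r
# Box Cox transform: log(x) when lambda is 0, otherwise (x^lambda - 1) / lambda
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Inverse transform, mapping values back to the original scale
inv_box_cox <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

# Illustrative positive series and a fixed lambda
x <- c(28.0, 27.8, 28.8, 29.4, 29.1)
z <- box_cox(x, lambda = 0.5)
x_back <- inv_box_cox(z, lambda = 0.5)
```

Cleaning on the transformed scale and then inverting keeps the imputed values on the original scale of the data.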

Examples


library(dplyr)
library(tidyr)
library(recipes)
library(timetk)  # provides FANG, pad_by_time(), and step_ts_clean()

# Get missing values
FANG_wide <- FANG %>%
    select(symbol, date, adjusted) %>%
    pivot_wider(names_from = symbol, values_from = adjusted) %>%
    pad_by_time()
#> .date_var is missing. Using: date
#> pad applied on the interval: day

FANG_wide
#> # A tibble: 1,459 × 5
#>    date          FB  AMZN  NFLX  GOOG
#>    <date>     <dbl> <dbl> <dbl> <dbl>
#>  1 2013-01-02  28    257.  13.1  361.
#>  2 2013-01-03  27.8  258.  13.8  361.
#>  3 2013-01-04  28.8  259.  13.7  369.
#>  4 2013-01-05  NA     NA   NA     NA 
#>  5 2013-01-06  NA     NA   NA     NA 
#>  6 2013-01-07  29.4  268.  14.2  367.
#>  7 2013-01-08  29.1  266.  13.9  366.
#>  8 2013-01-09  30.6  266.  13.7  369.
#>  9 2013-01-10  31.3  265.  14    370.
#> 10 2013-01-11  31.7  268.  14.5  370.
#> # ℹ 1,449 more rows

# Apply Imputation
recipe_box_cox <- recipe(~ ., data = FANG_wide) %>%
    step_ts_clean(FB, AMZN, NFLX, GOOG, period = 252) %>%
    prep()

recipe_box_cox %>% bake(FANG_wide)
#> # A tibble: 1,459 × 5
#>    date          FB  AMZN  NFLX  GOOG
#>    <date>     <dbl> <dbl> <dbl> <dbl>
#>  1 2013-01-02  28    257.  13.1  361.
#>  2 2013-01-03  27.8  258.  13.8  361.
#>  3 2013-01-04  28.8  259.  13.7  369.
#>  4 2013-01-05  28.2  262.  14.1  365.
#>  5 2013-01-06  28.4  264.  14.6  366.
#>  6 2013-01-07  29.4  268.  14.2  367.
#>  7 2013-01-08  29.1  266.  13.9  366.
#>  8 2013-01-09  30.6  266.  13.7  369.
#>  9 2013-01-10  31.3  265.  14    370.
#> 10 2013-01-11  31.7  268.  14.5  370.
#> # ℹ 1,449 more rows

# Lambda parameter used during imputation process
recipe_box_cox %>% tidy(1)
#> # A tibble: 4 × 3
#>   terms lambda id            
#>   <chr>  <dbl> <chr>         
#> 1 FB     0.912 ts_clean_RbUY2
#> 2 AMZN   0.557 ts_clean_RbUY2
#> 3 NFLX   0.532 ts_clean_RbUY2
#> 4 GOOG  -1.00  ts_clean_RbUY2