Reduce Forecast Error with Cleaned Anomalies

Forecasting error can often be reduced 20% to 50% by repairing anomolous data

Example - Reducing Forecasting Error by 32%

We can often get better forecast performance by cleaning anomalous data prior to forecasting. This is the perfect use case for integrating the clean_anomalies() function into your forecast workflow.

library(tidyverse)
library(tidyquant)
library(anomalize)
library(timetk)

# NOTE: timetk now has anomaly detection built in, which 
#  will get the new functionality going forward.
#  Use this script to prevent overwriting legacy anomalize:

anomalize <- anomalize::anomalize
plot_anomalies <- anomalize::plot_anomalies

Here is a short example with the tidyverse_cran_downloads dataset that comes with anomalize. We’ll see how we can reduce the forecast error by 32% simply by repairing anomalies.

tidyverse_cran_downloads
#> # A tibble: 6,375 × 3
#> # Groups:   package [15]
#>    date       count package
#>    <date>     <dbl> <chr>  
#>  1 2017-01-01   873 tidyr  
#>  2 2017-01-02  1840 tidyr  
#>  3 2017-01-03  2495 tidyr  
#>  4 2017-01-04  2906 tidyr  
#>  5 2017-01-05  2847 tidyr  
#>  6 2017-01-06  2756 tidyr  
#>  7 2017-01-07  1439 tidyr  
#>  8 2017-01-08  1556 tidyr  
#>  9 2017-01-09  3678 tidyr  
#> 10 2017-01-10  7086 tidyr  
#> # ℹ 6,365 more rows

Let’s take one package with some extreme events. We can hone in on lubridate, which has some outliers that we can fix.

tidyverse_cran_downloads %>%
  ggplot(aes(date, count, color = package)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ package, ncol = 3, scales = "free_y") +
  scale_color_viridis_d() +
  theme_tq()

Forecasting Lubridate Downloads

Let’s focus on downloads of the lubridate R package.

lubridate_tbl <- tidyverse_cran_downloads %>%
  ungroup() %>%
  filter(package == "lubridate")

First, we’ll make a function, forecast_mae(), that can take the input of both cleaned and uncleaned anomalies and calculate forecast error of future uncleaned anomalies.

The modeling function uses the following criteria:

Split the data into training and testing data that maintains the correct time-series sequence using the prop argument.
Models the daily time series of the training data set from observed (demonstrates no cleaning) or observed and cleaned (demonstrates improvement from cleaning). Specified by the col_train argument.
Compares the predictions to the observed values. Specified by the col_test argument.

forecast_mae <- function(data, col_train, col_test, prop = 0.8) {
  
  predict_expr <- enquo(col_train)
  actual_expr <- enquo(col_test)
  
  idx_train <- 1:(floor(prop * nrow(data)))
  
  train_tbl <- data %>% filter(row_number() %in% idx_train)
  test_tbl  <- data %>% filter(!row_number() %in% idx_train)
  
  # Model using training data (training) 
  model_formula <- as.formula(paste0(quo_name(predict_expr), " ~ index.num + year + quarter + month.lbl + day + wday.lbl"))
  
  model_glm <- train_tbl %>%
    tk_augment_timeseries_signature() %>%
    glm(model_formula, data = .)
  
  # Make Prediction
  suppressWarnings({
    # Suppress rank-deficit warning
    prediction <- predict(model_glm, newdata = test_tbl %>% tk_augment_timeseries_signature()) 
    actual     <- test_tbl %>% pull(!! actual_expr)
  })
  
  # Calculate MAE
  mae <- mean(abs(prediction - actual))
  
  return(mae)
  
}

Workflow for Cleaning Anomalies

We will use the anomalize workflow of decomposing (time_decompose()) and identifying anomalies (anomalize()). We use the function, clean_anomalies(), to add new column called “observed_cleaned” that is repaired by replacing all anomalies with the trend + seasonal components from the decompose operation. We can now experiment to see the improvment in forecasting performance by comparing a forecast made with “observed” versus “observed_cleaned”

lubridate_anomalized_tbl <- lubridate_tbl %>%
  time_decompose(count) %>%
  anomalize(remainder) %>%
  
  # Function to clean & repair anomalous data
  clean_anomalies()
#> Converting from tbl_df to tbl_time.
#> Auto-index message: index = date
#> frequency = 7 days
#> trend = 91 days

lubridate_anomalized_tbl
#> # A time tibble: 425 × 9
#> # Index:         date
#>    date       observed season trend remainder remainder_l1 remainder_l2 anomaly
#>    <date>        <dbl>  <dbl> <dbl>     <dbl>        <dbl>        <dbl> <chr>  
#>  1 2017-01-01      643 -2078. 2474.      246.       -3323.        3310. No     
#>  2 2017-01-02     1350   518. 2491.    -1659.       -3323.        3310. No     
#>  3 2017-01-03     2940  1117. 2508.     -685.       -3323.        3310. No     
#>  4 2017-01-04     4269  1220. 2524.      525.       -3323.        3310. No     
#>  5 2017-01-05     3724   865. 2541.      318.       -3323.        3310. No     
#>  6 2017-01-06     2326   356. 2558.     -588.       -3323.        3310. No     
#>  7 2017-01-07     1107 -1998. 2574.      531.       -3323.        3310. No     
#>  8 2017-01-08     1058 -2078. 2591.      545.       -3323.        3310. No     
#>  9 2017-01-09     2494   518. 2608.     -632.       -3323.        3310. No     
#> 10 2017-01-10     3237  1117. 2624.     -504.       -3323.        3310. No     
#> # ℹ 415 more rows
#> # ℹ 1 more variable: observed_cleaned <dbl>

Before Cleaning with anomalize

lubridate_anomalized_tbl %>%
  forecast_mae(col_train = observed, col_test = observed, prop = 0.8)
#> tk_augment_timeseries_signature(): Using the following .date_var variable: date
#> tk_augment_timeseries_signature(): Using the following .date_var variable: date
#> [1] 4054.053

After Cleaning with anomalize

lubridate_anomalized_tbl %>%
  forecast_mae(col_train = observed_cleaned, col_test = observed, prop = 0.8)
#> tk_augment_timeseries_signature(): Using the following .date_var variable: date
#> tk_augment_timeseries_signature(): Using the following .date_var variable: date
#> [1] 2755.297

32% Reduction in Forecast Error

This is approximately a 32% reduction in forecast error as measure by Mean Absolute Error (MAE).

(2755 - 4054) / 4054 
#> [1] -0.3204243

Interested in Learning Anomaly Detection?

Business Science offers two 1-hour courses on Anomaly Detection:

Learning Lab 18 - Time Series Anomaly Detection with anomalize
Learning Lab 17 - Anomaly Detection with H2O Machine Learning

Business Science