tk_anomaly_diagnostics() is the preprocessor for plot_anomaly_diagnostics(). It performs automatic anomaly detection for one or more time series groups.

tk_anomaly_diagnostics(
  .data,
  .date_var,
  .value,
  .frequency = "auto",
  .trend = "auto",
  .alpha = 0.05,
  .max_anomalies = 0.2,
  .message = TRUE
)

# S3 method for data.frame
tk_anomaly_diagnostics(
  .data,
  .date_var,
  .value,
  .frequency = "auto",
  .trend = "auto",
  .alpha = 0.05,
  .max_anomalies = 0.2,
  .message = TRUE
)

Arguments

.data

A tibble or data.frame with a time-based column

.date_var

A column containing either date or date-time values

.value

A column containing numeric values

.frequency

Controls the seasonal adjustment (removal of seasonality). Input can be "auto", a time-based definition (e.g. "2 weeks"), or a numeric number of observations per seasonal cycle (e.g. 10). Refer to tk_get_frequency().

.trend

Controls the trend component. For STL, .trend controls the sensitivity of the LOESS smoother that estimates the trend; what is left after the season and trend are removed is the remainder. Refer to tk_get_trend().

.alpha

Controls the width of the "normal" range. Lower values widen the range and are more conservative (fewer observations are flagged as anomalies), while higher values narrow the range and flag more observations.

.max_anomalies

The maximum proportion of observations permitted to be identified as anomalies (e.g. 0.2 allows at most 20%).

.message

A boolean. If TRUE, will output information related to automatic frequency and trend selection (if applicable).

Value

A tibble or data.frame with STL Decomposition Features (observed, season, trend, remainder, seasadj) and Anomaly Features (remainder_l1, remainder_l2, anomaly, recomposed_l1, and recomposed_l2)

Details

tk_anomaly_diagnostics() implements a 2-step process to detect outliers in time series.

Step 1: Detrend & Remove Seasonality using STL Decomposition

The decomposition separates the "season" and "trend" components from the "observed" values leaving the "remainder" for anomaly detection.

The user can control two parameters: frequency and trend.

  1. .frequency: Adjusts the "season" component that is removed from the "observed" values.

  2. .trend: Adjusts the trend window (the t.window parameter from stats::stl()) that is used.

The user may supply both .frequency and .trend as time-based durations (e.g. "6 weeks"), numeric values (e.g. 180), or "auto", which selects the frequency and/or trend from the scale of the time series using tk_time_scale_template().
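The decomposition in Step 1 can be sketched directly with stats::stl(). This is a minimal illustration on a synthetic weekly series, not the package's internal code: the data are simulated, the numeric frequency (13) and trend window (52) simply mirror the "auto" messages shown in the Examples, and the choices s.window = "periodic" and robust = TRUE are assumptions of this sketch.

```r
# Synthetic weekly series: linear trend + 13-week cycle + noise (simulated data).
set.seed(123)
n <- 143
t_idx <- seq_len(n)
observed <- 100 + 0.5 * t_idx + 10 * sin(2 * pi * t_idx / 13) + rnorm(n, sd = 2)

# frequency = 13 observations per seasonal cycle; t.window = 52 observations.
# s.window = "periodic" and robust = TRUE are assumptions, not confirmed defaults.
fit <- stats::stl(
  stats::ts(observed, frequency = 13),
  s.window = "periodic",
  t.window = 52,
  robust   = TRUE
)

seasonal  <- as.numeric(fit$time.series[, "seasonal"])
trend     <- as.numeric(fit$time.series[, "trend"])
remainder <- as.numeric(fit$time.series[, "remainder"])

# Columns named after the STL Decomposition Features listed under Value.
components <- data.frame(
  observed  = observed,
  season    = seasonal,
  trend     = trend,
  remainder = remainder,
  seasadj   = observed - seasonal  # seasonally adjusted series
)
head(components)
```

The three components add back up to the observed values, so the remainder is exactly what Step 2 inspects for anomalies.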

Step 2: Anomaly Detection

Once "trend" and "season" (seasonality) are removed, anomaly detection is performed on the "remainder". Anomalies are identified, and boundaries (recomposed_l1 and recomposed_l2) are determined.
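A small sketch of how the boundaries relate to the components. It assumes that the recomposed limits are obtained by adding the season and trend back onto the remainder limits (remainder_l1 and remainder_l2); that relationship is an assumption of this sketch, and all numbers are made up for illustration rather than taken from real output.

```r
# Illustrative values only (not real output from tk_anomaly_diagnostics()).
season    <- c(5, -3, 2)
trend     <- c(100, 101, 102)
remainder <- c(1, 40, -2)

# Hypothetical lower and upper limits on the remainder.
remainder_l1 <- -10
remainder_l2 <- 12

# Assumed recomposition: limits are shifted back onto the observed scale.
recomposed_l1 <- season + trend + remainder_l1
recomposed_l2 <- season + trend + remainder_l2

observed   <- season + trend + remainder
is_anomaly <- remainder < remainder_l1 | remainder > remainder_l2

data.frame(observed, recomposed_l1, recomposed_l2, is_anomaly)
```

Under this assumption, testing the remainder against its limits is equivalent to testing the observed values against the recomposed band, which is why the band can be drawn directly over the original series.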

The Anomaly Detection Method uses the interquartile range (IQR), i.e. the span between the 25th and 75th percentiles (+/-25 percentiles around the median).

IQR Adjustment, alpha parameter

With the default alpha = 0.05, the limits are established by expanding the 25/75 baseline by an IQR Factor of 3 (3X). The IQR Factor = 0.15 / alpha (hence 3X with alpha = 0.05):

  • To increase the IQR Factor controlling the limits, decrease the alpha, which makes it more difficult to be an outlier.

  • Increase alpha to make it easier to be an outlier.

  • The IQR outlier detection method is used in forecast::tsoutliers().

  • A similar outlier detection method is used by Twitter's AnomalyDetection package.

  • Both Twitter and Forecast tsoutliers methods have been implemented in Business Science's anomalize package.
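The alpha adjustment described above can be sketched as a small helper. iqr_limits() is a hypothetical function written for this illustration, not part of timetk; it applies the IQR Factor = 0.15 / alpha to the 25/75 baseline and ignores the .max_anomalies cap.

```r
# Hypothetical helper: limits = quartiles expanded by (0.15 / alpha) * IQR.
iqr_limits <- function(x, alpha = 0.05) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr_factor <- 0.15 / alpha          # 3X at the default alpha = 0.05
  spread <- iqr_factor * (q[2] - q[1])
  c(lower = q[1] - spread, upper = q[2] + spread)
}

# Toy remainder with one obvious outlier.
remainder <- c(1:9, 100)

limits <- iqr_limits(remainder)
is_anomaly <- remainder < limits[["lower"]] | remainder > limits[["upper"]]
which(is_anomaly)  # only the value 100 falls outside the limits

# Halving alpha doubles the IQR Factor (6X), which widens the limits.
wider <- iqr_limits(remainder, alpha = 0.025)
```

Comparing limits and wider shows the direction of the adjustment: a smaller alpha pushes both limits outward, so fewer points qualify as outliers.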

References

  1. Cleveland, R. B., Cleveland, W. S., McRae, J. E., and Terpenning, I. (1990). STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, Vol. 6, No. 1, pp. 3-73.

  2. Vallis, O. S., Hochenbaum, J., and Kejariwal, A. (2014). A Novel Technique for Long-Term Anomaly Detection in the Cloud. Twitter Inc.

Examples

library(dplyr)
library(timetk)

walmart_sales_weekly %>%
  filter(id %in% c("1_1", "1_3")) %>%
  group_by(id) %>%
  tk_anomaly_diagnostics(Date, Weekly_Sales)
#> frequency = 13 observations per 1 quarter
#> trend = 52 observations per 1 year
#> frequency = 13 observations per 1 quarter
#> trend = 52 observations per 1 year
#> # A tibble: 286 x 12
#> # Groups:   id [2]
#>    id    Date       observed season  trend remainder seasadj remainder_l1
#>    <fct> <date>        <dbl>  <dbl>  <dbl>     <dbl>   <dbl>        <dbl>
#>  1 1_1   2010-02-05   24924.   874. 19967.     4083.  24050.      -15981.
#>  2 1_1   2010-02-12   46039.  -698. 19835.    26902.  46737.      -15981.
#>  3 1_1   2010-02-19   41596. -1216. 19703.    23108.  42812.      -15981.
#>  4 1_1   2010-02-26   19404.  -821. 19571.      653.  20224.      -15981.
#>  5 1_1   2010-03-05   21828.   324. 19439.     2064.  21504.      -15981.
#>  6 1_1   2010-03-12   21043.   471. 19307.     1265.  20572.      -15981.
#>  7 1_1   2010-03-19   22137.   920. 19175.     2041.  21217.      -15981.
#>  8 1_1   2010-03-26   26229.   752. 19069.     6409.  25478.      -15981.
#>  9 1_1   2010-04-02   57258.   503. 18962.    37794.  56755.      -15981.
#> 10 1_1   2010-04-09   42961.  1132. 18855.    22974.  41829.      -15981.
#> # … with 276 more rows, and 4 more variables: remainder_l2 <dbl>,
#> #   anomaly <chr>, recomposed_l1 <dbl>, recomposed_l2 <dbl>