Detect anomalies using the tidyverse

anomalize(data, target, method = c("iqr", "gesd"), alpha = 0.05,
  max_anoms = 0.2, verbose = FALSE)

Arguments

data

A tibble or tbl_time object.

target

The column to apply the function to.

method

The anomaly detection method. One of "iqr" or "gesd". The IQR method is faster at the expense of possibly not being quite as accurate. The GESD method has the best properties for outlier detection, but is loop-based and therefore a bit slower.

alpha

Controls the width of the "normal" range. Lower values are more conservative: they widen the range, so fewer observations are flagged as anomalies. Higher values narrow the range, so more observations are flagged.

max_anoms

The maximum proportion of observations permitted to be identified as anomalies (the default of 0.2 allows up to 20%).

verbose

A boolean. If TRUE, will return a list containing useful information about the anomalies. If FALSE, just returns the data expanded with the anomalies and the lower (l1) and upper (l2) bounds.

Value

Returns a tibble / tbl_time object or list depending on the value of verbose.

Details

The anomalize() function is used to detect outliers in a distribution with no trend or seasonality present. The returned data gains three columns: "remainder_l1" (lower limit for anomalies), "remainder_l2" (upper limit for anomalies), and "anomaly" (Yes/No).

Use time_decompose() to decompose a time series prior to performing anomaly detection with anomalize(). Typically, anomalize() is performed on the "remainder" of the time series decomposition.

For non-time series data (data without trend), the anomalize() function can be used without time series decomposition.

The anomalize() function offers two methods for outlier detection, each with its own benefits.

IQR:

The IQR method uses the interquartile range, spanning the 25th and 75th percentiles, to establish a baseline distribution around the median. With the default alpha = 0.05, the limits are established by expanding the 25/75 baseline by an IQR Factor of 3 (3X). The IQR Factor = 0.15 / alpha (hence 3X with alpha = 0.05). To increase the IQR Factor controlling the limits, decrease alpha, which makes it more difficult to be an outlier. Increase alpha to make it easier to be an outlier.

The IQR method is used in forecast::tsoutliers().
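The limit calculation described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the helper name iqr_limits_sketch() is hypothetical, and the quantiles use R's default interpolation.

```r
# Hypothetical helper: expands the 25/75 baseline by the IQR Factor (0.15 / alpha).
iqr_limits_sketch <- function(x, alpha = 0.05) {
  # 25th and 75th percentiles form the baseline around the median
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr_factor <- 0.15 / alpha            # 3X when alpha = 0.05
  spread <- (q[2] - q[1]) * iqr_factor  # interquartile range scaled by the factor
  c(lower = q[1] - spread, upper = q[2] + spread)
}

# Toy data: twenty ordinary values and one extreme value
x <- c(1:20, 100)
lims <- iqr_limits_sketch(x)

# Flag observations outside the limits, mirroring the Yes/No "anomaly" column
anomaly <- ifelse(x < lims[["lower"]] | x > lims[["upper"]], "Yes", "No")
```

Halving alpha doubles the IQR Factor, so the limits move further from the baseline and fewer points are flagged.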

GESD:

The GESD method (Generalized Extreme Studentized Deviate test) progressively eliminates outliers by comparing a test statistic to a critical value derived from Student's t-distribution. Each time an outlier is removed, the test statistic is updated. Once the test statistic drops below the critical value, all outliers are considered removed. Because this method involves continuous updating via a loop, it is slower than the IQR method. However, it tends to be the best performing method for outlier removal.

The GESD method is used in AnomalyDetection::AnomalyDetectionTs().
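The loop described above can be sketched in base R using the textbook formulation of the generalized ESD test. This is an illustration only: the function name gesd_sketch() is hypothetical, the mean and standard deviation are assumed as the location and scale estimates (the package may use different estimators), and the number of outliers is taken as the largest step at which the statistic exceeds its critical value.

```r
# Hypothetical helper: textbook generalized ESD test, not the package's code.
gesd_sketch <- function(x, alpha = 0.05, max_anoms = 0.2) {
  n <- length(x)
  r <- floor(n * max_anoms)   # upper bound on the number of outliers tested
  idx <- seq_len(n)           # positions still in the sample
  removed <- integer(r)       # positions removed, in order of removal
  stat <- numeric(r)          # test statistic at each step
  crit <- numeric(r)          # critical value at each step

  for (i in seq_len(r)) {
    vals <- x[idx]
    dev <- abs(vals - mean(vals))
    j <- which.max(dev)                  # most extreme remaining observation
    stat[i] <- dev[j] / stats::sd(vals)  # studentized deviate
    removed[i] <- idx[j]
    idx <- idx[-j]                       # drop it and recompute next round

    # Critical value from the t-distribution for step i
    p <- 1 - alpha / (2 * (n - i + 1))
    t_crit <- stats::qt(p, df = n - i - 1)
    crit[i] <- (n - i) * t_crit / sqrt((n - i - 1 + t_crit^2) * (n - i + 1))
  }

  exceed <- which(stat > crit)
  n_out <- if (length(exceed) > 0) max(exceed) else 0
  removed[seq_len(n_out)]     # positions of the detected outliers
}

# Toy data: nineteen values near 10 and one extreme value in position 20
x <- c(10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 9, 11, 10, 12, 9, 10, 11, 10, 50)
outliers <- gesd_sketch(x)
```

Because the statistic and critical value are recomputed after every removal, the cost grows with the number of candidate outliers, which is the source of the speed difference relative to the IQR method.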

References

  1. How to correct outliers once detected for time series data forecasting? Cross Validated, https://stats.stackexchange.com

  2. Simple algorithm for online outlier detection of a generic time series. Cross Validated, https://stats.stackexchange.com

  3. Owen S. Vallis, Jordan Hochenbaum and Arun Kejariwal (2014). A Novel Technique for Long-Term Anomaly Detection in the Cloud. Twitter Inc.

  4. Owen S. Vallis, Jordan Hochenbaum and Arun Kejariwal (2014). AnomalyDetection: Anomaly Detection Using Seasonal Hybrid Extreme Studentized Deviate Test. R package version 1.0.

  5. Alex T.C. Lau (November/December 2015). GESD - A Robust and Effective Technique for Dealing with Multiple Outliers. ASTM Standardization News. www.astm.org/sn

See also

Anomaly Detection Methods (Powers anomalize)

Time Series Anomaly Detection Functions (anomaly detection workflow):

Examples

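library(dplyr)
library(anomalize) # provides anomalize(), time_decompose() and the example data

# Needed to pass CRAN check / This is loaded by default
set_time_scale_template(time_scale_template())

data(tidyverse_cran_downloads)

# Decompose each package's download counts, then flag anomalies in the remainder
tidyverse_cran_downloads %>%
    time_decompose(count, method = "stl") %>%
    anomalize(remainder, method = "iqr")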
#> # A time tibble: 6,375 x 9
#> # Index:  date
#> # Groups: package [15]
#>    package date       observed season trend remainder remainder_l1 remainder_l2
#>    <chr>   <date>        <dbl>  <dbl> <dbl>     <dbl>        <dbl>        <dbl>
#>  1 tidyr   2017-01-01     873. -2761. 5053.    -1418.       -3748.        3708.
#>  2 tidyr   2017-01-02    1840.   901. 5047.    -4108.       -3748.        3708.
#>  3 tidyr   2017-01-03    2495.  1460. 5041.    -4006.       -3748.        3708.
#>  4 tidyr   2017-01-04    2906.  1430. 5035.    -3559.       -3748.        3708.
#>  5 tidyr   2017-01-05    2847.  1239. 5029.    -3421.       -3748.        3708.
#>  6 tidyr   2017-01-06    2756.   367. 5024.    -2635.       -3748.        3708.
#>  7 tidyr   2017-01-07    1439. -2635. 5018.     -944.       -3748.        3708.
#>  8 tidyr   2017-01-08    1556. -2761. 5012.     -695.       -3748.        3708.
#>  9 tidyr   2017-01-09    3678.   901. 5006.    -2229.       -3748.        3708.
#> 10 tidyr   2017-01-10    7086.  1460. 5000.      626.       -3748.        3708.
#> # ... with 6,365 more rows, and 1 more variable: anomaly <chr>