anomalize() is used to detect anomalies in time series data,
either for a single time series or for multiple time series grouped by a specific column.
Usage
anomalize(
  .data,
  .date_var,
  .value,
  .frequency = "auto",
  .trend = "auto",
  .method = "stl",
  .iqr_alpha = 0.05,
  .clean_alpha = 0.75,
  .max_anomalies = 0.2,
  .message = TRUE
)
Arguments
- .data
- A tibble or data.frame with a time-based column
- .date_var
- A column containing either date or date-time values 
- .value
- A column containing numeric values 
- .frequency
- Controls the seasonal adjustment (removal of seasonality). Input can be either "auto", a time-based definition (e.g. "2 weeks"), or a numeric number of observations per frequency (e.g. 10). Refer to tk_get_frequency().
- .trend
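- Controls the trend component. For STL, .trend controls the sensitivity of the LOESS smoother, which extracts the trend and leaves the remainder. Input can be "auto", a time-based definition (e.g. "6 weeks"), or a numeric number of observations (e.g. 180). Refer to tk_get_trend().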
- .method
- The outlier detection method. Default: "stl". Currently "stl" is the only method. "twitter" is planned. 
- .iqr_alpha
- Controls the width of the "normal" range. Lower values are more conservative (a wider range, so fewer observations are flagged), while higher values narrow the range and flag more observations as anomalies.
- .clean_alpha
- Controls the threshold for cleaning the outliers. The default is 0.75, which means that the anomalies will be cleaned using the 0.75 * lower or upper bound of the recomposed time series, depending on the direction of the anomaly. 
- .max_anomalies
- The maximum proportion of observations permitted to be identified as anomalies (the default 0.2 allows up to 20%).
- .message
- A boolean. If TRUE, will output information related to automatic frequency and trend selection (if applicable).
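The .clean_alpha description can be read literally as the following base R sketch. The helper clean_anomalies() and its inputs are hypothetical and only illustrate the wording above; this is not the package's internal code, and the bounds are made-up constants standing in for recomposed_l1 and recomposed_l2.

```r
# Hypothetical helper: a literal reading of the .clean_alpha description.
# direction: -1 (below the lower bound), 0 (normal), 1 (above the upper bound)
clean_anomalies <- function(observed, direction, lower, upper, clean_alpha = 0.75) {
  cleaned <- observed
  low  <- direction == -1
  high <- direction == 1
  cleaned[low]  <- clean_alpha * lower[low]    # low anomalies use the lower bound
  cleaned[high] <- clean_alpha * upper[high]   # high anomalies use the upper bound
  cleaned
}

# Toy inputs (assumed values, not real output from anomalize())
observed  <- c(100, 102, 250, 98, 10)
direction <- c(0, 0, 1, 0, -1)
lower     <- rep(80, 5)
upper     <- rep(120, 5)

cleaned <- clean_anomalies(observed, direction, lower, upper)
print(cleaned)
```

Observations with direction 0 pass through unchanged; only flagged points are replaced.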
Value
A tibble or data.frame with the following columns:
- observed: original data 
- seasonal: seasonal component 
- seasadj: seasonally adjusted values 
- trend: trend component 
- remainder: residual component 
- anomaly: Yes/No flag for outlier detection 
- anomaly_score: distance from centerline 
- anomaly_direction: -1, 0, 1 indicator for the direction of the anomaly 
- recomposed_l1: lower level bound of recomposed time series 
- recomposed_l2: upper level bound of recomposed time series 
- observed_clean: original data with anomalies interpolated 
Details
The anomalize() method for anomaly detection implements a 2-step process to
detect outliers in time series.
Step 1: Detrend & Remove Seasonality using STL Decomposition
The decomposition separates the "season" and "trend" components from the "observed" values leaving the "remainder" for anomaly detection.
The user can control two parameters: frequency and trend.
- .frequency: Adjusts the "season" component that is removed from the "observed" values.
- .trend: Adjusts the trend window (the t.window parameter of stats::stl()) that is used.
The user may supply both .frequency and .trend as time-based durations (e.g. "6 weeks"),
numeric values (e.g. 180), or "auto", which predetermines the frequency and/or trend based on
the scale of the time series using tk_time_scale_template().
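As a rough illustration of Step 1, the sketch below decomposes a synthetic weekly series with stats::stl() directly. The series, the frequency of 13, and the t.window of 53 are assumptions chosen for the example; anomalize() selects its own values when "auto" is used, and its internals may differ from this sketch.

```r
set.seed(123)

# Synthetic series (assumed data): trend + 13-observation cycle + noise
n <- 104
t_idx <- seq_len(n)
observed <- 100 + 0.5 * t_idx +
  10 * sin(2 * pi * t_idx / 13) +
  rnorm(n, sd = 2)

# "frequency" here is the number of observations per seasonal cycle
series <- ts(observed, frequency = 13)

# t.window is the span of the trend LOESS smoother (an odd number)
fit <- stats::stl(series, s.window = "periodic", t.window = 53)

# Columns returned by stl(): seasonal, trend, remainder
components <- as.data.frame(fit$time.series)
components$observed <- observed
components$seasadj  <- components$observed - components$seasonal

head(components)
```

A larger t.window gives a smoother trend and pushes more short-term movement into the remainder, which is the component that Step 2 inspects.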
Step 2: Anomaly Detection
Once "trend" and "season" (seasonality) are removed, anomaly detection is performed on the "remainder". Anomalies are identified, and boundaries (recomposed_l1 and recomposed_l2) are determined.
The anomaly detection method uses the interquartile range (IQR), the range between the 25th and 75th percentiles (+/-25 percentile points around the median).
IQR Adjustment, alpha parameter
With the default alpha = 0.05 (the .iqr_alpha argument), the limits are established by expanding
the 25/75 baseline by an IQR Factor of 3 (3X).
The IQR Factor = 0.15 / alpha (hence 3X with alpha = 0.05):
- To increase the IQR Factor controlling the limits, decrease the alpha, which makes it more difficult to be an outlier. 
- Increase alpha to make it easier to be an outlier. 
- The IQR outlier detection method is used in forecast::tsoutliers().
- A similar outlier detection method is used by Twitter's AnomalyDetection package.
- Both the Twitter and forecast tsoutliers methods have been implemented in Business Science's anomalize package.
References
- Cleveland, R. B., Cleveland, W. S., McRae, J. E., and Terpenning, I. (1990). STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, Vol. 6, No. 1, pp. 3-73. 
- Owen S. Vallis, Jordan Hochenbaum and Arun Kejariwal (2014). A Novel Technique for Long-Term Anomaly Detection in the Cloud. Twitter Inc. 
Examples
library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
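library(timetk)  # provides anomalize() and the walmart_sales_weekly data set

# Detect anomalies separately for two of the weekly sales series
walmart_sales_weekly %>%
    filter(id %in% c("1_1", "1_3")) %>%
    group_by(id) %>%
    anomalize(Date, Weekly_Sales)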
#> frequency = 13 observations per 1 quarter
#> trend = 52 observations per 1 year
#> frequency = 13 observations per 1 quarter
#> trend = 52 observations per 1 year
#> # A tibble: 286 × 13
#> # Groups:   id [2]
#>    id    Date       observed season  trend remainder seasadj anomaly
#>    <fct> <date>        <dbl>  <dbl>  <dbl>     <dbl>   <dbl> <chr>  
#>  1 1_1   2010-02-05   24924.   874. 19967.     4083.  24050. No     
#>  2 1_1   2010-02-12   46039.  -698. 19835.    26902.  46737. Yes    
#>  3 1_1   2010-02-19   41596. -1216. 19703.    23108.  42812. Yes    
#>  4 1_1   2010-02-26   19404.  -821. 19571.      653.  20224. No     
#>  5 1_1   2010-03-05   21828.   324. 19439.     2064.  21504. No     
#>  6 1_1   2010-03-12   21043.   471. 19307.     1265.  20572. No     
#>  7 1_1   2010-03-19   22137.   920. 19175.     2041.  21217. No     
#>  8 1_1   2010-03-26   26229.   752. 19069.     6409.  25478. No     
#>  9 1_1   2010-04-02   57258.   503. 18962.    37794.  56755. Yes    
#> 10 1_1   2010-04-09   42961.  1132. 18855.    22974.  41829. Yes    
#> # ℹ 276 more rows
#> # ℹ 5 more variables: anomaly_direction <dbl>, anomaly_score <dbl>,
#> #   recomposed_l1 <dbl>, recomposed_l2 <dbl>, observed_clean <dbl>
