anomalize()
is used to detect anomalies in time series data,
either for a single time series or for multiple time series grouped by a specific column.
Usage
anomalize(
  .data,
  .date_var,
  .value,
  .frequency = "auto",
  .trend = "auto",
  .method = "stl",
  .iqr_alpha = 0.05,
  .clean_alpha = 0.75,
  .max_anomalies = 0.2,
  .message = TRUE
)
Arguments
- .data
A tibble or data.frame with a time-based column
- .date_var
A column containing either date or date-time values
- .value
A column containing numeric values
- .frequency
Controls the seasonal adjustment (removal of seasonality). Input can be either "auto", a time-based definition (e.g. "2 weeks"), or a numeric number of observations per frequency (e.g. 10). Refer to tk_get_frequency().
- .trend
Controls the trend component. For STL, .trend controls the sensitivity of the LOESS smoother used to extract the trend; removing the trend and seasonality leaves the remainder. Refer to tk_get_trend().
- .method
The outlier detection method. Default: "stl". Currently "stl" is the only method. "twitter" is planned.
- .iqr_alpha
Controls the width of the "normal" range. Lower values are more conservative: they widen the range, so fewer observations are flagged as anomalies. Higher values narrow the range, so more observations are flagged.
- .clean_alpha
Controls the threshold for cleaning the outliers. The default is 0.75, which means that the anomalies will be cleaned using the 0.75 * lower or upper bound of the recomposed time series, depending on the direction of the anomaly.
- .max_anomalies
The maximum share of observations permitted to be identified as anomalies, expressed as a proportion (the default 0.2 allows up to 20%).
- .message
A boolean. If TRUE, will output information related to automatic frequency and trend selection (if applicable).
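The .clean_alpha rule described above can be pictured with a few lines of base R. This is only a sketch of the documented behaviour, not the package's internal code: clean_observed() is a hypothetical helper, and the bounds stand in for recomposed_l1 and recomposed_l2.

```r
# Hypothetical helper illustrating the documented .clean_alpha rule;
# the package's own implementation may differ.
clean_observed <- function(observed, direction, lower, upper, clean_alpha = 0.75) {
    ifelse(
        direction == 1, clean_alpha * upper,          # high anomaly: use the scaled upper bound
        ifelse(direction == -1, clean_alpha * lower,  # low anomaly: use the scaled lower bound
               observed)                              # not an anomaly: keep the value
    )
}

# Made-up values for illustration only.
observed  <- c(100, 250, 95, -40)
direction <- c(0, 1, 0, -1)        # plays the role of anomaly_direction
lower     <- rep(-20, 4)           # stands in for recomposed_l1
upper     <- rep(200, 4)           # stands in for recomposed_l2

cleaned <- clean_observed(observed, direction, lower, upper)
cleaned
```

Only the two flagged observations change; the others pass through untouched.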
Value
A tibble or data.frame with the following columns:
- observed: original data
- season: seasonal component
- seasadj: seasonally adjusted data
- trend: trend component
- remainder: residual component
- anomaly: Yes/No flag for outlier detection
- anomaly_score: distance from centerline
- anomaly_direction: -1, 0, 1 indicator for direction of the anomaly
- recomposed_l1: lower level bound of recomposed time series
- recomposed_l2: upper level bound of recomposed time series
- observed_clean: original data with anomalies interpolated
Details
The anomalize() method for anomaly detection implements a 2-step process to detect outliers in time series.
Step 1: Detrend & Remove Seasonality using STL Decomposition
The decomposition separates the "season" and "trend" components from the "observed" values leaving the "remainder" for anomaly detection.
The user can control two parameters: frequency and trend.
- .frequency: Adjusts the "season" component that is removed from the "observed" values.
- .trend: Adjusts the trend window (the t.window parameter from stats::stl()) that is used.
The user may supply both .frequency and .trend as time-based durations (e.g. "6 weeks"), numeric values (e.g. 180), or "auto", which predetermines the frequency and/or trend based on the scale of the time series using tk_time_scale_template().
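To make Step 1 concrete, here is a minimal base R sketch of an STL decomposition on a synthetic series. The data are made up, and the choices s.window = "periodic" and robust = TRUE are assumptions for illustration rather than settings confirmed by this page. The ts frequency plays the role of .frequency and t.window plays the role of .trend.

```r
# Synthetic series: linear trend + 13-observation seasonal cycle + noise.
set.seed(42)
n <- 156
idx <- seq_len(n)
observed <- 100 + 0.5 * idx + 10 * sin(2 * pi * idx / 13) + rnorm(n, sd = 2)

# frequency = 13 observations per cycle; t.window should be odd for stats::stl(),
# so 53 is used here in place of 52.
series <- ts(observed, frequency = 13)
fit <- stats::stl(series, s.window = "periodic", t.window = 53, robust = TRUE)

# Collect the components under the column names used in the Value section.
components <- data.frame(
    observed  = observed,
    season    = as.numeric(fit$time.series[, "seasonal"]),
    trend     = as.numeric(fit$time.series[, "trend"]),
    remainder = as.numeric(fit$time.series[, "remainder"])
)
components$seasadj <- components$observed - components$season

head(components)
```

The decomposition is additive, so season + trend + remainder reproduces the observed values, and the remainder column is what Step 2 works on.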
Step 2: Anomaly Detection
Once "trend" and "season" (seasonality) are removed, anomaly detection is performed on the "remainder". Anomalies are identified, and boundaries (recomposed_l1 and recomposed_l2) are determined.
The Anomaly Detection Method uses the interquartile range (IQR), that is, the span from the 25th to the 75th percentile (+/-25 percentiles around the median).
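The flagging and recomposition described above can be sketched as follows. The numbers are invented, the remainder limits are simply assumed here (the IQR discussion explains how they can be derived), and adding season and trend back onto the limits is an assumption about how the recomposed bounds are formed.

```r
# Toy decomposition output (made-up numbers, not package output).
season    <- c(  5,  -5,   5,  -5,   5)
trend     <- c(100, 101, 102, 103, 104)
remainder <- c(  1,  -2,  30,   0, -25)

# Hypothetical limits on the remainder.
remainder_l1 <- -10
remainder_l2 <-  10

# Direction: 1 above the upper limit, -1 below the lower limit, 0 otherwise.
anomaly_direction <- ifelse(remainder > remainder_l2, 1,
                            ifelse(remainder < remainder_l1, -1, 0))
anomaly <- ifelse(anomaly_direction == 0, "No", "Yes")

# Put the limits back on the scale of the observed data.
recomposed_l1 <- season + trend + remainder_l1
recomposed_l2 <- season + trend + remainder_l2

data.frame(remainder, anomaly, anomaly_direction, recomposed_l1, recomposed_l2)
```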
IQR Adjustment, alpha parameter
With the default alpha = 0.05 (the .iqr_alpha argument), the limits are established by expanding the 25/75 baseline by an IQR Factor of 3 (3X). The IQR Factor = 0.15 / alpha (hence 3X with alpha = 0.05):
- To increase the IQR Factor controlling the limits, decrease alpha, which makes it more difficult to be an outlier.
- Increase alpha to make it easier to be an outlier.
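The arithmetic above can be checked with a small base R sketch. iqr_limits() is a hypothetical helper written for this illustration; it covers only the IQR Factor rule and ignores .max_anomalies.

```r
# Hypothetical helper: expand the 25/75 baseline by the IQR Factor (0.15 / alpha).
iqr_limits <- function(remainder, alpha = 0.05) {
    quartiles  <- stats::quantile(remainder, probs = c(0.25, 0.75), na.rm = TRUE)
    iqr_factor <- 0.15 / alpha
    spread     <- (quartiles[[2]] - quartiles[[1]]) * iqr_factor
    c(lower = quartiles[[1]] - spread, upper = quartiles[[2]] + spread)
}

# For the values 1..9 the quartiles are 3 and 7, so the IQR is 4.
default_limits <- iqr_limits(1:9)                 # IQR Factor 3: limits -9 and 19
narrow_limits  <- iqr_limits(1:9, alpha = 0.15)   # IQR Factor 1: limits -1 and 11

default_limits
narrow_limits
```

Raising alpha from 0.05 to 0.15 shrinks the factor from 3 to 1, which narrows the limits and makes it easier for a point to be flagged.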
The IQR outlier detection method is used in forecast::tsoutliers(). A similar outlier detection method is used by Twitter's AnomalyDetection package. Both Twitter and Forecast tsoutliers methods have been implemented in Business Science's anomalize package.
References
Cleveland, R. B., Cleveland, W. S., McRae, J. E., and Terpenning, I. (1990). STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, Vol. 6, No. 1, pp. 3-73.
Vallis, O. S., Hochenbaum, J., and Kejariwal, A. (2014). A Novel Technique for Long-Term Anomaly Detection in the Cloud. Twitter Inc.
Examples
library(dplyr)
#>
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#>
#> filter, lag
#> The following objects are masked from ‘package:base’:
#>
#> intersect, setdiff, setequal, union
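# anomalize() and walmart_sales_weekly are assumed to come from timetk
library(timetk)

# Detect anomalies separately for each of the two selected series
walmart_sales_weekly %>%
    filter(id %in% c("1_1", "1_3")) %>%
    group_by(id) %>%
    anomalize(Date, Weekly_Sales)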
#> frequency = 13 observations per 1 quarter
#> trend = 52 observations per 1 year
#> frequency = 13 observations per 1 quarter
#> trend = 52 observations per 1 year
#> # A tibble: 286 × 13
#> # Groups: id [2]
#>    id    Date       observed season  trend remainder seasadj anomaly
#>    <fct> <date>        <dbl>  <dbl>  <dbl>     <dbl>   <dbl> <chr>
#>  1 1_1   2010-02-05   24924.   874. 19967.     4083.  24050. No
#>  2 1_1   2010-02-12   46039.  -698. 19835.    26902.  46737. Yes
#>  3 1_1   2010-02-19   41596. -1216. 19703.    23108.  42812. Yes
#>  4 1_1   2010-02-26   19404.  -821. 19571.      653.  20224. No
#>  5 1_1   2010-03-05   21828.   324. 19439.     2064.  21504. No
#>  6 1_1   2010-03-12   21043.   471. 19307.     1265.  20572. No
#>  7 1_1   2010-03-19   22137.   920. 19175.     2041.  21217. No
#>  8 1_1   2010-03-26   26229.   752. 19069.     6409.  25478. No
#>  9 1_1   2010-04-02   57258.   503. 18962.    37794.  56755. Yes
#> 10 1_1   2010-04-09   42961.  1132. 18855.    22974.  41829. Yes
#> # ℹ 276 more rows
#> # ℹ 5 more variables: anomaly_direction <dbl>, anomaly_score <dbl>,
#> # recomposed_l1 <dbl>, recomposed_l2 <dbl>, observed_clean <dbl>