Anomalize: Break down, identify, and clean anomalies in 1 easy step
Anomalies, often called outliers, are data points that deviate significantly from the general trend or pattern in the data. In the context of time series, they can appear as sudden spikes, drops, or any abrupt change in a sequence of values.
Anomaly detection for time series is a technique used to identify unusual patterns that do not conform to expected behavior. It is especially relevant for sequential data (like stock prices, sensor data, sales data, etc.) where the temporal aspect is crucial. Anomalies can signal important events, or they can be noise that hinders forecasting performance.
1 This applied tutorial covers the use of:
tk.anomalize(): A single function that integrates time series decomposition, anomaly identification (scoring), and outlier cleaning.
tk.plot_anomalies_decomp(): The first step in checking whether the anomaly detection is flagging outliers the way you need.
tk.plot_anomalies(): The second step, which visualizes the anomalies.
tk.plot_anomalies_cleaned(): Compares the data before and after the anomalies are removed.
How to navigate this guide
This applied tutorial is separated into 2 parts:
We have a quick start section called “5-Minutes to Anomalize” for those looking to jump right in.
We also have a detailed section on parameter adjustment for those looking to understand what knobs they can turn.
2 Five (5) Minutes to Anomalize
Load these libraries to get started.
import pytimetk as tk
import pandas as pd
Next, get some data. We’ll use the wikipedia_traffic_daily data set that comes with pytimetk. It contains daily traffic data for various web pages.
We’ll glimpse() the data to get a sense of what we are working with.
We can see there are some spikes, but are these anomalies? Let’s use anomalize() to find out.
2.1 Anomalize: break down, identify, and clean in 1 easy step
The anomalize() function is a feature-rich tool for performing anomaly detection. It is group-aware, so we can use it as part of a normal pandas groupby chain. In one easy step:
We break down (decompose) the time series
We analyze its remainder (residuals) for spikes (anomalies)
We clean the anomalies, if desired
anomalize() returns the following groups of columns:
The seasonal decomposition: observed, seasonal, seasadj, trend, and remainder. The objective is to remove trend and seasonality so that the remainder is stationary and captures both normal and anomalous variation.
Anomaly identification and scoring: anomaly, anomaly_score, anomaly_direction. These identify the anomaly decision (Yes/No), score the anomaly as a distance from the centerline, and label the direction: -1 (down), 0 (not anomalous), or +1 (up).
Recomposition: recomposed_l1 and recomposed_l2. Think of these as the lower and upper bands. Any observed data that is below l1 or above l2 is anomalous.
Cleaned data: observed_clean. Cleaned data is automatically provided, which has the outliers replaced with data that is within the recomposed l1/l2 boundaries. With that said, you should always first seek to understand why data is being considered anomalous before simply removing outliers and using the cleaned data.
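To make these column groups concrete, here is a toy sketch in plain Python on synthetic data. It is not pytimetk's implementation. The rolling-median trend, the per-weekday median seasonality, the IQR multiplier of 3, the zero centerline for the score, and cleaning by clipping to the bands are all simplifying assumptions chosen for illustration. The variable names mirror the output columns described above.

```python
import math
import random
import statistics

PERIOD = 7        # assumed weekly seasonality
IQR_FACTOR = 3.0  # assumed band-width multiplier (illustrative only)
SPIKE_AT = 40     # index of the injected anomaly

# Synthetic daily series: slow trend + weekly pattern + noise + one spike.
rng = random.Random(0)
n = 84
observed = [
    100 + 0.05 * t + 3 * math.sin(2 * math.pi * t / PERIOD) + rng.gauss(0, 1)
    for t in range(n)
]
observed[SPIKE_AT] += 100

# 1. Decomposition: a rolling median for the trend (robust to the spike),
#    then the median of the detrended values at each weekday position.
half = PERIOD // 2
trend = [
    statistics.median(observed[max(0, t - half): t + half + 1])
    for t in range(n)
]
detrended = [o - tr for o, tr in zip(observed, trend)]
phase_median = [statistics.median(detrended[p::PERIOD]) for p in range(PERIOD)]
seasonal = [phase_median[t % PERIOD] for t in range(n)]
seasadj = [o - s for o, s in zip(observed, seasonal)]
remainder = [o - tr - s for o, tr, s in zip(observed, trend, seasonal)]

# 2. Limits on the remainder, based on its interquartile range.
q1, _, q3 = statistics.quantiles(remainder, n=4)
iqr = q3 - q1
remainder_l1 = q1 - IQR_FACTOR * iqr
remainder_l2 = q3 + IQR_FACTOR * iqr

# 3. Recomposition: add trend and seasonality back to get bands on the
#    original scale.
recomposed_l1 = [tr + s + remainder_l1 for tr, s in zip(trend, seasonal)]
recomposed_l2 = [tr + s + remainder_l2 for tr, s in zip(trend, seasonal)]

# 4. Identification: anything outside the bands is anomalous.
anomaly_direction = [
    -1 if o < lo else (1 if o > hi else 0)
    for o, lo, hi in zip(observed, recomposed_l1, recomposed_l2)
]
anomaly = ["Yes" if d != 0 else "No" for d in anomaly_direction]
anomaly_score = [abs(r) for r in remainder]  # distance from an assumed centerline of 0

# 5. Cleaning: pull anomalous points back to the nearest band.
observed_clean = [
    min(max(o, lo), hi)
    for o, lo, hi in zip(observed, recomposed_l1, recomposed_l2)
]

print("Anomalous indices:", [t for t, a in enumerate(anomaly) if a == "Yes"])
```

Because the trend here is a rolling median rather than a mean, a single spike barely moves it, so the spike shows up almost entirely in the remainder, which is where the bands are applied.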
Most importantly, this data is ready to be visualized and inspected, and you can then make whatever adjustments you need.
2.2 Visualization 1: Seasonal Decomposition Plot
The first step in my normal process is to analyze the seasonal decomposition. I want to see what the remainders look like, and make sure that the trend and seasonality are being removed such that the remainder is centered around zero.
What to do when the remainders have trend or seasonality?
We’ll cover how to tweak the knobs of anomalize() in the next section, aptly named “How to tweak the knobs on anomalize”.
Once I’m satisfied with the remainders, my next step is to visualize the anomalies. Here I’m looking to see if I need to grow or shrink the remainder l1 and l2 bands, which classify anomalies.
There are pros and cons to cleaning anomalies. I’ll leave that discussion for another time. But, should you be interested in seeing what your data looks like cleaned (with outliers removed), this plot will help you compare before and after.