Anomaly Detection
Anomaly detection in time series analysis is a crucial process for identifying unusual patterns that deviate from expected behavior. These anomalies can signify critical, often unforeseen events in time series data. Effective anomaly detection helps in maintaining the quality and reliability of data, ensuring accurate forecasting and decision-making. The challenge lies in distinguishing between true anomalies and natural fluctuations, which demands sophisticated analytical techniques and a deep understanding of the underlying time series patterns. As a result, anomaly detection is an essential component of time series analysis, driving the proactive management of risks and opportunities in dynamic environments.
Pytimetk uses the following methods to determine anomalies in time series data:
Decomposition of Time Series:

- The first step is to decompose the time series into several components. Commonly, this includes trend, seasonality, and remainder (or residual) components.
- Trend represents the underlying pattern or direction in the data over time. Seasonality captures recurring patterns or cycles over a specific period, such as daily, weekly, or monthly.
- The remainder (or residual) is what's left after the trend and seasonal components have been removed from the original time series.

Generating Remainders:

- After decomposition, the remainder component is extracted. This component reflects the part of the time series that cannot be explained by the trend and seasonal components.
- The idea is that while trend and seasonality represent predictable and thus "normal" patterns, the remainder is where anomalies are most likely to manifest (a short sketch follows this list).
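To make the decomposition step concrete, here is a minimal sketch using statsmodels' STL on a synthetic daily series. This is illustrative only: anomalize() performs its own decomposition internally, and the synthetic data and parameter choices below are assumptions for demonstration.

# Minimal decomposition sketch with statsmodels' STL (illustrative only;
# anomalize() performs its own decomposition internally)
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Synthetic daily series: trend + weekly seasonality + noise
dates = pd.date_range('2015-01-01', periods = 365, freq = 'D')
y = pd.Series(
    np.linspace(100, 120, 365)                        # trend
    + 5 * np.sin(2 * np.pi * np.arange(365) / 7)      # weekly seasonality
    + np.random.default_rng(42).normal(0, 1, 365),    # noise
    index = dates
)

result = STL(y, period = 7).fit()
remainder = result.resid   # anomalies are flagged from this component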
There are two common techniques for seasonal decomposition: STL and Twitter.
STL (Seasonal-Trend decomposition using Loess) is a versatile and robust method for decomposing time series. STL works very well when a long-term trend is present, and the loess algorithm typically does a very good job of detecting it. However, in circumstances where the seasonal component is more dominant than the trend, Twitter tends to perform better.

The Twitter method is similar to the decomposition used in Twitter's AnomalyDetection package. It works identically to STL for removing the seasonal component. The main difference is in removing the trend, which is done by subtracting the median of the data rather than fitting a smoother. The median works well when a long-term trend is less dominant than the short-term seasonal component, because the smoother tends to overfit the anomalies.
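The difference between the two detrending strategies can be sketched directly. Continuing from the synthetic series y and the fitted result above (again an illustration of the idea, not pytimetk's exact implementation):

# Smoother-based remainder (STL): subtract the fitted loess trend
stl_remainder = y - result.trend - result.seasonal

# Median-based remainder (Twitter-style): subtract the global median instead
# of a fitted trend; robust when seasonality dominates a weak long-term trend
twitter_remainder = y - y.median() - result.seasonal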
1 Anomaly Detection in Pytimetk
This section will demonstrate how to use the following set of anomalize functions in pytimetk:

- anomalize()
- plot_anomalies()
- plot_anomalies_decomp()
- plot_anomalies_cleaned()
1.1 Setup

To set up, import the necessary packages and the m4_daily_df dataset:

# libraries
import pytimetk as tk
import pandas as pd
import numpy as np

# Import Data
m4_daily_df = tk.load_dataset('m4_daily', parse_dates = ['date'])
Let's first demonstrate with a single time series. We'll filter m4_daily_df for id = 'D10' and date within the year 2015.
# Data filtering
df = (
    m4_daily_df
        .query("id == 'D10'")
        .query("date.dt.year == 2015")
)
We can plot this data to see the trend:
# Plot data
tk.plot_timeseries(
    data         = df,
    date_column  = 'date',
    value_column = 'value'
)
1.2 Seasonal Decomposition & Remainder
First, we perform seasonal decomposition on the data and generate remainders using anomalize().
anomalize()
Use help(tk.anomalize) to review additional helpful documentation.
# Anomalize
anomalize_df = tk.anomalize(
    data         = df,
    date_column  = 'date',
    value_column = 'value',
    period       = 7,
    iqr_alpha    = 0.05, # using the default
    clean_alpha  = 0.75, # using the default
    clean        = "min_max"
)
anomalize_df.glimpse()
<class 'pandas.core.frame.DataFrame'>: 365 rows of 12 columns
date: datetime64[ns] [Timestamp('2015-01-01 00:00:00'), ...
observed: float64 [2351.0, 2302.7, 2300.7, 2341.2, 2 ...
seasonal: float64 [14.163009085035995, -17.341946034 ...
seasadj: float64 [2336.836990914964, 2320.041946034 ...
trend: float64 [2323.900317851228, 2322.996460334 ...
remainder: float64 [12.93667306373618, -2.95451429904 ...
anomaly: object ['No', 'No', 'No', 'No', 'No', 'No ...
anomaly_score: float64 [19.42215274680143, 35.31334010958 ...
anomaly_direction: int64 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
recomposed_l1: float64 [2179.860403909094, 2147.451591271 ...
recomposed_l2: float64 [2560.9839015845087, 2528.57508894 ...
observed_clean: float64 [2351.0, 2302.7, 2300.7, 2341.2, 2 ...
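The anomaly column flags each observation. Standard pandas filtering pulls out just the flagged rows for inspection; the 'Yes' label below is an assumption inferred from the 'No' values visible in the glimpse output:

# Inspect flagged rows (assumes the anomaly labels are 'Yes'/'No')
anomalies = anomalize_df.query("anomaly == 'Yes'")
anomalies[['date', 'observed', 'anomaly_score']].head()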
1.3 Plot Seasonal Decomposition
We plot the seasonal decomposition to get a visual representation:
plot_anomalies_decomp()
Use help(tk.plot_anomalies_decomp) to review additional helpful documentation.
# Plot seasonal decomposition
tk.plot_anomalies_decomp(
    data        = anomalize_df,
    date_column = 'date',
    engine      = 'plotly',
    title       = 'Seasonal Decomposition'
)
1.4 Plot Anomalies
Next, we can plot the anomalies using tk.plot_anomalies():
plot_anomalies()
Use help(tk.plot_anomalies) to review additional helpful documentation.
# Plot anomalies
tk.plot_anomalies(
    data        = anomalize_df,
    date_column = 'date',
    engine      = 'plotly',
    title       = 'Plot Anomaly Bands'
)
1.5 Plot Cleaned Anomalies
Finally, we can see a plot of the data with the anomalies cleaned using plot_anomalies_cleaned():
plot_anomalies_cleaned()
Use help(tk.plot_anomalies_cleaned) to review additional helpful documentation.
# Plot cleaned anomalies
tk.plot_anomalies_cleaned(
    data        = anomalize_df,
    date_column = 'date'
)
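Beyond plotting, the cleaned values can be used directly. A small sketch, assuming you want the cleaned series (the observed_clean column from the glimpse output) for downstream work such as forecasting:

# Extract the cleaned series for downstream use; 'observed_clean' holds
# the observed values with anomalies replaced
cleaned_df = anomalize_df[['date', 'observed_clean']].rename(
    columns = {'observed_clean': 'value'}
)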
1.6 Changing Parameters
Some important parameters to highlight in the anomalize() function include iqr_alpha.
iqr_alpha controls the threshold for detecting outliers. It is the significance level used in the interquartile range (IQR) method for outlier detection. The default value is 0.05, which corresponds to a 5% significance level. A lower significance level results in a higher threshold, meaning fewer outliers are detected; a higher significance level results in a lower threshold, meaning more outliers are detected.
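As a rough sketch of the mechanics, an IQR-based detector widens or narrows its fences as the alpha level changes. The 0.15 / alpha scaling below (so alpha = 0.05 gives the classic 3x IQR fence) follows the convention of the original R anomalize package and is an assumption about pytimetk's internals, not confirmed behavior:

# Rough sketch of an IQR-based outlier fence (the exact scaling that
# anomalize() applies to iqr_alpha is assumed, not confirmed)
import numpy as np

def iqr_limits(x, alpha = 0.05):
    """Return (lower, upper) fences; smaller alpha -> wider fences -> fewer outliers."""
    q1, q3 = np.percentile(x, [25, 75])
    multiplier = 0.15 / alpha   # alpha = 0.05 -> 3.0x IQR
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

lower, upper = iqr_limits(df['value'], alpha = 0.05)
outliers = df.query("value < @lower or value > @upper")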
Let's visualize the effect of changing the iqr_alpha parameter.

Changing iqr_alpha

First, let's build a dataframe with multiple values of iqr_alpha:
# Anomalized data with multiple iqr_alpha values

# - Alpha values
iqr_alpha_values = [0.05, 0.10, 0.15, 0.20]

# - Empty dataframes list
dfs = []

for alpha in iqr_alpha_values:

    # - Run anomalize function
    anomalize_df = tk.anomalize(
        data         = df,
        date_column  = 'date',
        value_column = 'value',
        period       = 7,
        iqr_alpha    = alpha
    )

    # - Add the iqr_alpha column
    anomalize_df['iqr_alpha'] = f'iqr_alpha value of {alpha}'

    # - Append to the list
    dfs.append(anomalize_df)

# - Concatenate all dataframes
final_df = pd.concat(dfs)
Now we can visualize the anomalies:
Visualizing Grouped Anomalies (Facets)
# Visualize
(
    final_df
        .groupby('iqr_alpha')
        .plot_anomalies(
            date_column = 'date',
            engine      = 'plotly',
            facet_ncol  = 2
        )
)
Visualizing Grouped Anomalies (Plotly Dropdown)
# Visualize
(
    final_df
        .groupby('iqr_alpha')
        .plot_anomalies(
            date_column       = 'date',
            engine            = 'plotly',
            plotly_dropdown   = True,
            plotly_dropdown_x = 1,
            plotly_dropdown_y = 0.60
        )
)
2 More Coming Soon…

We are in the early stages of development, but the potential for pytimetk in Python is already clear. 🐍
- Please ⭐ us on GitHub (it takes 2 seconds and means a lot).
- To make requests, please see our Project Roadmap GH Issue #2.
- Want to contribute? See our contributing guide here.