Anomaly Detection in Website Traffic

Anomalize: Breakdown, identify, and clean anomalies in 1 easy step

Anomalies, often called outliers, are data points that deviate significantly from the general trend or pattern in the data. In the context of time series, they can appear as sudden spikes, drops, or any abrupt change in a sequence of values.

Anomaly detection for time series is a technique used to identify unusual patterns that do not conform to expected behavior. It is especially relevant for sequential data (like stock prices, sensor data, sales data, etc.) where the temporal aspect is crucial. Anomalies can identify important events or be the cause of noise that can hinder forecasting performance.

1 This applied tutorial covers the use of:

  • tk.anomalize(): A single function that integrates time series decomposition, anomaly identification (scoring), and outlier cleaning.
  • tk.plot_anomalies_decomp(): The first step towards identifying if your anomalization is detecting outliers to your needs.
  • tk.plot_anomalies(): The second step to visualize the anomalies.
  • tk.plot_anomalies_cleaned(): Compare the data with anomalies removed (before and after)
How to navigate this guide

This applied tutorial is separated into 2 parts:

  1. We have a quick start section called “5-Minutes to Anomalize” for those looking to jump right in.
  2. We also have a detailed section on parameter adjustment for those looking to understand what nobs they can turn.

2 Five (5) Minutes to Anomalize

Load these libraries to get started.

import pytimetk as tk
import pandas as pd

Next, get some data. We’ll use the wikipedia_traffic_daily data set that comes with anomalize. This contains data on various websites

We’ll glimpse() the data to get a sense of what we are working with.

df = tk.load_dataset("wikipedia_traffic_daily", parse_dates = ['date'])

<class 'pandas.core.frame.DataFrame'>: 5500 rows of 3 columns
Page:   object            ['Death_of_Freddie_Gray_en.wikipedia.org_mobil ...
date:   datetime64[ns]    [Timestamp('2015-07-01 00:00:00'), Timestamp(' ...
value:  int64             [791, 704, 903, 732, 558, 504, 543, 1156, 1196 ...
3 Core Properties of Time Series Data

We can see that the wikipedia_traffic_daily dataset has:

  1. Time Series Index: date
  2. Value Column(s): The value column
  3. Group Column(s): The Page column

Next, plot the time series.

df \
    .groupby('Page') \
        date_column = "date", 
        value_column = "value",
        facet_ncol = 2,
        width = 800,
        height = 800,
        engine = 'plotly'
df \
    .groupby('Page') \
        date_column = "date", 
        value_column = "value",
        facet_ncol = 2,
        width = 800,
        height = 800,
        engine = 'plotnine'

<Figure Size: (800 x 800)>

We can see there are some spikes, but are these anomalies? Let’s use anomalize() to detect.

2.1 Anomalize: breakdown, identify, and clean in 1 easy step

The anomalize() function is a feature rich tool for performing anomaly detection. Anomalize is group-aware, so we can use this as part of a normal pandas groupby chain. In one easy step:

  • We breakdown (decompose) the time series
  • Analyze it’s remainder (residuals) for spikes (anomalies)
  • Clean the anomalies if desired
anomalize_df = df \
    .groupby('Page', sort = False) \
        date_column = "date", 
        value_column = "value", 

<class 'pandas.core.frame.DataFrame'>: 5500 rows of 13 columns
Page:               object            ['Death_of_Freddie_Gray_en.wikiped ...
date:               datetime64[ns]    [Timestamp('2015-07-01 00:00:00'), ...
observed:           int64             [791, 704, 903, 732, 558, 504, 543 ...
seasonal:           float64           [206.78723511550484, 4.04332698700 ...
seasadj:            float64           [584.2127648844952, 699.9566730129 ...
trend:              float64           [729.0301895900458, 726.0497757616 ...
remainder:          float64           [-144.8174247055506, -26.093102748 ...
anomaly:            object            ['No', 'No', 'No', 'No', 'No', 'No ...
anomaly_score:      float64           [266.9421236324138, 148.2178016755 ...
anomaly_direction:  int64             [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  ...
recomposed_l1:      float64           [266.05095141435606, 60.3266294574 ...
recomposed_l2:      float64           [1849.8332958504716, 1644.10897389 ...
observed_clean:     float64           [791.0, 704.0, 903.0, 732.0, 558.0 ...
The anomalize() function returns:
  1. The original grouping and datetime columns.
  2. The seasonal decomposition: observed, seasonal, seasadj, trend, and remainder. The objective is to remove trend and seasonality such that the remainder is stationary and representative of normal variation and anomalous variations.
  3. Anomaly identification and scoring: anomaly, anomaly_score, anomaly_direction. These identify the anomaly decision (Yes/No), score the anomaly as a distance from the centerline, and label the direction (-1 (down), zero (not anomalous), +1 (up)).
  4. Recomposition: recomposed_l1 and recomposed_l2. Think of these as the lower and upper bands. Any observed data that is below l1 or above l2 is anomalous.
  5. Cleaned data: observed_clean. Cleaned data is automatically provided, which has the outliers replaced with data that is within the recomposed l1/l2 boundaries. With that said, you should always first seek to understand why data is being considered anomalous before simply removing outliers and using the cleaned data.

The most important aspect is that this data is ready to be visualized, inspected, and modifications can then be made to address any tweaks you would like to make.

2.2 Visualization 1: Seasonal Decomposition Plot

The first step in my normal process is to analyze the seasonal decomposition. I want to see what the remainders look like, and make sure that the trend and seasonality are being removed such that the remainder is centered around zero.

What to do when the remainders have trend or seasonality?

We’ll cover how to tweak the nobs of anomalize() in the next section aptly named “How to tweak the nobs on anomalize”.

anomalize_df \
    .groupby("Page") \
        date_column = "date", 
        width = 1800,
        height = 1000,
        engine = 'plotly'