anomalize

anomalize(data, date_column, value_column, period=None, trend=None, method='stl', decomp='additive', clean='min_max', iqr_alpha=0.05, clean_alpha=0.75, max_anomalies=0.2, bind_data=False, reduce_memory=False, threads=1, show_progress=True, verbose=False)

Detects anomalies in time series data, either for a single time series or for multiple time series grouped by a specific column.

Parameters

Name Type Description Default
data Union[pd.DataFrame, pd.core.groupby.generic.DataFrameGroupBy] The input data, which can be either a pandas DataFrame or a pandas DataFrameGroupBy object. required
date_column str The name of the column in the data that contains the dates or timestamps. required
value_column str The name of the column in the data that contains the values to be analyzed for anomalies. required
period Optional[int] The period parameter specifies the length of the seasonal component in the time series. It is used in the decomposition process to separate the time series into its seasonal, trend, and remainder components. If not specified, the function will automatically determine the period based on the data. None
trend Optional[int] The trend parameter is an optional integer that specifies the length of the moving average window used for trend estimation. If trend is set to None, the function will automatically determine the trend window based on the data. None
method str The method parameter determines the method used for anomaly detection. The available methods are 'stl' (the default) and 'twitter'. More anomaly detection methods will be added in upcoming releases. 'stl'
decomp str The decomp parameter specifies the type of decomposition to use for time series decomposition. It can take two values: 1. 'additive' - This is the default value. It specifies that the time series will be decomposed using an additive model. 2. 'multiplicative' - This specifies that the time series will be decomposed using a multiplicative model. 'additive'
clean str The clean parameter specifies the method used to clean the anomalies. It can take two values: 1. 'min_max' - This specifies that the anomalies will be cleaned using the min-max method, which replaces each anomaly with 0.75 * the lower or upper bound of the recomposed time series, depending on the direction of the anomaly. The 0.75 multiplier can be adjusted using the clean_alpha parameter. 2. 'linear' - This specifies that the anomalies will be cleaned using linear interpolation. 'min_max'
iqr_alpha float The iqr_alpha parameter is used to determine the threshold for detecting outliers. It is the significance level used in the interquartile range (IQR) method for outlier detection. - The default value is 0.05, which corresponds to a 5% significance level. - A lower significance level will result in a higher threshold, which means fewer outliers will be detected. - A higher significance level will result in a lower threshold, which means more outliers will be detected. See the sketch after this table. 0.05
clean_alpha float The clean_alpha parameter is used to determine the threshold for cleaning the outliers. The default is 0.75, which means that the anomalies will be cleaned using the 0.75 * lower or upper bound of the recomposed time series, depending on the direction of the anomaly. 0.75
max_anomalies float The max_anomalies parameter is used to specify the maximum percentage of anomalies allowed in the data. It is a float value between 0 and 1. For example, if max_anomalies is set to 0.2, it means that the function will identify and remove outliers until the percentage of outliers in the data is less than or equal to 20%. The default value is 0.2. 0.2
bind_data bool The bind_data parameter determines whether the original data will be included in the output. If set to True, the original data columns will be included in the output dataframe. If set to False, only the anomaly detection results will be included. False
reduce_memory bool The reduce_memory parameter specifies whether to reduce the memory usage of the DataFrame by converting int and float columns to smaller dtypes and str columns to categorical. This reduces memory for large data but may reduce float precision and will change str to categorical. Default is False. False
threads int The threads parameter specifies the number of threads to use for parallel processing. By default, it is set to 1, which means no parallel processing is used. If you set threads to -1, it will use all available processors for parallel processing. 1
show_progress bool A boolean parameter that determines whether to show a progress bar during the execution of the function. If set to True, a progress bar will be displayed. If set to False, no progress bar will be shown. True
verbose bool The verbose parameter is a boolean flag that determines whether or not to display additional information and progress updates during the execution of the anomalize function. If verbose is set to True, you will see more detailed output. False
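As an illustration of the iqr_alpha behavior described above, here is a minimal sketch comparing two significance levels on synthetic data (the data construction mirrors Example 1 below; the anomaly counts will vary with the data):

# Sketch: comparing iqr_alpha sensitivities on synthetic data
import pytimetk as tk
import pandas as pd
import numpy as np

np.random.seed(42)
dates = pd.date_range(start='2021-01-01', end='2024-01-01', freq='MS')
values = np.random.randn(len(dates)) * 10 + 25
values[3] = 100  # inject an outlier

df = pd.DataFrame({'date': dates, 'value': values})

# Lower alpha -> higher threshold -> fewer points flagged
strict = tk.anomalize(df, "date", "value", iqr_alpha = 0.05)
# Higher alpha -> lower threshold -> more points flagged
loose = tk.anomalize(df, "date", "value", iqr_alpha = 0.20)

print((strict["anomaly"] == "Yes").sum(), (loose["anomaly"] == "Yes").sum())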

Returns

Type Description
pd.DataFrame Returns a pandas DataFrame containing the original data with additional columns.
- observed: original data
- seasonal: seasonal component
- seasadj: seasonally adjusted series
- trend: trend component
- remainder: residual component
- anomaly: Yes/No flag for outlier detection
- anomaly_score: distance from centerline
- anomaly_direction: -1, 0, 1 indicator for the direction of the anomaly
- recomposed_l1: lower level bound of recomposed time series
- recomposed_l2: upper level bound of recomposed time series
- observed_clean: original data with anomalies interpolated
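The returned columns can be used directly with ordinary pandas operations. A short sketch (it assumes an anomalize_df produced by a call like the ones in the Examples below):

# Sketch: working with the returned columns (assumes anomalize_df exists)
outliers = anomalize_df.query("anomaly == 'Yes'")  # rows flagged as anomalies
clean = anomalize_df[["date", "observed_clean"]]   # cleaned series for downstream use
within_bands = anomalize_df["observed"].between(   # points inside the recomposed bands
    anomalize_df["recomposed_l1"], anomalize_df["recomposed_l2"]
)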

Notes

Performance

This function uses parallel processing to speed up computation for large datasets with many time series groups:

- Parallel processing has overhead and may not be faster on small datasets.
- To use parallel processing, set threads = -1 to use all available processors.
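For example, a grouped run over many series might enable all processors; this sketch follows the pattern of Example 2 below:

# Sketch: parallel processing across many groups (dataset from Example 2)
import pytimetk as tk

df = tk.load_dataset("wikipedia_traffic_daily", parse_dates = ['date'])

anomalize_df = (
    df
        .groupby('Page', sort = False)
        .anomalize(
            date_column = "date",
            value_column = "value",
            threads = -1,          # use all available processors
            show_progress = True,  # display a progress bar
        )
)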

Examples

# EXAMPLE 1: SINGLE TIME SERIES
import pytimetk as tk
import pandas as pd
import numpy as np

# Create a date range
date_rng = pd.date_range(start='2021-01-01', end='2024-01-01', freq='MS')

# Generate some random data with a few outliers
np.random.seed(42)
data = np.random.randn(len(date_rng)) * 10 + 25  
data[3] = 100  # outlier

# Create a DataFrame
df = pd.DataFrame(date_rng, columns=['date'])
df['value'] = data

# Anomalize the data
anomalize_df = tk.anomalize(
    df, "date", "value",
    method = "twitter", 
    iqr_alpha = 0.10, 
    clean_alpha = 0.75,
    clean = "min_max",
    verbose = True,
)

anomalize_df.glimpse()
Using seasonal frequency of 12 observations
Using trend frequency of 37 observations
<class 'pandas.core.frame.DataFrame'>: 37 rows of 12 columns
date:               datetime64[ns]    [Timestamp('2021-01-01 00:00:00'), ...
observed:           float64           [29.96714153011233, 23.61735698828 ...
seasonal:           float64           [-0.8661061860247758, -7.967836480 ...
seasadj:            float64           [30.833247716137105, 31.5851934684 ...
trend:              float64           [23.205890036594397, 23.2058900365 ...
remainder:          float64           [7.627357679542708, 8.379303431869 ...
anomaly:            object            ['No', 'No', 'No', 'Yes', 'No', 'N ...
anomaly_score:      float64           [2.047735522900185, 2.799681275227 ...
anomaly_direction:  int64             [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,  ...
recomposed_l1:      float64           [10.105464911591639, 3.00373461744 ...
recomposed_l2:      float64           [45.73334710283265, 38.63161680868 ...
observed_clean:     float64           [29.96714153011233, 23.61735698828 ...
# Visualize the results
anomalize_df.plot_anomalies_decomp("date")
# Visualize the anomaly bands
(
     anomalize_df
        .plot_anomalies(
            date_column = "date",
            engine = "plotly",
        )
)
# Get the anomalies    
anomalize_df.query("anomaly=='Yes'")
date observed seasonal seasadj trend remainder anomaly anomaly_score anomaly_direction recomposed_l1 recomposed_l2 observed_clean
3 2021-04-01 100.000000 23.694997 76.305003 23.20589 53.099113 Yes 47.519491 1 34.666568 70.294450 65.840965
15 2022-04-01 19.377125 23.694997 -4.317872 23.20589 -27.523762 Yes 33.103384 -1 34.666568 70.294450 39.120053
19 2022-08-01 10.876963 3.852379 7.024584 23.20589 -16.181306 Yes 21.760928 -1 14.823950 50.451832 19.277435
27 2023-04-01 28.756980 23.694997 5.061983 23.20589 -18.143907 Yes 23.723529 -1 34.666568 70.294450 39.120053
# Visualize observed vs cleaned
anomalize_df.plot_anomalies_cleaned("date")
# EXAMPLE 2: MULTIPLE TIME SERIES
import pytimetk as tk
import pandas as pd

df = tk.load_dataset("wikipedia_traffic_daily", parse_dates = ['date'])

anomalize_df = (
    df 
        .groupby('Page', sort = False) 
        .anomalize(
            date_column = "date", 
            value_column = "value",
            method = "stl", 
            iqr_alpha = 0.025,
            verbose = False,
        )
)

# Visualize the decomposition results

(
    anomalize_df 
        .groupby("Page") 
        .plot_anomalies_decomp(
            date_column = "date", 
            width = 1800,
            height = 1000,
            x_axis_date_labels = "%Y",
            engine = 'plotly'
        )
)
# Visualize the anomaly bands
(
    anomalize_df 
        .groupby("Page") 
        .plot_anomalies(
            date_column = "date", 
            facet_ncol = 2, 
            width = 1000,
            height = 1000,
        )
)
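To keep the original data columns alongside the detection results, bind_data = True can be passed; a minimal sketch continuing from Example 2:

# Sketch: include the original data columns in the output
anomalize_bound = (
    df
        .groupby('Page', sort = False)
        .anomalize(
            date_column = "date",
            value_column = "value",
            bind_data = True,  # bind the original columns to the results
        )
)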