anomalize

anomalize(data, date_column, value_column, period=None, trend=None, method='stl', decomp='additive', clean='min_max', iqr_alpha=0.05, clean_alpha=0.75, max_anomalies=0.2, bind_data=False, reduce_memory=False, threads=1, show_progress=True, verbose=False)

Detects anomalies in time series data, either for a single time series or for multiple time series grouped by a specific column.

Parameters

Name Type Description Default
data Union[pd.DataFrame, pd.core.groupby.generic.DataFrameGroupBy] The input data, which can be either a pandas DataFrame or a pandas DataFrameGroupBy object. required
date_column str The name of the column in the data that contains the dates or timestamps. required
value_column str The name of the column in the data that contains the values to be analyzed for anomalies. required
period Optional[int] The period parameter specifies the length of the seasonal component in the time series. It is used in the decomposition process to separate the time series into its seasonal, trend, and remainder components. If not specified, the function will automatically determine the period based on the data. None
trend Optional[int] The trend parameter is an optional integer that specifies the length of the moving average window used for trend estimation. If trend is set to None, the trend window is determined automatically from the data. None
method str The method parameter determines the method used for anomaly detection. Two methods are currently available: 'stl' (the default) and 'twitter'. More anomaly detection methods may be added in upcoming releases. 'stl'
decomp str The decomp parameter specifies the type of decomposition to use for time series decomposition. It can take two values: 1. ‘additive’ - This is the default value. It specifies that the time series will be decomposed using an additive model. 2. ‘multiplicative’ - This specifies that the time series will be decomposed using a multiplicative model. 'additive'
clean str The clean parameter specifies the method used to clean the anomalies. It can take two values: 1. ‘min_max’ - This specifies that the anomalies will be cleaned using the min-max method. This method replaces the anomalies with the 0.75 * lower or upper bound of the recomposed time series, depending on the direction of the anomaly. The 0.75 multiplier can be adjusted using the clean_alpha parameter. 2. ‘linear’ - This specifies that the anomalies will be cleaned using linear interpolation. 'min_max'
iqr_alpha float The iqr_alpha parameter sets the significance level used in the interquartile range (IQR) method for outlier detection, which determines the detection threshold. The default value is 0.05, which corresponds to a 5% significance level. A lower significance level widens the threshold, so fewer outliers are detected; a higher significance level narrows it, so more outliers are detected. A sketch of the IQR rule follows this parameter list. 0.05
clean_alpha float The clean_alpha parameter is used to determine the threshold for cleaning the outliers. The default is 0.75, which means that the anomalies will be cleaned using the 0.75 * lower or upper bound of the recomposed time series, depending on the direction of the anomaly. 0.75
max_anomalies float The max_anomalies parameter specifies the maximum percentage of the data that may be flagged as anomalous, as a float between 0 and 1. For example, if max_anomalies is set to 0.2, at most 20% of the observations will be identified as outliers. The default value is 0.2. 0.2
bind_data bool The bind_data parameter determines whether the original data will be included in the output. If set to True, the original columns are bound to the output dataframe. If set to False, only the anomalize result columns are returned. False
reduce_memory bool The reduce_memory parameter specifies whether to reduce the memory usage of the DataFrame by converting int and float columns to smaller dtypes and str columns to categorical. This reduces memory for large data but may reduce float precision and will change str to categorical. The default is False. False
threads int The threads parameter specifies the number of threads to use for parallel processing. By default, it is set to 1, which means no parallel processing is used. If you set threads to -1, it will use all available processors for parallel processing. 1
show_progress bool A boolean parameter that determines whether to show a progress bar during the execution of the function. If set to True, a progress bar will be displayed. If set to False, no progress bar will be shown. True
verbose bool The verbose parameter is a boolean flag that determines whether to display additional information and progress updates during the execution of the anomalize function. If verbose is set to True, you will see more detailed output. False
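
For intuition on how iqr_alpha shapes the detection threshold, here is a minimal, self-contained sketch of a conventional IQR outlier rule applied to a remainder series. This illustrates the general technique, not pytimetk's exact internals; in particular, the alpha-to-multiplier mapping (0.15 / alpha, so the default 0.05 gives a 3 * IQR fence) is an assumption borrowed from the R anomalize package that inspired this function.

import numpy as np
import pandas as pd

def iqr_outliers(remainder: pd.Series, alpha: float = 0.05) -> pd.Series:
    # Fence at the quartiles +/- a multiple of the IQR. A smaller alpha
    # widens the fence (fewer outliers); a larger alpha narrows it (more).
    q1, q3 = remainder.quantile(0.25), remainder.quantile(0.75)
    mult = 0.15 / alpha   # assumed alpha-to-multiplier mapping
    iqr = q3 - q1
    return (remainder < q1 - mult * iqr) | (remainder > q3 + mult * iqr)

# Flag an injected spike in otherwise well-behaved residuals
rng = np.random.default_rng(0)
resid = pd.Series(rng.normal(0, 1, 200))
resid.iloc[10] = 8.0
print(iqr_outliers(resid).sum())  # flags the injected spike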

Returns

Type Description
pd.DataFrame Returns a pandas DataFrame containing the original data with additional columns.
- observed: original data
- seasonal: seasonal component
- seasadj: seasonally adjusted series
- trend: trend component
- remainder: residual component
- anomaly: Yes/No flag for outlier detection
- anomaly_score: distance from the centerline
- anomaly_direction: -1, 0, 1 indicator for the direction of the anomaly
- recomposed_l1: lower bound of the recomposed time series
- recomposed_l2: upper bound of the recomposed time series
- observed_clean: original data with anomalies cleaned
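
In the examples below, the anomaly flag coincides with observations falling outside the [recomposed_l1, recomposed_l2] band. A quick sanity check of that relationship on any anomalize output (treat this as an inspection aid, not a guaranteed invariant):

# Compare the 'anomaly' flag against the recomposed bands
# (assumes anomalize_df is a DataFrame returned by tk.anomalize)
outside = (
    (anomalize_df["observed"] < anomalize_df["recomposed_l1"])
    | (anomalize_df["observed"] > anomalize_df["recomposed_l2"])
)
print((outside == (anomalize_df["anomaly"] == "Yes")).mean())  # fraction agreeing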

Notes

Performance

This function uses parallel processing to speed up computation for large datasets with many time series groups:

- Parallel processing has overhead and may not be faster on small datasets.
- To use parallel processing, set threads = -1 to use all available processors.
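
As a sketch of what this looks like in practice, the call below mirrors Example 2 (further down) but enables all processors; timings will depend on your data and hardware:

import pytimetk as tk

df = tk.load_dataset("wikipedia_traffic_daily", parse_dates = ['date'])

# Parallelize across the 'Page' groups
anomalize_df = (
    df
        .groupby('Page', sort = False)
        .anomalize(
            date_column = "date",
            value_column = "value",
            threads = -1,         # -1 = all processors; 1 = no parallelism
            show_progress = True,
        )
)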

Examples

# EXAMPLE 1: SINGLE TIME SERIES
import pytimetk as tk
import pandas as pd
import numpy as np

# Create a date range
date_rng = pd.date_range(start='2021-01-01', end='2024-01-01', freq='MS')

# Generate some random data with a few outliers
np.random.seed(42)
data = np.random.randn(len(date_rng)) * 10 + 25  
data[3] = 100  # outlier

# Create a DataFrame
df = pd.DataFrame(date_rng, columns=['date'])
df['value'] = data

# Anomalize the data
anomalize_df = tk.anomalize(
    df, "date", "value",
    method = "twitter", 
    iqr_alpha = 0.10, 
    clean_alpha = 0.75,
    clean = "min_max",
    verbose = True,
)
Using seasonal frequency of 12 observations
Using trend frequency of 37 observations

anomalize_df.glimpse()
<class 'pandas.core.frame.DataFrame'>: 37 rows of 12 columns
date:               datetime64[ns]    [Timestamp('2021-01-01 00:00:00'), ...
observed:           float64           [29.96714153011233, 23.61735698828 ...
seasonal:           float64           [-0.8661061860247741, -7.967836480 ...
seasadj:            float64           [30.8332477161371, 31.585193468464 ...
trend:              float64           [23.205890036594397, 23.2058900365 ...
remainder:          float64           [7.627357679542705, 8.379303431869 ...
anomaly:            object            ['No', 'No', 'No', 'Yes', 'No', 'N ...
anomaly_score:      float64           [2.0477355229001812, 2.79968127522 ...
anomaly_direction:  int64             [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,  ...
recomposed_l1:      float64           [10.105464911591643, 3.00373461744 ...
recomposed_l2:      float64           [45.73334710283265, 38.63161680868 ...
observed_clean:     float64           [29.96714153011233, 23.61735698828 ...
# Visualize the results
anomalize_df.plot_anomalies_decomp("date")
# Visualize the anomaly bands
(
     anomalize_df
        .plot_anomalies(
            date_column = "date",
            engine = "plotly",
        )
)
# Get the anomalies    
anomalize_df.query("anomaly=='Yes'")
date observed seasonal seasadj trend remainder anomaly anomaly_score anomaly_direction recomposed_l1 recomposed_l2 observed_clean
3 2021-04-01 100.000000 23.694997 76.305003 23.20589 53.099113 Yes 47.519491 1 34.666568 70.294450 65.840965
15 2022-04-01 19.377125 23.694997 -4.317872 23.20589 -27.523762 Yes 33.103384 -1 34.666568 70.294450 39.120053
19 2022-08-01 10.876963 3.852379 7.024584 23.20589 -16.181306 Yes 21.760928 -1 14.823950 50.451832 19.277435
27 2023-04-01 28.756980 23.694997 5.061983 23.20589 -18.143907 Yes 23.723529 -1 34.666568 70.294450 39.120053
# Visualize observed vs cleaned
anomalize_df.plot_anomalies_cleaned("date")
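# The clean parameter also accepts "linear" (see Parameters above). As a
# sketch, this reruns the same call with linear interpolation, which replaces
# anomalies by interpolating between neighboring observations instead of
# snapping to the clean_alpha-scaled bands:
anomalize_linear_df = tk.anomalize(
    df, "date", "value",
    method = "twitter",
    iqr_alpha = 0.10,
    clean = "linear",
)

# Compare the two cleaning strategies on the flagged rows
anomalize_linear_df.query("anomaly=='Yes'")[["date", "observed", "observed_clean"]]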
# EXAMPLE 2: MULTIPLE TIME SERIES
import pytimetk as tk
import pandas as pd

df = tk.load_dataset("wikipedia_traffic_daily", parse_dates = ['date'])

anomalize_df = (
    df 
        .groupby('Page', sort = False) 
        .anomalize(
            date_column = "date", 
            value_column = "value",
            method = "stl", 
            iqr_alpha = 0.025,
            verbose = False,
        )
)

# Visualize the decomposition results

(
    anomalize_df 
        .groupby("Page") 
        .plot_anomalies_decomp(
            date_column = "date", 
            width = 1800,
            height = 1000,
            x_axis_date_labels = "%Y",
            engine = 'plotly'
        )
)
# Visualize the anomaly bands
(
    anomalize_df 
        .groupby("Page") 
        .plot_anomalies(
            date_column = "date", 
            facet_ncol = 2, 
            width = 1000,
            height = 1000,
        )
)
# Get the anomalies    
anomalize_df.query("anomaly=='Yes'")
Page date observed seasonal seasadj trend remainder anomaly anomaly_score anomaly_direction recomposed_l1 recomposed_l2 observed_clean
22 Death_of_Freddie_Gray_en.wikipedia.org_mobile-... 2015-07-23 2862 -84.812735 2946.812735 662.785639 2284.027096 Yes 2161.902398 1 -824.541763 2224.736968 1843.577127
23 Death_of_Freddie_Gray_en.wikipedia.org_mobile-... 2015-07-24 3215 -29.111150 3244.111150 659.814714 2584.296436 Yes 2462.171737 1 -771.811103 2277.467629 1896.307787
63 Death_of_Freddie_Gray_en.wikipedia.org_mobile-... 2015-09-02 6870 -11.180597 6881.180597 534.465753 6346.714844 Yes 6224.590145 1 -879.229510 2170.049221 1788.889380
64 Death_of_Freddie_Gray_en.wikipedia.org_mobile-... 2015-09-03 3480 -22.443332 3502.443332 530.613560 2971.829772 Yes 2849.705073 1 -894.344439 2154.934293 1773.774451
69 Death_of_Freddie_Gray_en.wikipedia.org_mobile-... 2015-09-08 8621 102.238093 8518.761907 512.499412 8006.262495 Yes 7884.137796 1 -787.777161 2261.501570 1880.341729
... ... ... ... ... ... ... ... ... ... ... ... ... ...
5325 Яшин,_Лев_Иванович_ru.wikipedia.org_mobile-web... 2016-07-10 3735 -96.655348 3831.655348 629.437128 3202.218220 Yes 3148.002144 1 -76.531003 1250.526716 1084.644501
5429 Яшин,_Лев_Иванович_ru.wikipedia.org_mobile-web... 2016-10-22 5115 -28.240168 5143.240168 437.912860 4705.327308 Yes 4651.111232 1 -199.640091 1127.417628 961.535413
5430 Яшин,_Лев_Иванович_ru.wikipedia.org_mobile-web... 2016-10-23 2654 243.593258 2410.406742 438.910163 1971.496580 Yes 1917.280503 1 73.190638 1400.248356 1234.366142
5431 Яшин,_Лев_Иванович_ru.wikipedia.org_mobile-web... 2016-10-24 1469 -85.588736 1554.588736 439.774159 1114.814577 Yes 1060.598500 1 -255.127359 1071.930359 906.048145
5483 Яшин,_Лев_Иванович_ru.wikipedia.org_mobile-web... 2016-12-15 1455 -27.070417 1482.070417 357.544096 1124.526321 Yes 1070.310244 1 -278.839103 1048.218616 882.336401

245 rows × 13 columns

# Visualize observed vs cleaned
(
    anomalize_df
        .groupby("Page") 
        .plot_anomalies_cleaned(
            "date", 
            facet_ncol = 2, 
            width = 1000,
            height = 1000,
        )
)
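# BONUS: BIND THE ORIGINAL DATA
# A minimal sketch using bind_data = True (documented above) to keep the
# original columns alongside the anomalize output; anomalize_bound_df is an
# illustrative name:
anomalize_bound_df = (
    df
        .groupby('Page', sort = False)
        .anomalize(
            date_column = "date",
            value_column = "value",
            bind_data = True,    # include the original data in the output
        )
)

anomalize_bound_df.glimpse()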