anomalize

anomalize(data, date_column, value_column, period=None, trend=None, method='stl', decomp='additive', clean='min_max', iqr_alpha=0.05, clean_alpha=0.75, max_anomalies=0.2, bind_data=False, reduce_memory=False, threads=1, show_progress=True, verbose=False)

Detects anomalies in time series data, either for a single time series or for multiple time series grouped by a specific column.

Parameters

Name Type Description Default
data Union[pd.DataFrame, pd.core.groupby.generic.DataFrameGroupBy] The input data, which can be either a pandas DataFrame or a pandas DataFrameGroupBy object. required
date_column str The name of the column in the data that contains the dates or timestamps. required
value_column str The name of the column in the data that contains the values to be analyzed for anomalies. required
period Optional[int] The period parameter specifies the length of the seasonal component in the time series. It is used in the decomposition process to separate the time series into its seasonal, trend, and remainder components. If not specified, the function will automatically determine the period based on the data. None
trend Optional[int] The trend parameter is an optional integer that specifies the length of the moving average window used for trend estimation. If trend is set to None, the function will automatically determine the trend window based on the data. None
method str The method parameter determines the method used for anomaly detection. The available methods are 'stl' (the default) and 'twitter'. More anomaly detection methods will be added in upcoming releases. 'stl'
decomp str The decomp parameter specifies the type of decomposition to use for time series decomposition. It can take two values: 1. 'additive' - This is the default value. It specifies that the time series will be decomposed using an additive model. 2. 'multiplicative' - This specifies that the time series will be decomposed using a multiplicative model. 'additive'
clean str The clean parameter specifies the method used to clean the anomalies. It can take two values: 1. 'min_max' - This specifies that the anomalies will be cleaned using the min-max method, which replaces each anomaly with 0.75 * the lower or upper bound of the recomposed time series, depending on the direction of the anomaly. The 0.75 multiplier can be adjusted using the clean_alpha parameter. 2. 'linear' - This specifies that the anomalies will be cleaned using linear interpolation. 'min_max'
iqr_alpha float The iqr_alpha parameter is used to determine the threshold for detecting outliers. It is the significance level used in the interquartile range (IQR) method for outlier detection. - The default value is 0.05, which corresponds to a 5% significance level. - A lower significance level will result in a higher threshold, which means fewer outliers will be detected. - A higher significance level will result in a lower threshold, which means more outliers will be detected. See the sketch after this table. 0.05
clean_alpha float The clean_alpha parameter is used to determine the threshold for cleaning the outliers. The default is 0.75, which means that the anomalies will be cleaned using the 0.75 * lower or upper bound of the recomposed time series, depending on the direction of the anomaly. 0.75
max_anomalies float The max_anomalies parameter is used to specify the maximum percentage of anomalies allowed in the data. It is a float value between 0 and 1. For example, if max_anomalies is set to 0.2, it means that the function will identify and remove outliers until the percentage of outliers in the data is less than or equal to 20%. The default value is 0.2. 0.2
bind_data bool The bind_data parameter determines whether the original data will be included in the output. If set to True, the original data columns will be included in the output dataframe. If set to False, only the anomaly detection results will be included. False
reduce_memory bool The reduce_memory parameter specifies whether to reduce the memory usage of the DataFrame by converting int and float columns to smaller dtypes and str columns to categorical. This reduces memory for large data but may reduce float precision and will change str to categorical. Default is False. False
threads int The threads parameter specifies the number of threads to use for parallel processing. By default, it is set to 1, which means no parallel processing is used. If you set threads to -1, it will use all available processors for parallel processing. 1
show_progress bool A boolean parameter that determines whether to show a progress bar during the execution of the function. If set to True, a progress bar will be displayed. If set to False, no progress bar will be shown. True
verbose bool The verbose parameter is a boolean flag that determines whether or not to display additional information and progress updates during the execution of the anomalize function. If verbose is set to True, you will see more detailed output. False
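As an illustration of the iqr_alpha behavior described above, here is a minimal sketch comparing two significance levels on synthetic data (the data construction mirrors Example 1 below; the anomaly counts will vary with the data):

# Sketch: comparing iqr_alpha sensitivities on synthetic data
import pytimetk as tk
import pandas as pd
import numpy as np

np.random.seed(42)
dates = pd.date_range(start='2021-01-01', end='2024-01-01', freq='MS')
values = np.random.randn(len(dates)) * 10 + 25
values[3] = 100  # inject an outlier

df = pd.DataFrame({'date': dates, 'value': values})

# Lower alpha -> higher threshold -> fewer points flagged
strict = tk.anomalize(df, "date", "value", iqr_alpha = 0.05)
# Higher alpha -> lower threshold -> more points flagged
loose = tk.anomalize(df, "date", "value", iqr_alpha = 0.20)

print((strict["anomaly"] == "Yes").sum(), (loose["anomaly"] == "Yes").sum())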

Returns

Type Description
pd.DataFrame Returns a pandas DataFrame containing the original data with additional columns.
- observed: original data
- seasonal: seasonal component
- seasadj: seasonally adjusted series
- trend: trend component
- remainder: residual component
- anomaly: Yes/No flag for outlier detection
- anomaly_score: distance from centerline
- anomaly_direction: -1, 0, 1 indicator for the direction of the anomaly
- recomposed_l1: lower level bound of recomposed time series
- recomposed_l2: upper level bound of recomposed time series
- observed_clean: original data with anomalies interpolated
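The returned columns can be used directly with ordinary pandas operations. A short sketch (it assumes an anomalize_df produced by a call like the ones in the Examples below):

# Sketch: working with the returned columns (assumes anomalize_df exists)
outliers = anomalize_df.query("anomaly == 'Yes'")  # rows flagged as anomalies
clean = anomalize_df[["date", "observed_clean"]]   # cleaned series for downstream use
within_bands = anomalize_df["observed"].between(   # points inside the recomposed bands
    anomalize_df["recomposed_l1"], anomalize_df["recomposed_l2"]
)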

Notes

Performance

This function uses parallel processing to speed up computation for large datasets with many time series groups:

- Parallel processing has overhead and may not be faster on small datasets.
- To use parallel processing, set threads = -1 to use all available processors.
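For example, a grouped run over many series might enable all processors; this sketch follows the pattern of Example 2 below:

# Sketch: parallel processing across many groups (dataset from Example 2)
import pytimetk as tk

df = tk.load_dataset("wikipedia_traffic_daily", parse_dates = ['date'])

anomalize_df = (
    df
        .groupby('Page', sort = False)
        .anomalize(
            date_column = "date",
            value_column = "value",
            threads = -1,          # use all available processors
            show_progress = True,  # display a progress bar
        )
)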

Examples

# EXAMPLE 1: SINGLE TIME SERIES
import pytimetk as tk
import pandas as pd
import numpy as np

# Create a date range
date_rng = pd.date_range(start='2021-01-01', end='2024-01-01', freq='MS')

# Generate some random data with a few outliers
np.random.seed(42)
data = np.random.randn(len(date_rng)) * 10 + 25  
data[3] = 100  # outlier

# Create a DataFrame
df = pd.DataFrame(date_rng, columns=['date'])
df['value'] = data

# Anomalize the data
anomalize_df = tk.anomalize(
    df, "date", "value",
    method = "twitter", 
    iqr_alpha = 0.10, 
    clean_alpha = 0.75,
    clean = "min_max",
    verbose = True,
)

anomalize_df.glimpse()
Using seasonal frequency of 12 observations
Using trend frequency of 37 observations
<class 'pandas.core.frame.DataFrame'>: 37 rows of 12 columns
date:               datetime64[ns]    [Timestamp('2021-01-01 00:00:00'), ...
observed:           float64           [29.96714153011233, 23.61735698828 ...
seasonal:           float64           [-0.8661061860247758, -7.967836480 ...
seasadj:            float64           [30.833247716137105, 31.5851934684 ...
trend:              float64           [23.205890036594397, 23.2058900365 ...
remainder:          float64           [7.627357679542708, 8.379303431869 ...
anomaly:            object            ['No', 'No', 'No', 'Yes', 'No', 'N ...
anomaly_score:      float64           [2.047735522900185, 2.799681275227 ...
anomaly_direction:  int64             [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,  ...
recomposed_l1:      float64           [10.105464911591639, 3.00373461744 ...
recomposed_l2:      float64           [45.73334710283265, 38.63161680868 ...
observed_clean:     float64           [29.96714153011233, 23.61735698828 ...
# Visualize the results
anomalize_df.plot_anomalies_decomp("date")
# Visualize the anomaly bands
(
     anomalize_df
        .plot_anomalies(
            date_column = "date",
            engine = "plotly",
        )
)
# Get the anomalies    
anomalize_df.query("anomaly=='Yes'")
date observed seasonal seasadj trend remainder anomaly anomaly_score anomaly_direction recomposed_l1 recomposed_l2 observed_clean
3 2021-04-01 100.000000 23.694997 76.305003 23.20589 53.099113 Yes 47.519491 1 34.666568 70.294450 65.840965
15 2022-04-01 19.377125 23.694997 -4.317872 23.20589 -27.523762 Yes 33.103384 -1 34.666568 70.294450 39.120053
19 2022-08-01 10.876963 3.852379 7.024584 23.20589 -16.181306 Yes 21.760928 -1 14.823950 50.451832 19.277435
27 2023-04-01 28.756980 23.694997 5.061983 23.20589 -18.143907 Yes 23.723529 -1 34.666568 70.294450 39.120053
# Visualize observed vs cleaned
anomalize_df.plot_anomalies_cleaned("date")
# EXAMPLE 2: MULTIPLE TIME SERIES
import pytimetk as tk
import pandas as pd

df = tk.load_dataset("wikipedia_traffic_daily", parse_dates = ['date'])

anomalize_df = (
    df 
        .groupby('Page', sort = False) 
        .anomalize(
            date_column = "date", 
            value_column = "value",
            method = "stl", 
            iqr_alpha = 0.025,
            verbose = False,
        )
)

# Visualize the decomposition results

(
    anomalize_df 
        .groupby("Page") 
        .plot_anomalies_decomp(
            date_column = "date", 
            width = 1800,
            height = 1000,
            x_axis_date_labels = "%Y",
            engine = 'plotly'
        )
)
# Visualize the anomaly bands
(
    anomalize_df 
        .groupby("Page") 
        .plot_anomalies(
            date_column = "date", 
            facet_ncol = 2, 
            width = 1000,
            height = 1000,
        )
)
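To keep the original data columns alongside the detection results, bind_data = True can be passed; a minimal sketch continuing from Example 2:

# Sketch: include the original data columns in the output
anomalize_bound = (
    df
        .groupby('Page', sort = False)
        .anomalize(
            date_column = "date",
            value_column = "value",
            bind_data = True,  # bind the original columns to the results
        )
)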