Adding Features (Augmenting)

This section covers the augment family of functions, which add time series features to a dataset. We’ll cover how to use the following functions:

augment_lags()
augment_leads()
augment_rolling()
augment_timeseries_signature()
augment_holiday_signature()
augment_fourier()

1 Augment Lags / Leads

Lags are commonly used in time series forecasting to incorporate the past values of a feature as predictors. Leads are less common than lags, but they can be useful when you want to predict a future value from other future values.
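Conceptually, a lag or lead is a shift of the value column along the time axis. The sketch below uses plain pandas rather than pytimetk, with the sample values from section 1.1 hard-coded so the result is reproducible. The names df_demo and lagged are illustrative, and this is not necessarily how augment_lags() and augment_leads() are implemented internally.

```python
import pandas as pd

# Sample values hard-coded (section 1.1 draws them at random)
df_demo = pd.DataFrame({
    "date": pd.date_range(start="2023-09-18", end="2023-09-24"),
    "value": [25, 50, 49, 45, 48, 18, 18],
})

# A positive shift moves past values forward in time (lag);
# a negative shift pulls future values back (lead).
lagged = df_demo.assign(
    value_lag_3=df_demo["value"].shift(3),
    value_lead_3=df_demo["value"].shift(-3),
)

print(lagged)
```

The first three rows of the lag column and the last three rows of the lead column are NaN, because no observation exists three days before or after those dates.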

Help Doc Info: augment_lags(), augment_leads()

Use help(tk.augment_lags) and help(tk.augment_leads) to review additional helpful documentation.

1.1 Basic Examples

Add 1 or more lags / leads to a dataset:

Code

# import libraries
import pytimetk as tk
import pandas as pd
import numpy as np
import random

# create sample data
dates = pd.date_range(start = '2023-09-18', end = '2023-09-24')
values = [random.randint(10, 50) for _ in range(7)]

df = pd.DataFrame({
    'date': dates,
    'value': values
})

df

	date	value
0	2023-09-18	25
1	2023-09-19	50
2	2023-09-20	49
3	2023-09-21	45
4	2023-09-22	48
5	2023-09-23	18
6	2023-09-24	18

Create lag / lead of 3 days:

Lag
Lead

Code

# augment lag
df \
    .augment_lags(
        date_column  = 'date',
        value_column = 'value',
        lags         = 3
    )

	date	value	value_lag_3
0	2023-09-18	25	NaN
1	2023-09-19	50	NaN
2	2023-09-20	49	NaN
3	2023-09-21	45	25.0
4	2023-09-22	48	50.0
5	2023-09-23	18	49.0
6	2023-09-24	18	45.0

Code

# augment leads
df \
    .augment_leads(
        date_column  = 'date',
        value_column = 'value',
        leads        = 3
    )

	date	value	value_lead_3
0	2023-09-18	25	45.0
1	2023-09-19	50	48.0
2	2023-09-20	49	18.0
3	2023-09-21	45	18.0
4	2023-09-22	48	NaN
5	2023-09-23	18	NaN
6	2023-09-24	18	NaN

We can create multiple lag / lead values for a single time series:

Lag
Lead

Code

# multiple lagged values for a single time series
df \
    .augment_lags(
        date_column  = 'date',
        value_column = 'value',
        lags         = (1, 3)
    )

	date	value	value_lag_1	value_lag_2	value_lag_3
0	2023-09-18	25	NaN	NaN	NaN
1	2023-09-19	50	25.0	NaN	NaN
2	2023-09-20	49	50.0	25.0	NaN
3	2023-09-21	45	49.0	50.0	25.0
4	2023-09-22	48	45.0	49.0	50.0
5	2023-09-23	18	48.0	45.0	49.0
6	2023-09-24	18	18.0	48.0	45.0

Code

# multiple leads values for a single time series
df \
    .augment_leads(
        date_column  = 'date',
        value_column = 'value',
        leads        = (1, 3)
    )

	date	value	value_lead_1	value_lead_2	value_lead_3
0	2023-09-18	25	50.0	49.0	45.0
1	2023-09-19	50	49.0	45.0	48.0
2	2023-09-20	49	45.0	48.0	18.0
3	2023-09-21	45	48.0	18.0	18.0
4	2023-09-22	48	18.0	18.0	NaN
5	2023-09-23	18	18.0	NaN	NaN
6	2023-09-24	18	NaN	NaN	NaN
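Judging from the output above, a tuple such as (1, 3) is treated as an inclusive range and yields lags 1, 2, and 3. The following plain-pandas sketch mimics that behavior with a hypothetical helper, add_lags(); its handling of integers and lists is an assumption made for illustration, not a description of the pytimetk API.

```python
import pandas as pd

def add_lags(data, column, lags):
    """Hypothetical helper: add one shifted column per requested lag."""
    if isinstance(lags, int):
        lag_list = [lags]
    elif isinstance(lags, tuple):
        # Assumption inferred from the output above: (start, end) is inclusive
        lag_list = list(range(lags[0], lags[1] + 1))
    else:
        lag_list = list(lags)

    out = data.copy()
    for lag in lag_list:
        out[f"{column}_lag_{lag}"] = out[column].shift(lag)
    return out

sample = pd.DataFrame({
    "date": pd.date_range(start="2023-09-18", end="2023-09-24"),
    "value": [25, 50, 49, 45, 48, 18, 18],
})

multi_lag = add_lags(sample, "value", (1, 3))
print(multi_lag)
```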

1.2 Augment Lags / Leads For Grouped Time Series

augment_lags() and augment_leads() also work for grouped time series data. Let’s use the m4_daily_df dataset to showcase examples:

Code

# load m4_daily_df
m4_daily_df = tk.load_dataset('m4_daily', parse_dates = ['date'])

Lag
Lead

Code

# augment lags for grouped time series
m4_daily_df \
    .groupby("id") \
    .augment_lags(
        date_column  = 'date',
        value_column = 'value',
        lags         = (1, 7)
    )

	id	date	value	value_lag_1	value_lag_2	value_lag_3	value_lag_4	value_lag_5	value_lag_6	value_lag_7
0	D10	2014-07-03	2076.2	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	D10	2014-07-04	2073.4	2076.2	NaN	NaN	NaN	NaN	NaN	NaN
2	D10	2014-07-05	2048.7	2073.4	2076.2	NaN	NaN	NaN	NaN	NaN
3	D10	2014-07-06	2048.9	2048.7	2073.4	2076.2	NaN	NaN	NaN	NaN
4	D10	2014-07-07	2006.4	2048.9	2048.7	2073.4	2076.2	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...
9738	D500	2012-09-19	9418.8	9431.9	9437.7	9474.6	9359.2	9286.9	9265.4	9091.4
9739	D500	2012-09-20	9365.7	9418.8	9431.9	9437.7	9474.6	9359.2	9286.9	9265.4
9740	D500	2012-09-21	9445.9	9365.7	9418.8	9431.9	9437.7	9474.6	9359.2	9286.9
9741	D500	2012-09-22	9497.9	9445.9	9365.7	9418.8	9431.9	9437.7	9474.6	9359.2
9742	D500	2012-09-23	9545.3	9497.9	9445.9	9365.7	9418.8	9431.9	9437.7	9474.6

9743 rows × 10 columns

Code

# augment leads for grouped time series
m4_daily_df \
    .groupby("id") \
    .augment_leads(
        date_column  = 'date',
        value_column = 'value',
        leads        = (1, 7)
    )

	id	date	value	value_lead_1	value_lead_2	value_lead_3	value_lead_4	value_lead_5	value_lead_6	value_lead_7
0	D10	2014-07-03	2076.2	2073.4	2048.7	2048.9	2006.4	2017.6	2019.1	2007.4
1	D10	2014-07-04	2073.4	2048.7	2048.9	2006.4	2017.6	2019.1	2007.4	2010.0
2	D10	2014-07-05	2048.7	2048.9	2006.4	2017.6	2019.1	2007.4	2010.0	2001.5
3	D10	2014-07-06	2048.9	2006.4	2017.6	2019.1	2007.4	2010.0	2001.5	1978.8
4	D10	2014-07-07	2006.4	2017.6	2019.1	2007.4	2010.0	2001.5	1978.8	1988.3
...	...	...	...	...	...	...	...	...	...	...
9738	D500	2012-09-19	9418.8	9365.7	9445.9	9497.9	9545.3	NaN	NaN	NaN
9739	D500	2012-09-20	9365.7	9445.9	9497.9	9545.3	NaN	NaN	NaN	NaN
9740	D500	2012-09-21	9445.9	9497.9	9545.3	NaN	NaN	NaN	NaN	NaN
9741	D500	2012-09-22	9497.9	9545.3	NaN	NaN	NaN	NaN	NaN	NaN
9742	D500	2012-09-23	9545.3	NaN	NaN	NaN	NaN	NaN	NaN	NaN

9743 rows × 10 columns
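The key property of grouped lags and leads is that values never cross from one series into another: each id starts with its own NaN rows. A minimal pandas sketch of the same idea, using a small made-up panel (the ids "A" and "B" and their values are invented for illustration):

```python
import pandas as pd

# Tiny made-up panel with two series
panel = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"] * 2),
    "value": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Sort so that shifting follows time order within each group
panel = panel.sort_values(["id", "date"]).reset_index(drop=True)

# Shift within each group; the first row of "B" does not inherit from "A"
panel["value_lag_1"] = panel.groupby("id")["value"].shift(1)
panel["value_lead_1"] = panel.groupby("id")["value"].shift(-1)

print(panel)
```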

2 Augment Rolling

A Rolling Window refers to a specific-sized subset of time series data that moves sequentially over the dataset.

Rolling windows play a crucial role in time series forecasting due to their ability to smooth out data, highlight seasonality, and detect anomalies.

augment_rolling() applies multiple rolling window functions with varying window sizes to time series data.

Help Doc Info: augment_rolling()

Use help(tk.augment_rolling) to review additional helpful documentation.

2.1 Basic Examples

We’ll continue with the use of our sample df created earlier:

Code

# window = 3 days, window function = mean
df \
    .augment_rolling(
        date_column  = 'date',
        value_column = 'value',
        window       = 3,
        window_func  = 'mean'
    )

	date	value	value_rolling_mean_win_3
0	2023-09-18	25	NaN
1	2023-09-19	50	NaN
2	2023-09-20	49	41.333333
3	2023-09-21	45	48.000000
4	2023-09-22	48	47.333333
5	2023-09-23	18	37.000000
6	2023-09-24	18	28.000000

It is important to understand how the center parameter in augment_rolling() works.

center

When set to True, the rolling window is centered: each result is aligned with the middle observation of its window. When set to False (default), the rolling window is not centered: each result is aligned with the last observation of its window.

Let’s see an example:

Augment Rolling: Center = True
Augment Rolling: Center = False

Code

# augment rolling: center = true
df \
    .augment_rolling(
        date_column  = 'date',
        value_column = 'value',
        window       = 3,
        window_func  = 'mean',
        center       = True
    )

	date	value	value_rolling_mean_win_3
0	2023-09-18	25	NaN
1	2023-09-19	50	41.333333
2	2023-09-20	49	48.000000
3	2023-09-21	45	47.333333
4	2023-09-22	48	37.000000
5	2023-09-23	18	28.000000
6	2023-09-24	18	NaN

Code

# augment rolling: center = false
df \
    .augment_rolling(
        date_column  = 'date',
        value_column = 'value',
        window       = 3,
        window_func  = 'mean',
        center       = False
    )

	date	value	value_rolling_mean_win_3
0	2023-09-18	25	NaN
1	2023-09-19	50	NaN
2	2023-09-20	49	41.333333
3	2023-09-21	45	48.000000
4	2023-09-22	48	47.333333
5	2023-09-23	18	37.000000
6	2023-09-24	18	28.000000

Note that we are using a 3-day rolling window and applying the mean to value. In simpler terms, value_rolling_mean_win_3 is a 3-day rolling average of value. With center set to False, the first computed mean appears at 2023-09-20, the first date with three observations in its window; 2023-09-18 and 2023-09-19 are NaN because their windows are incomplete. With center set to True, each mean is aligned with the middle date of its window, so the first and last rows are NaN instead.
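The two alignments can be reproduced with pandas' own rolling API, which makes the effect of center easy to check by hand. This sketch hard-codes the sample values shown in the tables above; the variable names are illustrative.

```python
import pandas as pd

values = pd.Series(
    [25, 50, 49, 45, 48, 18, 18],
    index=pd.date_range(start="2023-09-18", periods=7),
    name="value",
)

# Result aligned with the last observation of each 3-day window
trailing = values.rolling(window=3, center=False).mean()

# Result aligned with the middle observation of each 3-day window
centered = values.rolling(window=3, center=True).mean()

print(pd.DataFrame({"value": values, "trailing": trailing, "centered": centered}))
```

The first window covers 2023-09-18 to 2023-09-20 and averages (25 + 50 + 49) / 3 ≈ 41.33. The trailing version reports that number on 2023-09-20, while the centered version reports it on 2023-09-19.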

2.2 Augment Rolling with Multiple Windows and Window Functions

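Multiple window sizes and multiple window functions can be passed to the window and window_func parameters, respectively. One column is created for each combination of window size and function: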

Code

# augment rolling: window of 2 & 7 days, window_func of mean and standard deviation
m4_daily_df \
    .query('id == "D10"') \
    .augment_rolling(
                date_column = 'date',
                value_column = 'value',
                window = [2,7],
                window_func = ['mean', ('std', lambda x: x.std())]
            )

	id	date	value	value_rolling_mean_win_2	value_rolling_std_win_2	value_rolling_mean_win_7	value_rolling_std_win_7
0	D10	2014-07-03	2076.2	NaN	NaN	NaN	NaN
1	D10	2014-07-04	2073.4	2074.80	1.40	2074.800000	1.400000
2	D10	2014-07-05	2048.7	2061.05	12.35	2066.100000	12.356645
3	D10	2014-07-06	2048.9	2048.80	0.10	2061.800000	13.037830
4	D10	2014-07-07	2006.4	2027.65	21.25	2050.720000	25.041038
...	...	...	...	...	...	...	...
669	D10	2016-05-02	2630.7	2615.85	14.85	2579.471429	28.868159
670	D10	2016-05-03	2649.3	2640.00	9.30	2594.800000	33.081631
671	D10	2016-05-04	2631.8	2640.55	8.75	2601.371429	35.145563
672	D10	2016-05-05	2622.5	2627.15	4.65	2607.457143	34.584508
673	D10	2016-05-06	2620.1	2621.30	1.20	2618.328571	22.923270

674 rows × 7 columns
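The output contains one column per combination of window size and window function. A rough pandas sketch of that cross product follows; the data, the choice of mean and max, and the naming loop are all illustrative assumptions. Note that pandas' defaults may differ from pytimetk's: here, rows with an incomplete window are NaN.

```python
import pandas as pd

series = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0, 13.0], name="value")
features = pd.DataFrame({"value": series})

# Map a column-name suffix to a function applied to the rolling object
window_funcs = {
    "mean": lambda roller: roller.mean(),
    "max": lambda roller: roller.max(),
}

# One new column for every (window, function) pair
for window in [2, 3]:
    roller = series.rolling(window=window)
    for name, func in window_funcs.items():
        features[f"value_rolling_{name}_win_{window}"] = func(roller)

print(features)
```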

2.3 Augment Rolling with Grouped Time Series

augment_rolling() can also be used on grouped time series data:

Code

## augment rolling on grouped time series: window of 2 & 7 days, window_func of mean and standard deviation
m4_daily_df \
    .groupby('id') \
    .augment_rolling(
                date_column = 'date',
                value_column = 'value',
                window = [2,7],
                window_func = ['mean', ('std', lambda x: x.std())]
            )

	id	date	value	value_rolling_mean_win_2	value_rolling_std_win_2	value_rolling_mean_win_7	value_rolling_std_win_7
0	D10	2014-07-03	2076.2	NaN	NaN	NaN	NaN
1	D10	2014-07-04	2073.4	2074.80	1.40	2074.800000	1.400000
2	D10	2014-07-05	2048.7	2061.05	12.35	2066.100000	12.356645
3	D10	2014-07-06	2048.9	2048.80	0.10	2061.800000	13.037830
4	D10	2014-07-07	2006.4	2027.65	21.25	2050.720000	25.041038
...	...	...	...	...	...	...	...
9738	D500	2012-09-19	9418.8	9425.35	6.55	9382.071429	74.335988
9739	D500	2012-09-20	9365.7	9392.25	26.55	9396.400000	58.431303
9740	D500	2012-09-21	9445.9	9405.80	40.10	9419.114286	39.184451
9741	D500	2012-09-22	9497.9	9471.90	26.00	9438.928571	38.945336
9742	D500	2012-09-23	9545.3	9521.60	23.70	9449.028571	53.379416

9743 rows × 7 columns

3 Augment Time Series Signature

augment_timeseries_signature() is designed to assist in generating additional features from a given date column.

Help Doc Info: augment_timeseries_signature()

Use help(tk.augment_timeseries_signature) to review additional helpful documentation.

3.1 Basic Example

We’ll showcase an example using the m4_daily_df dataset by generating 29 additional features from the date column:

Code

# augment time series signature
m4_daily_df \
    .query('id == "D10"') \
    .augment_timeseries_signature(
        date_column = 'date'
    ) \
    .head()

	id	date	value	date_index_num	date_year	date_year_iso	date_half	...	date_mday	date_qday	date_yday	date_weekend	date_am_pm
0	D10	2014-07-03	2076.2	1404345600	2014	2014	2	...	3	3	184	0	am
1	D10	2014-07-04	2073.4	1404432000	2014	2014	2	...	4	4	185	0	am
2	D10	2014-07-05	2048.7	1404518400	2014	2014	2	...	5	5	186	0	am
3	D10	2014-07-06	2048.9	1404604800	2014	2014	2	...	6	6	187	1	am
4	D10	2014-07-07	2006.4	1404691200	2014	2014	2	...	7	7	188	0	am

5 rows × 32 columns
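To give a feel for what these features are, here is a small pandas sketch that derives a handful of calendar fields from a date column. It covers only a few of the 29 features; the column names follow the output above where they are visible, while date_quarter and date_month are naming assumptions, and the formulas are a plausible reconstruction rather than pytimetk's actual code.

```python
import pandas as pd

dates = pd.Series(pd.date_range(start="2014-07-03", periods=5), name="date")

signature = pd.DataFrame({
    "date": dates,
    # Seconds elapsed since the Unix epoch
    "date_index_num": (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1),
    "date_year": dates.dt.year,
    # Half of the year: 1 for January to June, 2 for July to December
    "date_half": (dates.dt.month - 1) // 6 + 1,
    "date_quarter": dates.dt.quarter,
    "date_month": dates.dt.month,
    "date_mday": dates.dt.day,
    "date_yday": dates.dt.dayofyear,
})

print(signature)
```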

4 Augment Holiday Signature

augment_holiday_signature() is used to flag holidays from a date column based on date and country.

Help Doc Info: augment_holiday_signature()

Use help(tk.augment_holiday_signature) to review additional helpful documentation.

4.1 Basic Example

We’ll showcase an example using some sample data:

Code

# create sample data
dates = pd.date_range(start = '2022-12-25', end = '2023-01-05')

df = pd.DataFrame({'date': dates})

# augment holiday signature: USA
df \
    .augment_holiday_signature(
        date_column  = 'date',
        country_name = 'UnitedStates'
    )

	date	is_holiday	before_holiday	after_holiday	holiday_name
0	2022-12-25	1	1	0	Christmas Day
1	2022-12-26	1	0	1	Christmas Day (Observed)
2	2022-12-27	0	0	1	NaN
3	2022-12-28	0	0	0	NaN
4	2022-12-29	0	0	0	NaN
5	2022-12-30	0	0	0	NaN
6	2022-12-31	0	1	0	NaN
7	2023-01-01	1	1	0	New Year's Day
8	2023-01-02	1	0	1	New Year's Day (Observed)
9	2023-01-03	0	0	1	NaN
10	2023-01-04	0	0	0	NaN
11	2023-01-05	0	0	0	NaN
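In the output above, before_holiday marks the day before a holiday and after_holiday marks the day after one. The sketch below reproduces those three flags in plain pandas. The holiday dates are copied by hand from the table above rather than looked up from a calendar, so this illustrates only the flag logic, not how augment_holiday_signature() finds holidays.

```python
import pandas as pd

dates = pd.Series(pd.date_range(start="2022-12-25", end="2023-01-05"), name="date")

# Holiday dates copied from the output table above (not a calendar lookup)
holiday_dates = pd.to_datetime(
    ["2022-12-25", "2022-12-26", "2023-01-01", "2023-01-02"]
)

one_day = pd.Timedelta(days=1)

flags = pd.DataFrame({
    "date": dates,
    "is_holiday": dates.isin(holiday_dates).astype(int),
    # 1 when the next day is a holiday
    "before_holiday": (dates + one_day).isin(holiday_dates).astype(int),
    # 1 when the previous day was a holiday
    "after_holiday": (dates - one_day).isin(holiday_dates).astype(int),
})

print(flags)
```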

5 Augment Fourier

augment_fourier() is used to add multiple Fourier series to time series data. Fourier terms are often used as a feature engineering technique in time series forecasting because they help capture hidden periodicities and cyclic patterns in the data. Capturing these patterns can improve predictive performance.
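As background, the textbook form of Fourier features is a set of sine and cosine waves at multiples of a base frequency. The sketch below builds them with numpy for a weekly period. The helper fourier_terms() and its column names are hypothetical, and the exact formula used by augment_fourier() may differ.

```python
import numpy as np
import pandas as pd

def fourier_terms(n_obs, period, max_order):
    """Hypothetical helper: sin/cos pairs for orders 1..max_order."""
    t = np.arange(n_obs)
    columns = {}
    for k in range(1, max_order + 1):
        angle = 2.0 * np.pi * k * t / period
        columns[f"sin_{k}"] = np.sin(angle)
        columns[f"cos_{k}"] = np.cos(angle)
    return pd.DataFrame(columns)

# Two weeks of daily observations, weekly period, orders 1 and 2
terms = fourier_terms(n_obs=14, period=7, max_order=2)
print(terms.round(3))
```

Every column repeats after 7 rows, which is what lets a model pick up a weekly cycle from these features.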

Help Doc Info: augment_fourier()

Use help(tk.augment_fourier) to review additional helpful documentation.

5.1 Basic Example

Code

# augment fourier with 7 periods and max order of 1
#m4_daily_df \
#    .query('id == "D10"') \
#    .augment_fourier(
#       date_column  = 'date',
#       value_column = 'value',
#       num_periods  = 7,
#       max_order    = 1
#    ) \
#   .head(20)

This adds the columns value_fourier_1_1 through value_fourier_1_7 to the data.

5.2 Augment Fourier with Grouped Time Series

augment_fourier() also works with grouped time series:

Code

# augment fourier with grouped time series
m4_daily_df \
    .groupby('id') \
    .augment_fourier(
        date_column  = 'date',
        value_column = 'value',
        num_periods  = 7,
        max_order    = 1
    ) \
    .head(20)

	id	date	value	value_fourier_1_1	value_fourier_1_2	value_fourier_1_3	value_fourier_1_4	value_fourier_1_5	value_fourier_1_6	value_fourier_1_7
0	D10	2014-07-03	2076.2	0.394510	-0.725024	0.937927	-0.998682	0.897435	-0.650609	0.298243
1	D10	2014-07-04	2073.4	-0.980653	0.383931	0.830342	-0.709015	-0.552759	0.925423	0.190450
2	D10	2014-07-05	2048.7	0.011484	0.022967	0.034446	0.045921	0.057390	0.068852	0.080304
3	D10	2014-07-06	2048.9	0.975899	-0.425928	-0.790004	0.770723	0.453624	-0.968706	-0.030835
4	D10	2014-07-07	2006.4	-0.415510	0.755886	-0.959581	0.989762	-0.840972	0.540115	-0.141593
5	D10	2014-07-08	2017.6	-0.803876	-0.956286	-0.333715	0.559301	0.999055	0.629169	-0.250600
6	D10	2014-07-09	2019.1	0.748318	0.992779	0.568784	-0.238184	-0.884778	-0.935635	-0.356511
7	D10	2014-07-10	2007.4	0.494070	-0.859111	0.999790	-0.879368	0.529294	-0.040992	-0.458015
8	D10	2014-07-11	2010.0	-0.952864	0.578192	0.602021	-0.943494	-0.029515	0.961404	-0.553858
9	D10	2014-07-12	2001.5	-0.099581	-0.198171	-0.294792	-0.388482	-0.478310	-0.563384	-0.642856
10	D10	2014-07-13	1978.8	0.994091	-0.215816	-0.947238	0.421459	0.855740	-0.607239	-0.723909
11	D10	2014-07-14	1988.3	-0.311977	0.592812	-0.814472	0.954831	-0.999879	0.945118	-0.796015
12	D10	2014-07-15	2000.7	-0.864932	-0.868201	-0.006551	0.861625	0.871433	0.013101	-0.858282
13	D10	2014-07-16	2010.5	0.670062	0.994781	0.806801	0.203005	-0.505418	-0.953354	-0.909941
14	D10	2014-07-17	2014.5	0.587524	-0.950856	0.951356	-0.588831	0.001617	0.586214	-0.950354
15	D10	2014-07-18	1962.6	-0.913299	0.743956	0.307286	-0.994265	0.502625	0.584837	-0.979022
16	D10	2014-07-19	1948.0	-0.209415	-0.409542	-0.591509	-0.747244	-0.869842	-0.953865	-0.995589
17	D10	2014-07-20	1943.0	0.999997	0.004934	-0.999973	-0.009867	0.999924	0.014800	-0.999851
18	D10	2014-07-21	1933.3	-0.204588	0.400521	-0.579511	0.733985	-0.857409	0.944561	-0.991756
19	D10	2014-07-22	1891.0	-0.915297	-0.737326	0.321336	0.996182	0.481148	-0.608588	-0.971403

6 More Coming Soon…

We are in the early stages of development, but the potential for pytimetk in Python is already clear. 🐍

Please ⭐ us on GitHub (it takes 2 seconds and means a lot).
To make requests, please see our Project Roadmap GH Issue #2.
Want to contribute? See our contributing guide here.