Adding Features (Augmenting)

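This section covers the augment family of functions, which are used to add many additional time series features to a dataset. We’ll cover how to use the following functions: augment_lags() / augment_leads(), augment_rolling(), augment_timeseries_signature(), augment_holiday_signature(), and augment_fourier().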

1 Augment Lags / Leads

Lags are commonly used in time series forecasting to incorporate the past values of a feature as predictors. Leads, while less common than lags, can be useful when you want to predict a future value based on other future values.

Help Doc Info: augment_lags(), augment_leads()

Use help(tk.augment_lags) and help(tk.augment_leads) to review additional helpful documentation.

1.1 Basic Examples

Add 1 or more lags / leads to a dataset:

Code
# import libraries
import pytimetk as tk
import pandas as pd
import numpy as np
import random

# create sample data
dates = pd.date_range(start = '2023-09-18', end = '2023-09-24')
values = [random.randint(10, 50) for _ in range(7)]

df = pd.DataFrame({
    'date': dates,
    'value': values
})

df
date value
0 2023-09-18 25
1 2023-09-19 50
2 2023-09-20 49
3 2023-09-21 45
4 2023-09-22 48
5 2023-09-23 18
6 2023-09-24 18

Create lag / lead of 3 days:

Code
# augment lag
df \
    .augment_lags(
        date_column  = 'date',
        value_column = 'value',
        lags         = 3
    )
date value value_lag_3
0 2023-09-18 25 NaN
1 2023-09-19 50 NaN
2 2023-09-20 49 NaN
3 2023-09-21 45 25.0
4 2023-09-22 48 50.0
5 2023-09-23 18 49.0
6 2023-09-24 18 45.0
Code
# augment leads
df \
    .augment_leads(
        date_column  = 'date',
        value_column = 'value',
        leads        = 3
    )
date value value_lead_3
0 2023-09-18 25 45.0
1 2023-09-19 50 48.0
2 2023-09-20 49 18.0
3 2023-09-21 45 18.0
4 2023-09-22 48 NaN
5 2023-09-23 18 NaN
6 2023-09-24 18 NaN

We can create multiple lag / lead values for a single time series. Passing a tuple such as (1, 3) generates every lag (or lead) from 1 through 3:

Code
# multiple lagged values for a single time series
df \
    .augment_lags(
        date_column  = 'date',
        value_column = 'value',
        lags         = (1, 3)
    )
date value value_lag_1 value_lag_2 value_lag_3
0 2023-09-18 25 NaN NaN NaN
1 2023-09-19 50 25.0 NaN NaN
2 2023-09-20 49 50.0 25.0 NaN
3 2023-09-21 45 49.0 50.0 25.0
4 2023-09-22 48 45.0 49.0 50.0
5 2023-09-23 18 48.0 45.0 49.0
6 2023-09-24 18 18.0 48.0 45.0
Code
# multiple leads values for a single time series
df \
    .augment_leads(
        date_column  = 'date',
        value_column = 'value',
        leads        = (1, 3)
    )
date value value_lead_1 value_lead_2 value_lead_3
0 2023-09-18 25 50.0 49.0 45.0
1 2023-09-19 50 49.0 45.0 48.0
2 2023-09-20 49 45.0 48.0 18.0
3 2023-09-21 45 48.0 18.0 18.0
4 2023-09-22 48 18.0 18.0 NaN
5 2023-09-23 18 18.0 NaN NaN
6 2023-09-24 18 NaN NaN NaN
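For intuition, the lag and lead columns above line up with what pandas' shift() produces. The following is a minimal sketch in plain pandas, not a description of how pytimetk computes these columns internally. The values are copied from the sample output above rather than drawn at random, and the name shift_df is only illustrative.

```python
import pandas as pd

# Fixed values copied from the sample output above (the original draws them at random)
shift_df = pd.DataFrame({
    "date": pd.date_range(start="2023-09-18", end="2023-09-24"),
    "value": [25, 50, 49, 45, 48, 18, 18],
})

# A lag of 3 pulls in the value from 3 rows earlier;
# a lead of 3 pulls in the value from 3 rows later
shift_df["value_lag_3"] = shift_df["value"].shift(3)
shift_df["value_lead_3"] = shift_df["value"].shift(-3)

print(shift_df)
```

The first three rows of value_lag_3 and the last three rows of value_lead_3 are NaN because no observation exists that far back or that far ahead.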

1.2 Augment Lags / Leads For Grouped Time Series

augment_lags() and augment_leads() also work for grouped time series data. Let’s use the m4_daily_df dataset to showcase examples:

Code
# load m4_daily_df
m4_daily_df = tk.load_dataset('m4_daily', parse_dates = ['date'])
Code
# augment lags for grouped time series
m4_daily_df \
    .groupby("id") \
    .augment_lags(
        date_column  = 'date',
        value_column = 'value',
        lags         = (1, 7)
    )
id date value value_lag_1 value_lag_2 value_lag_3 value_lag_4 value_lag_5 value_lag_6 value_lag_7
0 D10 2014-07-03 2076.2 NaN NaN NaN NaN NaN NaN NaN
1 D10 2014-07-04 2073.4 2076.2 NaN NaN NaN NaN NaN NaN
2 D10 2014-07-05 2048.7 2073.4 2076.2 NaN NaN NaN NaN NaN
3 D10 2014-07-06 2048.9 2048.7 2073.4 2076.2 NaN NaN NaN NaN
4 D10 2014-07-07 2006.4 2048.9 2048.7 2073.4 2076.2 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ...
9738 D500 2012-09-19 9418.8 9431.9 9437.7 9474.6 9359.2 9286.9 9265.4 9091.4
9739 D500 2012-09-20 9365.7 9418.8 9431.9 9437.7 9474.6 9359.2 9286.9 9265.4
9740 D500 2012-09-21 9445.9 9365.7 9418.8 9431.9 9437.7 9474.6 9359.2 9286.9
9741 D500 2012-09-22 9497.9 9445.9 9365.7 9418.8 9431.9 9437.7 9474.6 9359.2
9742 D500 2012-09-23 9545.3 9497.9 9445.9 9365.7 9418.8 9431.9 9437.7 9474.6

9743 rows × 10 columns

Code
# augment leads for grouped time series
m4_daily_df \
    .groupby("id") \
    .augment_leads(
        date_column  = 'date',
        value_column = 'value',
        leads        = (1, 7)
    )
id date value value_lead_1 value_lead_2 value_lead_3 value_lead_4 value_lead_5 value_lead_6 value_lead_7
0 D10 2014-07-03 2076.2 2073.4 2048.7 2048.9 2006.4 2017.6 2019.1 2007.4
1 D10 2014-07-04 2073.4 2048.7 2048.9 2006.4 2017.6 2019.1 2007.4 2010.0
2 D10 2014-07-05 2048.7 2048.9 2006.4 2017.6 2019.1 2007.4 2010.0 2001.5
3 D10 2014-07-06 2048.9 2006.4 2017.6 2019.1 2007.4 2010.0 2001.5 1978.8
4 D10 2014-07-07 2006.4 2017.6 2019.1 2007.4 2010.0 2001.5 1978.8 1988.3
... ... ... ... ... ... ... ... ... ... ...
9738 D500 2012-09-19 9418.8 9365.7 9445.9 9497.9 9545.3 NaN NaN NaN
9739 D500 2012-09-20 9365.7 9445.9 9497.9 9545.3 NaN NaN NaN NaN
9740 D500 2012-09-21 9445.9 9497.9 9545.3 NaN NaN NaN NaN NaN
9741 D500 2012-09-22 9497.9 9545.3 NaN NaN NaN NaN NaN NaN
9742 D500 2012-09-23 9545.3 NaN NaN NaN NaN NaN NaN NaN

9743 rows × 10 columns
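Grouping matters because each series should only see its own history. The sketch below uses plain pandas on a small hypothetical two-series dataset (not m4_daily) to show the difference between shifting within groups and shifting the whole column. The names grouped_df and naive_lag_1 are made up for illustration.

```python
import pandas as pd

# Hypothetical two-series dataset, small enough to inspect by eye
grouped_df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B"],
    "date": list(pd.date_range("2023-01-01", periods=3)) * 2,
    "value": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Shifting within each group: the first row of every series has no lag
grouped_df["value_lag_1"] = grouped_df.groupby("id")["value"].shift(1)

# Shifting without grouping: the last value of A leaks into the first row of B
grouped_df["naive_lag_1"] = grouped_df["value"].shift(1)

print(grouped_df)
```

In the grouped column the first row of series B is NaN, while the ungrouped column wrongly carries over the value 3.0 from series A.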

2 Augment Rolling

A Rolling Window refers to a specific-sized subset of time series data that moves sequentially over the dataset.

Rolling windows play a crucial role in time series forecasting due to their ability to smooth out data, highlight seasonality, and detect anomalies.

augment_rolling() applies multiple rolling window functions with varying window sizes to time series data.

Help Doc Info: augment_rolling()

Use help(tk.augment_rolling) to review additional helpful documentation.

2.1 Basic Examples

We’ll continue with the use of our sample df created earlier:

Code
# window = 3 days, window function = mean
df \
    .augment_rolling(
        date_column  = 'date',
        value_column = 'value',
        window       = 3,
        window_func  = 'mean'
    )
date value value_rolling_mean_win_3
0 2023-09-18 25 NaN
1 2023-09-19 50 NaN
2 2023-09-20 49 41.333333
3 2023-09-21 45 48.000000
4 2023-09-22 48 47.333333
5 2023-09-23 18 37.000000
6 2023-09-24 18 28.000000

It is important to understand how the center parameter in augment_rolling() works.

center

When set to True, the rolling window is centered, meaning that the result is assigned to the row at the center of the window. When set to False (default), the rolling window is not centered, meaning that the result is assigned to the row at the end of the window.

Let’s see an example:

Code
# augment rolling: center = True
df \
    .augment_rolling(
        date_column  = 'date',
        value_column = 'value',
        window       = 3,
        window_func  = 'mean',
        center       = True
    )
date value value_rolling_mean_win_3
0 2023-09-18 25 NaN
1 2023-09-19 50 41.333333
2 2023-09-20 49 48.000000
3 2023-09-21 45 47.333333
4 2023-09-22 48 37.000000
5 2023-09-23 18 28.000000
6 2023-09-24 18 NaN

Note that we are using a 3 day rolling window and applying a mean to value. In simpler terms, value_rolling_mean_win_3 is a 3 day rolling average of value with center set to True. Because each result is placed at the middle of its window, the first mean appears at 2023-09-19, and the last row, which has no following day, is NaN.

Code
# augment rolling: center = False
df \
    .augment_rolling(
        date_column  = 'date',
        value_column = 'value',
        window       = 3,
        window_func  = 'mean',
        center       = False
    )
date value value_rolling_mean_win_3
0 2023-09-18 25 NaN
1 2023-09-19 50 NaN
2 2023-09-20 49 41.333333
3 2023-09-21 45 48.000000
4 2023-09-22 48 47.333333
5 2023-09-23 18 37.000000
6 2023-09-24 18 28.000000

Note that we are using a 3 day rolling window and applying a mean to value. In simpler terms, value_rolling_mean_win_3 is a 3 day rolling average of value with center set to False. Thus the first mean appears at 2023-09-20. The rows for 2023-09-18 and 2023-09-19 are NaN because a full 3 day window is not yet available at those dates.
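The two outputs above can be compared side by side with pandas' own rolling(). This is a minimal sketch in plain pandas, using the values from the sample output, and it is not meant to describe pytimetk's internals. The names series, trailing, centered, and comparison are illustrative.

```python
import pandas as pd

# Values copied from the sample output above
series = pd.Series(
    [25, 50, 49, 45, 48, 18, 18],
    index=pd.date_range("2023-09-18", "2023-09-24"),
)

# center=False labels each 3 day mean with the last day of its window
trailing = series.rolling(window=3, center=False).mean()

# center=True labels each 3 day mean with the middle day of its window
centered = series.rolling(window=3, center=True).mean()

comparison = pd.DataFrame({"value": series, "trailing": trailing, "centered": centered})
print(comparison)
```

The same five means appear in both columns. Centering only moves each one up by one row, which leaves a NaN at the start and at the end instead of two at the start.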

2.2 Augment Rolling with Multiple Windows and Window Functions

Multiple window sizes and window functions can be passed to the window and window_func parameters:

Code
# augment rolling: window of 2 & 7 days, window_func of mean and standard deviation
m4_daily_df \
    .query('id == "D10"') \
    .augment_rolling(
        date_column  = 'date',
        value_column = 'value',
        window       = [2, 7],
        window_func  = ['mean', ('std', lambda x: x.std())]
    )
id date value value_rolling_mean_win_2 value_rolling_std_win_2 value_rolling_mean_win_7 value_rolling_std_win_7
0 D10 2014-07-03 2076.2 NaN NaN NaN NaN
1 D10 2014-07-04 2073.4 2074.80 1.40 2074.800000 1.400000
2 D10 2014-07-05 2048.7 2061.05 12.35 2066.100000 12.356645
3 D10 2014-07-06 2048.9 2048.80 0.10 2061.800000 13.037830
4 D10 2014-07-07 2006.4 2027.65 21.25 2050.720000 25.041038
... ... ... ... ... ... ... ...
669 D10 2016-05-02 2630.7 2615.85 14.85 2579.471429 28.868159
670 D10 2016-05-03 2649.3 2640.00 9.30 2594.800000 33.081631
671 D10 2016-05-04 2631.8 2640.55 8.75 2601.371429 35.145563
672 D10 2016-05-05 2622.5 2627.15 4.65 2607.457143 34.584508
673 D10 2016-05-06 2620.1 2621.30 1.20 2618.328571 22.923270

674 rows × 7 columns
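The combination of window sizes and functions can be pictured as a nested loop that adds one column per pair. The sketch below does this in plain pandas on the first five D10 values shown above. It is an illustration under assumptions, not pytimetk's implementation: the window sizes [2, 3] are chosen to fit the tiny sample, and the handling of incomplete windows may differ from pytimetk's. The std values in the output above are consistent with a population standard deviation (for example, 1.40 for the pair 2076.2 and 2073.4), so the sketch passes raw NumPy arrays to the lambda to get the same behavior.

```python
import pandas as pd

# First five D10 values from the output above
multi_df = pd.DataFrame({
    "date": pd.date_range("2014-07-03", periods=5),
    "value": [2076.2, 2073.4, 2048.7, 2048.9, 2006.4],
})

windows = [2, 3]  # illustrative sizes for this small sample
window_funcs = [("mean", lambda x: x.mean()), ("std", lambda x: x.std())]

for win in windows:
    roller = multi_df["value"].rolling(window=win)
    for name, func in window_funcs:
        # raw=True passes NumPy arrays, so x.std() is the population std (ddof=0)
        multi_df[f"value_rolling_{name}_win_{win}"] = roller.apply(func, raw=True)

print(multi_df)
```

With two window sizes and two functions, four new columns are added, named after the pattern seen in the output above.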

2.3 Augment Rolling with Grouped Time Series

augment_rolling() can also be used on grouped time series data:

Code
# augment rolling on grouped time series: window of 2 & 7 days, window_func of mean and standard deviation
m4_daily_df \
    .groupby('id') \
    .augment_rolling(
        date_column  = 'date',
        value_column = 'value',
        window       = [2, 7],
        window_func  = ['mean', ('std', lambda x: x.std())]
    )
id date value value_rolling_mean_win_2 value_rolling_std_win_2 value_rolling_mean_win_7 value_rolling_std_win_7
0 D10 2014-07-03 2076.2 NaN NaN NaN NaN
1 D10 2014-07-04 2073.4 2074.80 1.40 2074.800000 1.400000
2 D10 2014-07-05 2048.7 2061.05 12.35 2066.100000 12.356645
3 D10 2014-07-06 2048.9 2048.80 0.10 2061.800000 13.037830
4 D10 2014-07-07 2006.4 2027.65 21.25 2050.720000 25.041038
... ... ... ... ... ... ... ...
9738 D500 2012-09-19 9418.8 9425.35 6.55 9382.071429 74.335988
9739 D500 2012-09-20 9365.7 9392.25 26.55 9396.400000 58.431303
9740 D500 2012-09-21 9445.9 9405.80 40.10 9419.114286 39.184451
9741 D500 2012-09-22 9497.9 9471.90 26.00 9438.928571 38.945336
9742 D500 2012-09-23 9545.3 9521.60 23.70 9449.028571 53.379416

9743 rows × 7 columns

3 Augment Time Series Signature

augment_timeseries_signature() is designed to assist in generating additional features from a given date column.

Help Doc Info: augment_timeseries_signature()

Use help(tk.augment_timeseries_signature) to review additional helpful documentation.

3.1 Basic Example

We’ll showcase an example using the m4_daily_df dataset by generating 29 additional features from the date column:

Code
# augment time series signature
m4_daily_df \
    .query('id == "D10"') \
    .augment_timeseries_signature(
        date_column = 'date'
    ) \
    .head()
id date value date_index_num date_year date_year_iso date_yearstart date_yearend date_leapyear date_half ... date_mday date_qday date_yday date_weekend date_hour date_minute date_second date_msecond date_nsecond date_am_pm
0 D10 2014-07-03 2076.2 1404345600 2014 2014 0 0 0 2 ... 3 3 184 0 0 0 0 0 0 am
1 D10 2014-07-04 2073.4 1404432000 2014 2014 0 0 0 2 ... 4 4 185 0 0 0 0 0 0 am
2 D10 2014-07-05 2048.7 1404518400 2014 2014 0 0 0 2 ... 5 5 186 0 0 0 0 0 0 am
3 D10 2014-07-06 2048.9 1404604800 2014 2014 0 0 0 2 ... 6 6 187 1 0 0 0 0 0 am
4 D10 2014-07-07 2006.4 1404691200 2014 2014 0 0 0 2 ... 7 7 188 0 0 0 0 0 0 am

5 rows × 32 columns
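To get a feel for where these columns come from, a handful of them can be derived by hand with the pandas .dt accessor. This is a rough sketch that covers only some of the columns visible in the output above. The formulas are assumptions that happen to reproduce the displayed values, not pytimetk's actual code, and sig_df is an illustrative name.

```python
import pandas as pd

sig_df = pd.DataFrame({"date": pd.date_range("2014-07-03", periods=5)})
d = sig_df["date"]

# Seconds since the Unix epoch (1970-01-01)
sig_df["date_index_num"] = (d - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

# Calendar components
sig_df["date_year"] = d.dt.year
sig_df["date_half"] = (d.dt.month - 1) // 6 + 1   # 1 = Jan to Jun, 2 = Jul to Dec
sig_df["date_mday"] = d.dt.day                    # day of the month
sig_df["date_yday"] = d.dt.dayofyear              # day of the year

print(sig_df)
```

For 2014-07-03 this gives date_index_num 1404345600, date_half 2, date_mday 3, and date_yday 184, matching the first row of the output above.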

4 Augment Holiday Signature

augment_holiday_signature() is used to flag holidays from a date column based on date and country.

Help Doc Info: augment_holiday_signature()

Use help(tk.augment_holiday_signature) to review additional helpful documentation.

4.1 Basic Example

We’ll showcase an example using some sample data:

Code
# create sample data
dates = pd.date_range(start = '2022-12-25', end = '2023-01-05')

df = pd.DataFrame({'date': dates})

# augment holiday signature: USA
df \
    .augment_holiday_signature(
        date_column  = 'date',
        country_name = 'UnitedStates'
    )
date is_holiday before_holiday after_holiday holiday_name
0 2022-12-25 1 1 0 Christmas Day
1 2022-12-26 1 0 1 Christmas Day (Observed)
2 2022-12-27 0 0 1 NaN
3 2022-12-28 0 0 0 NaN
4 2022-12-29 0 0 0 NaN
5 2022-12-30 0 0 0 NaN
6 2022-12-31 0 1 0 NaN
7 2023-01-01 1 1 0 New Year's Day
8 2023-01-02 1 0 1 New Year's Day (Observed)
9 2023-01-03 0 0 1 NaN
10 2023-01-04 0 0 0 NaN
11 2023-01-05 0 0 0 NaN
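Reading the output above, before_holiday appears to flag rows whose next day is a holiday, and after_holiday rows whose previous day is a holiday. The sketch below reproduces the flags in plain pandas under that assumption. The holiday dates are hard-coded from the output above purely for illustration, whereas augment_holiday_signature() looks them up by country. The names holiday_names and flags are hypothetical.

```python
import pandas as pd

# Holiday dates hard-coded from the output above (illustration only)
holiday_names = {
    pd.Timestamp("2022-12-25"): "Christmas Day",
    pd.Timestamp("2022-12-26"): "Christmas Day (Observed)",
    pd.Timestamp("2023-01-01"): "New Year's Day",
    pd.Timestamp("2023-01-02"): "New Year's Day (Observed)",
}
holiday_dates = list(holiday_names)
one_day = pd.Timedelta(days=1)

flags = pd.DataFrame({"date": pd.date_range("2022-12-25", "2023-01-05")})

# The date itself is a holiday
flags["is_holiday"] = flags["date"].isin(holiday_dates).astype(int)
# The following day is a holiday
flags["before_holiday"] = (flags["date"] + one_day).isin(holiday_dates).astype(int)
# The previous day is a holiday
flags["after_holiday"] = (flags["date"] - one_day).isin(holiday_dates).astype(int)
# Name of the holiday, NaN on ordinary days
flags["holiday_name"] = flags["date"].map(holiday_names)

print(flags)
```

A date can carry more than one flag: 2023-01-01 is both a holiday and the day before an observed holiday.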

5 Augment Fourier

augment_fourier() is used to add multiple Fourier series to time series data. Fourier terms are often used as a feature engineering technique in time series forecasting because they help capture hidden periodicities and cyclic patterns in the data, which can improve predictive performance.

Help Doc Info: augment_fourier()

Use help(tk.augment_fourier) to review additional helpful documentation.

5.1 Basic Example

Code
# augment fourier with 7 periods and max order of 1
#m4_daily_df \
#    .query('id == "D10"') \
#    .augment_fourier(
#       date_column  = 'date',
#       value_column = 'value',
#       num_periods  = 7,
#       max_order    = 1
#    ) \
#   .head(20)

Notice the additional value_fourier_1_1 to value_fourier_1_7 columns that have been added to the data.

5.2 Augment Fourier with Grouped Time Series

augment_fourier() also works with grouped time series:

Code
# augment fourier with grouped time series
m4_daily_df \
    .groupby('id') \
    .augment_fourier(
        date_column  = 'date',
        value_column = 'value',
        num_periods  = 7,
        max_order    = 1
    ) \
    .head(20)
id date value value_fourier_1_1 value_fourier_1_2 value_fourier_1_3 value_fourier_1_4 value_fourier_1_5 value_fourier_1_6 value_fourier_1_7
0 D10 2014-07-03 2076.2 0.394510 -0.725024 0.937927 -0.998682 0.897435 -0.650609 0.298243
1 D10 2014-07-04 2073.4 -0.980653 0.383931 0.830342 -0.709015 -0.552759 0.925423 0.190450
2 D10 2014-07-05 2048.7 0.011484 0.022967 0.034446 0.045921 0.057390 0.068852 0.080304
3 D10 2014-07-06 2048.9 0.975899 -0.425928 -0.790004 0.770723 0.453624 -0.968706 -0.030835
4 D10 2014-07-07 2006.4 -0.415510 0.755886 -0.959581 0.989762 -0.840972 0.540115 -0.141593
5 D10 2014-07-08 2017.6 -0.803876 -0.956286 -0.333715 0.559301 0.999055 0.629169 -0.250600
6 D10 2014-07-09 2019.1 0.748318 0.992779 0.568784 -0.238184 -0.884778 -0.935635 -0.356511
7 D10 2014-07-10 2007.4 0.494070 -0.859111 0.999790 -0.879368 0.529294 -0.040992 -0.458015
8 D10 2014-07-11 2010.0 -0.952864 0.578192 0.602021 -0.943494 -0.029515 0.961404 -0.553858
9 D10 2014-07-12 2001.5 -0.099581 -0.198171 -0.294792 -0.388482 -0.478310 -0.563384 -0.642856
10 D10 2014-07-13 1978.8 0.994091 -0.215816 -0.947238 0.421459 0.855740 -0.607239 -0.723909
11 D10 2014-07-14 1988.3 -0.311977 0.592812 -0.814472 0.954831 -0.999879 0.945118 -0.796015
12 D10 2014-07-15 2000.7 -0.864932 -0.868201 -0.006551 0.861625 0.871433 0.013101 -0.858282
13 D10 2014-07-16 2010.5 0.670062 0.994781 0.806801 0.203005 -0.505418 -0.953354 -0.909941
14 D10 2014-07-17 2014.5 0.587524 -0.950856 0.951356 -0.588831 0.001617 0.586214 -0.950354
15 D10 2014-07-18 1962.6 -0.913299 0.743956 0.307286 -0.994265 0.502625 0.584837 -0.979022
16 D10 2014-07-19 1948.0 -0.209415 -0.409542 -0.591509 -0.747244 -0.869842 -0.953865 -0.995589
17 D10 2014-07-20 1943.0 0.999997 0.004934 -0.999973 -0.009867 0.999924 0.014800 -0.999851
18 D10 2014-07-21 1933.3 -0.204588 0.400521 -0.579511 0.733985 -0.857409 0.944561 -0.991756
19 D10 2014-07-22 1891.0 -0.915297 -0.737326 0.321336 0.996182 0.481148 -0.608588 -0.971403
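As general background on why Fourier features capture cycles, the textbook construction is a set of sine and cosine terms evaluated on a time index. The sketch below builds such terms with NumPy for an assumed weekly period on daily data. It does not try to reproduce the values in the output above, and its parameters (period, max_order) and column names are illustrative rather than pytimetk's own parameterization.

```python
import numpy as np
import pandas as pd

fourier_df = pd.DataFrame({"date": pd.date_range("2014-07-03", periods=14)})

period = 7      # assumed weekly cycle for daily data
max_order = 2   # number of harmonics to generate
t = np.arange(len(fourier_df))  # simple integer time index

for k in range(1, max_order + 1):
    angle = 2 * np.pi * k * t / period
    fourier_df[f"sin_{period}_{k}"] = np.sin(angle)
    fourier_df[f"cos_{period}_{k}"] = np.cos(angle)

print(fourier_df.head(8))
```

Each column repeats every 7 rows, so a model can use these smooth features to represent a weekly pattern without one-hot encoding the day of the week.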

6 More Coming Soon…

We are in the early stages of development, but the potential for pytimetk in Python is already clear. 🐍