This section covers data wrangling for time series using pytimetk. We’ll show examples for the following functions:
summarize_by_time()
future_frame()
pad_by_time()
Prerequisite
Before proceeding, be sure to review the Timetk Basics section if you haven’t already.
1 Summarize by Time
summarize_by_time() aggregates time series data from a higher frequency (shorter time periods, e.g., daily) to a lower frequency (longer time periods, e.g., weekly).
Load Libraries & Data
Code
# import libraries
import pytimetk as tk
import pandas as pd
import numpy as np

# import data
m4_daily_df = tk.load_dataset('m4_daily', parse_dates = ['date'])

print(m4_daily_df.head())
print('\nLength of the full dataset:', len(m4_daily_df))
id date value
0 D10 2014-07-03 2076.2
1 D10 2014-07-04 2073.4
2 D10 2014-07-05 2048.7
3 D10 2014-07-06 2048.9
4 D10 2014-07-07 2006.4
Length of the full dataset: 9743
Help Doc Info: summarize_by_time
Use help(tk.summarize_by_time) to review additional helpful documentation.
1.1 Basic Example
The m4_daily dataset has a daily frequency. Say we are interested in forecasting at the weekly level. We can use summarize_by_time() to aggregate to a weekly level.
Code
# summarize by time: daily to weekly
summarized_df = m4_daily_df \
    .summarize_by_time(
        date_column  = 'date',
        value_column = 'value',
        freq         = 'W',
        agg_func     = 'sum'
    )

print(summarized_df.head())
print('\nLength of the full dataset:', len(summarized_df))
date value
0 1978-06-25 27328.12
1 1978-07-02 63621.88
2 1978-07-09 63334.38
3 1978-07-16 63737.51
4 1978-07-23 64718.76
Length of the full dataset: 1977
The data has now been aggregated at the weekly level. Notice we now have 1977 rows, compared to the full dataset, which had 9743 rows.
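As a point of reference, the same daily-to-weekly sum aggregation can be reproduced with plain pandas via resample('W'). This is a minimal sketch on synthetic data (not the m4_daily dataset), just to show the underlying operation:

```python
import pandas as pd
import numpy as np

# 14 days of synthetic daily data
daily = pd.DataFrame({
    "date": pd.date_range("2014-07-03", periods=14, freq="D"),
    "value": np.arange(14, dtype=float),
})

# aggregate daily values into week-ending-Sunday sums
weekly = (
    daily.set_index("date")
         .resample("W")["value"]
         .sum()
         .reset_index()
)
print(weekly)
```

summarize_by_time() wraps this kind of resampling while also handling grouped data, as shown in the next sections.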
1.2 Additional Aggregate Functions
summarize_by_time() can take additional aggregate functions in the agg_func argument.
Code
# summarize by time with additional aggregate functions
summarized_multiple_agg_df = m4_daily_df \
    .summarize_by_time(
        date_column  = 'date',
        value_column = 'value',
        freq         = 'W',
        agg_func     = ['sum', 'min', 'max']
    )

summarized_multiple_agg_df.head()
        date  value_sum  value_min  value_max
0 1978-06-25   27328.12    9103.12    9115.62
1 1978-07-02   63621.88    9046.88    9115.62
2 1978-07-09   63334.38    9028.12    9096.88
3 1978-07-16   63737.51    9075.00    9146.88
4 1978-07-23   64718.76    9171.88    9315.62
1.3 Summarize by Time with Grouped Time Series
summarize_by_time() also works with groups.
Code
# summarize by time with groups and additional aggregate functions
grouped_summarized_df = (
    m4_daily_df
        .groupby('id')
        .summarize_by_time(
            date_column  = 'date',
            value_column = 'value',
            freq         = 'W',
            agg_func     = [
                'sum',
                'min',
                ('q25', lambda x: np.quantile(x, 0.25)),
                'median',
                ('q75', lambda x: np.quantile(x, 0.75)),
                'max'
            ],
        )
)

grouped_summarized_df.head()
    id       date  value_sum  value_min  value_q25  value_median  value_q75  value_max
0  D10 2014-07-06     8247.2     2048.7    2048.85       2061.15    2074.10     2076.2
1  D10 2014-07-13    14040.8     1978.8    2003.95       2007.40    2013.80     2019.1
2  D10 2014-07-20    13867.6     1943.0    1955.30       1988.30    2005.60     2014.5
3  D10 2014-07-27    13266.3     1876.0    1887.15       1891.00    1895.85     1933.3
4  D10 2014-08-03    13471.2     1886.2    1914.60       1920.00    1939.55     1956.7
2 Future Frame
future_frame() can be used to extend time series data beyond the existing index (date). This is necessary when preparing a dataset for future predictions.
Help Doc Info: future_frame()
Use help(tk.future_frame) to review additional helpful documentation.
2.1 Basic Example
We’ll continue with our use of the m4_daily_df dataset. Recall we’ve already aggregated it at the weekly level (summarized_df). Let’s check out the last week in summarized_df:
Code
# last week in dataset
summarized_df \
    .sort_values(by = 'date', ascending = True) \
    .iloc[: -1] \
    .tail(1)
           date    value
1975 2016-05-01  17959.8
iloc()
iloc[: -1] is used to filter out the last row, keeping only dates that fall at the start of the week.
We can see that the last week is the week of 2016-05-01. Now say we want to forecast the next 8 weeks. We can extend the dataset beyond the week of 2016-05-01:
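The code that creates summarized_extended_df is not shown here; based on the surrounding text it would presumably be a call like summarized_df.future_frame(date_column='date', length_out=8). A pandas-only sketch of what that extension does, using a small stand-in for summarized_df (so the exact dates below are illustrative, not the real m4_daily output):

```python
import pandas as pd

# small stand-in for the weekly summarized_df (last week 2016-05-01)
summarized_df = pd.DataFrame({
    "date": pd.to_datetime(["2016-04-17", "2016-04-24", "2016-05-01"]),
    "value": [18000.0, 18100.0, 17959.8],
})

# conceptually, future_frame(date_column='date', length_out=8) appends
# 8 future weekly timestamps with missing values
future_dates = pd.date_range(
    start=summarized_df["date"].max(), periods=9, freq="W"
)[1:]  # drop the first element, which is the existing last date

summarized_extended_df = pd.concat(
    [summarized_df, pd.DataFrame({"date": future_dates})],
    ignore_index=True,
)
print(summarized_extended_df.tail(8))
```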
To get only the future data, we can filter the dataset for where value is missing (np.nan).
Code
# get only future data
summarized_extended_df \
    .query('value.isna()')
           date  value
1977 2016-05-15    NaN
1978 2016-05-22    NaN
1979 2016-05-29    NaN
1980 2016-06-05    NaN
1981 2016-06-12    NaN
1982 2016-06-19    NaN
1983 2016-06-26    NaN
1984 2016-07-03    NaN
2.2 Future Frame with Grouped Time Series
future_frame() also works for grouped time series. We can see an example using our grouped summarized dataset (grouped_summarized_df) from earlier:
Code
# future frame with grouped time series
grouped_summarized_df[['id', 'date', 'value_sum']] \
    .groupby('id') \
    .future_frame(
        date_column = 'date',
        length_out  = 8
    ) \
    .query('value_sum.isna()') # filtering to return only the future data
        id       date  value_sum
1395   D10 2016-05-15        NaN
1396   D10 2016-05-22        NaN
1397   D10 2016-05-29        NaN
1398   D10 2016-06-05        NaN
1399   D10 2016-06-12        NaN
1400   D10 2016-06-19        NaN
1401   D10 2016-06-26        NaN
1402   D10 2016-07-03        NaN
1403  D160 2011-07-10        NaN
1404  D160 2011-07-17        NaN
1405  D160 2011-07-24        NaN
1406  D160 2011-07-31        NaN
1407  D160 2011-08-07        NaN
1408  D160 2011-08-14        NaN
1409  D160 2011-08-21        NaN
1410  D160 2011-08-28        NaN
1411  D410 1980-05-11        NaN
1412  D410 1980-05-18        NaN
1413  D410 1980-05-25        NaN
1414  D410 1980-06-01        NaN
1415  D410 1980-06-08        NaN
1416  D410 1980-06-15        NaN
1417  D410 1980-06-22        NaN
1418  D410 1980-06-29        NaN
1419  D500 2012-09-30        NaN
1420  D500 2012-10-07        NaN
1421  D500 2012-10-14        NaN
1422  D500 2012-10-21        NaN
1423  D500 2012-10-28        NaN
1424  D500 2012-11-04        NaN
1425  D500 2012-11-11        NaN
1426  D500 2012-11-18        NaN
3 Pad by Time
pad_by_time() can be used to add rows where timestamps are missing, for example when working with sales data that has gaps on weekends or holidays.
Help Doc Info: pad_by_time()
Use help(tk.pad_by_time) to review additional helpful documentation.
3.1 Basic Example
Let’s start with a basic example to see how pad_by_time() works. We’ll create some sample data with missing timestamps:
Code
# libraries
import pytimetk as tk
import pandas as pd
import numpy as np

# sample quarterly data with missing timestamp for Q3
dates = pd.to_datetime(["2021-01-01", "2021-04-01", "2021-10-01"])

df = pd.DataFrame({
    'date': dates,
    'value': range(len(dates))
})

df
        date  value
0 2021-01-01      0
1 2021-04-01      1
2 2021-10-01      2
Now we can use pad_by_time() to fill in the missing timestamp:
Code
# pad by time
df \
    .pad_by_time(
        date_column = 'date',
        freq        = 'QS' # specifying quarter start frequency
    )
        date  value
0 2021-01-01    0.0
1 2021-04-01    1.0
2 2021-07-01    NaN
3 2021-10-01    2.0
We can also specify a shorter time frequency:
Code
# pad by time with shorter frequency
df \
    .pad_by_time(
        date_column = 'date',
        freq        = 'MS' # specifying month start frequency
    ) \
    .assign(value = lambda x: x['value'].fillna(0)) # replace NaN with 0
        date  value
0 2021-01-01    0.0
1 2021-02-01    0.0
2 2021-03-01    0.0
3 2021-04-01    1.0
4 2021-05-01    0.0
5 2021-06-01    0.0
6 2021-07-01    0.0
7 2021-08-01    0.0
8 2021-09-01    0.0
9 2021-10-01    2.0
3.2 Pad by Time with Grouped Time Series
pad_by_time() can also be used with grouped time series. Let’s use the stocks_daily dataset to showcase an example:
To replace NaN with 0 in a dataframe with multiple columns:
Code
from functools import partial

# load the stocks_daily dataset
stocks_df = tk.load_dataset('stocks_daily', parse_dates = ['date'])

# columns to replace NaN with 0
cols_to_fill = ['open', 'high', 'low', 'close', 'volume', 'adjusted']

# define a function to fill NaN with 0 for a single column
def fill_na_col(df, col):
    return df[col].fillna(0)

# pad by time and replace NaN with 0
stocks_df \
    .groupby('symbol') \
    .pad_by_time(
        date_column = 'date',
        freq        = 'D'
    ) \
    .assign(symbol = lambda x: x['symbol'].ffill()) \
    .assign(**{col: partial(fill_na_col, col=col) for col in cols_to_fill})
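For intuition, the grouped padding itself can be approximated in plain pandas by reindexing each group to a complete daily range. This is a rough sketch on hypothetical toy data (not the stocks_daily dataset), showing gaps being filled with NaN rows per group:

```python
import pandas as pd

# two symbols, each with missing days in its date range
df = pd.DataFrame({
    "symbol": ["A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-03", "2021-01-02", "2021-01-05"]
    ),
    "close": [10.0, 11.0, 20.0, 21.0],
})

# pad each group's date index to daily frequency independently
padded = (
    df.set_index("date")
      .groupby("symbol")
      .apply(lambda g: g.drop(columns="symbol").asfreq("D"))
      .reset_index()
)
print(padded)
```

Each symbol is padded only within its own first-to-last date span, which mirrors how grouped pad_by_time() keeps the groups' date ranges independent.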