This section covers data wrangling for time series using pytimetk. We’ll show examples for the following functions:
summarize_by_time()
future_frame()
pad_by_time()
Prerequisite
Before proceeding, be sure to review the Timetk Basics section if you haven’t already.
1 Summarize by Time
summarize_by_time() aggregates time series data from a higher frequency (shorter time periods, e.g., daily) to a lower frequency (longer time periods, e.g., weekly).
Load Libraries & Data
Code
# import libraries
import pytimetk as tk
import pandas as pd
import numpy as np

# import data
m4_daily_df = tk.load_dataset('m4_daily', parse_dates = ['date'])

print(m4_daily_df.head())
print('\nLength of the full dataset:', len(m4_daily_df))
id date value
0 D10 2014-07-03 2076.2
1 D10 2014-07-04 2073.4
2 D10 2014-07-05 2048.7
3 D10 2014-07-06 2048.9
4 D10 2014-07-07 2006.4
Length of the full dataset: 9743
Help Doc Info: summarize_by_time
Use help(tk.summarize_by_time) to review additional helpful documentation.
1.1 Basic Example
The m4_daily dataset has a daily frequency. Say we are interested in forecasting at the weekly level. We can use summarize_by_time() to aggregate to a weekly level.
Code
# summarize by time: daily to weekly
summarized_df = m4_daily_df \
    .summarize_by_time(
        date_column  = 'date',
        value_column = 'value',
        freq         = 'W',
        agg_func     = 'sum'
    )

print(summarized_df.head())
print('\nLength of the full dataset:', len(summarized_df))
date value
0 1978-06-25 27328.12
1 1978-07-02 63621.88
2 1978-07-09 63334.38
3 1978-07-16 63737.51
4 1978-07-23 64718.76
Length of the full dataset: 1977
The data has now been aggregated at the weekly level. Notice we now have 1977 rows, compared to the full dataset, which had 9743 rows.
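As a point of reference, the same daily-to-weekly sum aggregation can be reproduced with plain pandas via resample('W'). This is a minimal sketch on synthetic data (not the m4_daily dataset), just to show the underlying operation:

```python
import pandas as pd
import numpy as np

# 14 days of synthetic daily data
daily = pd.DataFrame({
    "date": pd.date_range("2014-07-03", periods=14, freq="D"),
    "value": np.arange(14, dtype=float),
})

# aggregate daily values into week-ending-Sunday sums
weekly = (
    daily.set_index("date")
         .resample("W")["value"]
         .sum()
         .reset_index()
)
print(weekly)
```

summarize_by_time() wraps this kind of resampling while also handling grouped data, as shown in the next sections.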
1.2 Additional Aggregate Functions
summarize_by_time() can take additional aggregate functions in the agg_func argument.
Code
# summarize by time with additional aggregate functions
summarized_multiple_agg_df = m4_daily_df \
    .summarize_by_time(
        date_column  = 'date',
        value_column = 'value',
        freq         = 'W',
        agg_func     = ['sum', 'min', 'max']
    )

summarized_multiple_agg_df.head()
        date  value_sum  value_min  value_max
0 1978-06-25   27328.12    9103.12    9115.62
1 1978-07-02   63621.88    9046.88    9115.62
2 1978-07-09   63334.38    9028.12    9096.88
3 1978-07-16   63737.51    9075.00    9146.88
4 1978-07-23   64718.76    9171.88    9315.62
1.3 Summarize by Time with Grouped Time Series
summarize_by_time() also works with groups.
Code
# summarize by time with groups and additional aggregate functions
grouped_summarized_df = (
    m4_daily_df
        .groupby('id')
        .summarize_by_time(
            date_column  = 'date',
            value_column = 'value',
            freq         = 'W',
            agg_func     = [
                'sum',
                'min',
                ('q25', lambda x: np.quantile(x, 0.25)),
                'median',
                ('q75', lambda x: np.quantile(x, 0.75)),
                'max'
            ],
        )
)

grouped_summarized_df.head()
    id       date  value_sum  value_min  value_q25  value_median  value_q75  value_max
0  D10 2014-07-06     8247.2     2048.7    2048.85       2061.15    2074.10     2076.2
1  D10 2014-07-13    14040.8     1978.8    2003.95       2007.40    2013.80     2019.1
2  D10 2014-07-20    13867.6     1943.0    1955.30       1988.30    2005.60     2014.5
3  D10 2014-07-27    13266.3     1876.0    1887.15       1891.00    1895.85     1933.3
4  D10 2014-08-03    13471.2     1886.2    1914.60       1920.00    1939.55     1956.7
2 Future Frame
future_frame() can be used to extend time series data beyond the existing index (date). This is necessary when preparing a dataset for future predictions.
Help Doc Info: future_frame()
Use help(tk.future_frame) to review additional helpful documentation.
2.1 Basic Example
We’ll continue with our use of the m4_daily_df dataset. Recall we’ve already aggregated it at the weekly level (summarized_df). Let’s check out the last week in summarized_df:
Code
# last week in dataset
summarized_df \
    .sort_values(by = 'date', ascending = True) \
    .iloc[: -1] \
    .tail(1)
           date    value
1975 2016-05-01  17959.8
iloc()
iloc[: -1] is used to filter out the last row, keeping only dates that fall at the start of the week.
We can see that the last week is the week of 2016-05-01. Now say we want to forecast the next 8 weeks. We can extend the dataset beyond the week of 2016-05-01:
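The code that creates summarized_extended_df is not shown here; based on the surrounding text it would presumably be a call like summarized_df.future_frame(date_column='date', length_out=8). A pandas-only sketch of what that extension does, using a small stand-in for summarized_df (so the exact dates below are illustrative, not the real m4_daily output):

```python
import pandas as pd

# small stand-in for the weekly summarized_df (last week 2016-05-01)
summarized_df = pd.DataFrame({
    "date": pd.to_datetime(["2016-04-17", "2016-04-24", "2016-05-01"]),
    "value": [18000.0, 18100.0, 17959.8],
})

# conceptually, future_frame(date_column='date', length_out=8) appends
# 8 future weekly timestamps with missing values
future_dates = pd.date_range(
    start=summarized_df["date"].max(), periods=9, freq="W"
)[1:]  # drop the first element, which is the existing last date

summarized_extended_df = pd.concat(
    [summarized_df, pd.DataFrame({"date": future_dates})],
    ignore_index=True,
)
print(summarized_extended_df.tail(8))
```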
To get only the future data, we can filter the dataset for where value is missing (np.nan).
Code
# get only future data
summarized_extended_df \
    .query('value.isna()')
           date  value
1977 2016-05-15    NaN
1978 2016-05-22    NaN
1979 2016-05-29    NaN
1980 2016-06-05    NaN
1981 2016-06-12    NaN
1982 2016-06-19    NaN
1983 2016-06-26    NaN
1984 2016-07-03    NaN
2.2 Future Frame with Grouped Time Series
future_frame() also works for grouped time series. We can see an example using our grouped summarized dataset (grouped_summarized_df) from earlier:
Code
# future frame with grouped time series
grouped_summarized_df[['id', 'date', 'value_sum']] \
    .groupby('id') \
    .future_frame(
        date_column = 'date',
        length_out  = 8
    ) \
    .query('value_sum.isna()') # filtering to return only the future data
        id       date  value_sum
1395   D10 2016-05-15        NaN
1396   D10 2016-05-22        NaN
1397   D10 2016-05-29        NaN
1398   D10 2016-06-05        NaN
1399   D10 2016-06-12        NaN
1400   D10 2016-06-19        NaN
1401   D10 2016-06-26        NaN
1402   D10 2016-07-03        NaN
1403  D160 2011-07-10        NaN
1404  D160 2011-07-17        NaN
1405  D160 2011-07-24        NaN
1406  D160 2011-07-31        NaN
1407  D160 2011-08-07        NaN
1408  D160 2011-08-14        NaN
1409  D160 2011-08-21        NaN
1410  D160 2011-08-28        NaN
1411  D410 1980-05-11        NaN
1412  D410 1980-05-18        NaN
1413  D410 1980-05-25        NaN
1414  D410 1980-06-01        NaN
1415  D410 1980-06-08        NaN
1416  D410 1980-06-15        NaN
1417  D410 1980-06-22        NaN
1418  D410 1980-06-29        NaN
1419  D500 2012-09-30        NaN
1420  D500 2012-10-07        NaN
1421  D500 2012-10-14        NaN
1422  D500 2012-10-21        NaN
1423  D500 2012-10-28        NaN
1424  D500 2012-11-04        NaN
1425  D500 2012-11-11        NaN
1426  D500 2012-11-18        NaN
3 Pad by Time
pad_by_time() can be used to add rows where timestamps are missing, for example when working with sales data that has gaps on weekends or holidays.
Help Doc Info: pad_by_time()
Use help(tk.pad_by_time) to review additional helpful documentation.
3.1 Basic Example
Let’s start with a basic example to see how pad_by_time() works. We’ll create some sample data with missing timestamps:
Code
# libraries
import pytimetk as tk
import pandas as pd
import numpy as np

# sample quarterly data with missing timestamp for Q3
dates = pd.to_datetime(["2021-01-01", "2021-04-01", "2021-10-01"])

df = pd.DataFrame({
    'date': dates,
    'value': range(len(dates))
})

df
        date  value
0 2021-01-01      0
1 2021-04-01      1
2 2021-10-01      2
Now we can use pad_by_time() to fill in the missing timestamp:
Code
# pad by time
df \
    .pad_by_time(
        date_column = 'date',
        freq        = 'QS' # specifying quarter start frequency
    )
        date  value
0 2021-01-01    0.0
1 2021-04-01    1.0
2 2021-07-01    NaN
3 2021-10-01    2.0
We can also specify a shorter time frequency:
Code
# pad by time with shorter frequency
df \
    .pad_by_time(
        date_column = 'date',
        freq        = 'MS' # specifying month start frequency
    ) \
    .assign(value = lambda x: x['value'].fillna(0)) # replace NaN with 0
        date  value
0 2021-01-01    0.0
1 2021-02-01    0.0
2 2021-03-01    0.0
3 2021-04-01    1.0
4 2021-05-01    0.0
5 2021-06-01    0.0
6 2021-07-01    0.0
7 2021-08-01    0.0
8 2021-09-01    0.0
9 2021-10-01    2.0
3.2 Pad by Time with Grouped Time Series
pad_by_time() can also be used with grouped time series. Let’s use the stocks_daily dataset to showcase an example:
To replace NaN with 0 in a dataframe with multiple columns:
Code
from functools import partial

# load the stocks_daily dataset
stocks_df = tk.load_dataset('stocks_daily', parse_dates = ['date'])

# columns to replace NaN with 0
cols_to_fill = ['open', 'high', 'low', 'close', 'volume', 'adjusted']

# define a function to fill NaN with 0 for a single column
def fill_na_col(df, col):
    return df[col].fillna(0)

# pad by time and replace NaN with 0
stocks_df \
    .groupby('symbol') \
    .pad_by_time(
        date_column = 'date',
        freq        = 'D'
    ) \
    .assign(symbol = lambda x: x['symbol'].ffill()) \
    .assign(**{col: partial(fill_na_col, col=col) for col in cols_to_fill})
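For intuition, the grouped padding itself can be approximated in plain pandas by reindexing each group to a complete daily range. This is a rough sketch on hypothetical toy data (not the stocks_daily dataset), showing gaps being filled with NaN rows per group:

```python
import pandas as pd

# two symbols, each with missing days in its date range
df = pd.DataFrame({
    "symbol": ["A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-03", "2021-01-02", "2021-01-05"]
    ),
    "close": [10.0, 11.0, 20.0, 21.0],
})

# pad each group's date index to daily frequency independently
padded = (
    df.set_index("date")
      .groupby("symbol")
      .apply(lambda g: g.drop(columns="symbol").asfreq("D"))
      .reset_index()
)
print(padded)
```

Each symbol is padded only within its own first-to-last date span, which mirrors how grouped pad_by_time() keeps the groups' date ranges independent.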