Data Wrangling

This section covers data wrangling for time series using pytimetk. We'll show examples for the following functions:

- summarize_by_time()
- future_frame()
- pad_by_time()

Prerequisite

Before proceeding, be sure to review the Timetk Basics section if you haven’t already.

1 Summarize by Time

summarize_by_time() aggregates time series data from a higher frequency (shorter time periods) to a lower frequency (longer time periods), for example from daily to weekly.

Load Libraries & Data

Code
# import libraries
import pytimetk as tk
import pandas as pd
import numpy as np

# import data
m4_daily_df = tk.load_dataset('m4_daily', parse_dates = ['date'])

print(m4_daily_df.head())
print('\nLength of the full dataset:', len(m4_daily_df))
    id       date   value
0  D10 2014-07-03  2076.2
1  D10 2014-07-04  2073.4
2  D10 2014-07-05  2048.7
3  D10 2014-07-06  2048.9
4  D10 2014-07-07  2006.4

Length of the full dataset: 9743
Help Doc Info: summarize_by_time()

Use help(tk.summarize_by_time) to review additional helpful documentation.

1.1 Basic Example

The m4_daily dataset has a daily frequency. Say we are interested in forecasting at the weekly level. We can use summarize_by_time() to aggregate to a weekly level.

Code
# summarize by time: daily to weekly
summarized_df = m4_daily_df \
    .summarize_by_time(
        date_column  = 'date',
        value_column = 'value',
        freq         = 'W',
        agg_func     = 'sum'
    )

print(summarized_df.head())
print('\nLength of the full dataset:', len(summarized_df))
        date     value
0 1978-06-25  27328.12
1 1978-07-02  63621.88
2 1978-07-09  63334.38
3 1978-07-16  63737.51
4 1978-07-23  64718.76

Length of the full dataset: 1977

The data has now been aggregated to the weekly level. Notice we now have 1977 rows, compared to the full dataset, which had 9743 rows.
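
The freq argument accepts pandas offset aliases, so changing the aggregation level is a one-line change. As a quick sketch (reusing m4_daily_df from above), here is the same aggregation at the monthly level using the 'MS' (month start) alias:

Code
# summarize by time: daily to monthly
monthly_summarized_df = m4_daily_df \
    .summarize_by_time(
        date_column  = 'date',
        value_column = 'value',
        freq         = 'MS', # month start; any pandas offset alias works
        agg_func     = 'sum'
    )

monthly_summarized_df.head()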

1.2 Additional Aggregate Functions

summarize_by_time() can take additional aggregate functions in the agg_func argument.

Code
# summarize by time with additional aggregate functions
summarized_multiple_agg_df = m4_daily_df \
    .summarize_by_time(
        date_column  = 'date',
        value_column = 'value',
        freq         = 'W',
        agg_func     = ['sum', 'min', 'max']
    )

summarized_multiple_agg_df.head()
        date  value_sum  value_min  value_max
0 1978-06-25   27328.12    9103.12    9115.62
1 1978-07-02   63621.88    9046.88    9115.62
2 1978-07-09   63334.38    9028.12    9096.88
3 1978-07-16   63737.51    9075.00    9146.88
4 1978-07-23   64718.76    9171.88    9315.62

1.3 Summarize by Time with Grouped Time Series

summarize_by_time() also works with groups.

Code
# summarize by time with groups and additional aggregate functions
grouped_summarized_df = (
    m4_daily_df
        .groupby('id')
        .summarize_by_time(
            date_column  = 'date',
            value_column = 'value',
            freq         = 'W',
            agg_func     = [
                'sum',
                'min',
                ('q25', lambda x: np.quantile(x, 0.25)),
                'median',
                ('q75', lambda x: np.quantile(x, 0.75)),
                'max'
            ],
        )
)

grouped_summarized_df.head()
    id       date  value_sum  value_min  value_q25  value_median  value_q75  value_max
0  D10 2014-07-06     8247.2     2048.7    2048.85       2061.15    2074.10     2076.2
1  D10 2014-07-13    14040.8     1978.8    2003.95       2007.40    2013.80     2019.1
2  D10 2014-07-20    13867.6     1943.0    1955.30       1988.30    2005.60     2014.5
3  D10 2014-07-27    13266.3     1876.0    1887.15       1891.00    1895.85     1933.3
4  D10 2014-08-03    13471.2     1886.2    1914.60       1920.00    1939.55     1956.7
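
For intuition, the grouped weekly sum above behaves much like a grouped pandas resample. A rough pandas-only equivalent, shown for comparison only (this is not pytimetk's actual implementation):

Code
# rough pandas equivalent of a grouped weekly sum (comparison only)
m4_daily_df \
    .groupby('id') \
    .resample('W', on = 'date')['value'] \
    .sum() \
    .reset_index()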

2 Future Frame

future_frame() can be used to extend time series data beyond the existing index (date). This is necessary when making future predictions.

Help Doc Info: future_frame()

Use help(tk.future_frame) to review additional helpful documentation.

2.1 Basic Example

We’ll continue with our use of the m4_daily_df dataset. Recall we’ve already aggregated to the weekly level (summarized_df). Let’s check out the last week in summarized_df:

Code
# last week in dataset
summarized_df \
    .sort_values(by = 'date', ascending = True) \
    .iloc[: -1] \
    .tail(1)
           date    value
1975 2016-05-01  17959.8
iloc()

iloc[: -1] is used to filter out the last row, which may represent an incomplete week.

We can see that the last week is the week of 2016-05-01. Now say we want to forecast the next 8 weeks. We can extend the dataset beyond the week of 2016-05-01:

Code
# extend dataset by 8 weeks
summarized_extended_df = summarized_df \
    .future_frame(
        date_column = 'date',
        length_out  = 8
    )

summarized_extended_df
date value
0 1978-06-25 27328.12
1 1978-07-02 63621.88
2 1978-07-09 63334.38
3 1978-07-16 63737.51
4 1978-07-23 64718.76
... ... ...
1980 2016-06-05 NaN
1981 2016-06-12 NaN
1982 2016-06-19 NaN
1983 2016-06-26 NaN
1984 2016-07-03 NaN

1985 rows × 2 columns

To get only the future data, we can filter the dataset for rows where value is missing (np.nan).

Code
# get only future data
summarized_extended_df \
    .query('value.isna()')
date value
1977 2016-05-15 NaN
1978 2016-05-22 NaN
1979 2016-05-29 NaN
1980 2016-06-05 NaN
1981 2016-06-12 NaN
1982 2016-06-19 NaN
1983 2016-06-26 NaN
1984 2016-07-03 NaN
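
If you prefer not to rely on query()'s string expressions, the same filter can be written with a plain boolean mask:

Code
# equivalent filter using a boolean mask
summarized_extended_df[summarized_extended_df['value'].isna()]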

2.2 Future Frame with Grouped Time Series

future_frame() also works for grouped time series. We can see an example using our grouped summarized dataset (grouped_summarized_df) from earlier:

Code
# future frame with grouped time series
grouped_summarized_df[['id', 'date', 'value_sum']] \
    .groupby('id') \
    .future_frame(
        date_column = 'date',
        length_out  = 8
    ) \
    .query('value_sum.isna()') # filtering to return only the future data
id date value_sum
1395 D10 2016-05-15 NaN
1396 D10 2016-05-22 NaN
1397 D10 2016-05-29 NaN
1398 D10 2016-06-05 NaN
1399 D10 2016-06-12 NaN
1400 D10 2016-06-19 NaN
1401 D10 2016-06-26 NaN
1402 D10 2016-07-03 NaN
1403 D160 2011-07-10 NaN
1404 D160 2011-07-17 NaN
1405 D160 2011-07-24 NaN
1406 D160 2011-07-31 NaN
1407 D160 2011-08-07 NaN
1408 D160 2011-08-14 NaN
1409 D160 2011-08-21 NaN
1410 D160 2011-08-28 NaN
1411 D410 1980-05-11 NaN
1412 D410 1980-05-18 NaN
1413 D410 1980-05-25 NaN
1414 D410 1980-06-01 NaN
1415 D410 1980-06-08 NaN
1416 D410 1980-06-15 NaN
1417 D410 1980-06-22 NaN
1418 D410 1980-06-29 NaN
1419 D500 2012-09-30 NaN
1420 D500 2012-10-07 NaN
1421 D500 2012-10-14 NaN
1422 D500 2012-10-21 NaN
1423 D500 2012-10-28 NaN
1424 D500 2012-11-04 NaN
1425 D500 2012-11-11 NaN
1426 D500 2012-11-18 NaN

3 Pad by Time

pad_by_time() can be used to add rows where timestamps are missing. This is useful, for example, when working with sales data that has missing records on weekends or holidays.

Help Doc Info: pad_by_time()

Use help(tk.pad_by_time) to review additional helpful documentation.

3.1 Basic Example

Let’s start with a basic example to see how pad_by_time() works. We’ll create some sample data with missing timestamps:

Code
# libraries
import pytimetk as tk
import pandas as pd
import numpy as np

# sample quarterly data with missing timestamp for Q3
dates = pd.to_datetime(["2021-01-01", "2021-04-01", "2021-10-01"])
value = range(len(dates))

df = pd.DataFrame({
    'date': dates,
    'value': value
})

df
date value
0 2021-01-01 0
1 2021-04-01 1
2 2021-10-01 2

Now we can use pad_by_time() to fill in the missing timestamp:

Code
# pad by time
df \
    .pad_by_time(
        date_column = 'date',
        freq        = 'QS' # specifying quarter start frequency
    )
date value
0 2021-01-01 0.0
1 2021-04-01 1.0
2 2021-07-01 NaN
3 2021-10-01 2.0

We can also specify a shorter time frequency:

Code
# pad by time with shorter frequency
df \
    .pad_by_time(
        date_column = 'date',
        freq        = 'MS' # specifying month start frequency
    ) \
    .assign(value = lambda x: x['value'].fillna(0)) # replace NaN with 0
date value
0 2021-01-01 0.0
1 2021-02-01 0.0
2 2021-03-01 0.0
3 2021-04-01 1.0
4 2021-05-01 0.0
5 2021-06-01 0.0
6 2021-07-01 0.0
7 2021-08-01 0.0
8 2021-09-01 0.0
9 2021-10-01 2.0
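
For intuition, padding a single (ungrouped) series is similar to pandas' asfreq(). A rough equivalent of the quarterly example, shown for comparison only:

Code
# rough pandas equivalent of quarter-start padding (comparison only)
df \
    .set_index('date') \
    .asfreq('QS') \
    .reset_index()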

3.2 Pad by Time with Grouped Time Series

pad_by_time() can also be used with grouped time series. Let’s use the stocks_daily dataset to showcase an example:

Code
# load dataset
stocks_df = tk.load_dataset('stocks_daily', parse_dates = ['date'])

# pad by time
stocks_df \
    .groupby('symbol') \
    .pad_by_time(
        date_column = 'date',
        freq        = 'D'
    ) \
    .assign(id = lambda x: x['symbol'].ffill()) # fill the group label into padded rows
symbol date open high low close volume adjusted id
0 AAPL 2013-01-02 19.779285 19.821428 19.343929 19.608213 560518000.0 16.791180 AAPL
1 AAPL 2013-01-03 19.567142 19.631071 19.321428 19.360714 352965200.0 16.579241 AAPL
2 AAPL 2013-01-04 19.177500 19.236786 18.779642 18.821428 594333600.0 16.117437 AAPL
3 AAPL 2013-01-05 NaN NaN NaN NaN NaN NaN AAPL
4 AAPL 2013-01-06 NaN NaN NaN NaN NaN NaN AAPL
... ... ... ... ... ... ... ... ... ...
23485 NVDA 2023-09-17 NaN NaN NaN NaN NaN NaN NVDA
23486 NVDA 2023-09-18 427.480011 442.420013 420.000000 439.660004 50027100.0 439.660004 NVDA
23487 NVDA 2023-09-19 438.329987 439.660004 430.019989 435.200012 37306400.0 435.200012 NVDA
23488 NVDA 2023-09-20 436.000000 439.029999 422.230011 422.390015 36710800.0 422.390015 NVDA
23489 NVDA 2023-09-21 415.829987 421.000000 409.799988 410.170013 44893000.0 410.170013 NVDA

23490 rows × 9 columns

To replace NaN with 0 in a dataframe with multiple columns:

Code
from functools import partial

# columns to replace NaN with 0
cols_to_fill = ['open', 'high', 'low', 'close', 'volume', 'adjusted']

# define a function to fillna
def fill_na_col(df, col):
    return df[col].fillna(0)

# pad by time and replace NaN with 0
stocks_df \
    .groupby('symbol') \
    .pad_by_time(
        date_column = 'date',
        freq        = 'D'
    ) \
    .assign(id = lambda x: x['symbol'].ffill()) \
    .assign(**{col: partial(fill_na_col, col=col) for col in cols_to_fill})
symbol date open high low close volume adjusted id
0 AAPL 2013-01-02 19.779285 19.821428 19.343929 19.608213 560518000.0 16.791180 AAPL
1 AAPL 2013-01-03 19.567142 19.631071 19.321428 19.360714 352965200.0 16.579241 AAPL
2 AAPL 2013-01-04 19.177500 19.236786 18.779642 18.821428 594333600.0 16.117437 AAPL
3 AAPL 2013-01-05 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 AAPL
4 AAPL 2013-01-06 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 AAPL
... ... ... ... ... ... ... ... ... ...
23485 NVDA 2023-09-17 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 NVDA
23486 NVDA 2023-09-18 427.480011 442.420013 420.000000 439.660004 50027100.0 439.660004 NVDA
23487 NVDA 2023-09-19 438.329987 439.660004 430.019989 435.200012 37306400.0 435.200012 NVDA
23488 NVDA 2023-09-20 436.000000 439.029999 422.230011 422.390015 36710800.0 422.390015 NVDA
23489 NVDA 2023-09-21 415.829987 421.000000 409.799988 410.170013 44893000.0 410.170013 NVDA

23490 rows × 9 columns
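
Filling with 0 makes sense for quantities like volume, but usually not for prices. A common alternative is to forward-fill each symbol's last observed values. A sketch of that approach, reusing cols_to_fill from above:

Code
# pad by time, then forward-fill each symbol's last observed values
padded_df = stocks_df \
    .groupby('symbol') \
    .pad_by_time(
        date_column = 'date',
        freq        = 'D'
    ) \
    .assign(id = lambda x: x['symbol'].ffill()) # fill the group label into padded rows

padded_df[cols_to_fill] = padded_df.groupby('id')[cols_to_fill].ffill()

padded_df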

4 Next Steps

Check out the Adding Features (Augmenting) Time Series Data Guide next.

5 More Coming Soon…

We are in the early stages of development, but the potential of pytimetk in Python is already clear. 🐍