PyTimeTK has one mission: To make time series analysis simpler, easier, and faster in Python. This goal requires some opinionated ways of treating time series in Python. We will conceptually lay out how pytimetk can help.
How this guide benefits you
This guide covers how to use pytimetk conceptually. Once you understand key concepts, you can go from basic to advanced time series analysis very fast.
Let’s first start with how to think about time series data conceptually. Time series data has 3 core properties.
1 The 3 Core Properties of Time Series Data
Every time series DataFrame should have the following properties:
Time Series Index: A column containing ‘datetime64’ time stamps.
Value Columns: One or more columns containing numeric data that can be aggregated and visualized by time
Group Columns (Optional): One or more categorical or str columns that can be grouped by, so the time series can be evaluated by group.
In practice here’s what this looks like using the “m4_daily” dataset:
Code
# Import packages
import pytimetk as tk
import pandas as pd
import numpy as np

# Import a Time Series Data Set
m4_daily_df = tk.load_dataset("m4_daily", parse_dates = ['date'])
m4_daily_df
        id       date   value
0      D10 2014-07-03  2076.2
1      D10 2014-07-04  2073.4
2      D10 2014-07-05  2048.7
3      D10 2014-07-06  2048.9
4      D10 2014-07-07  2006.4
...    ...        ...     ...
9738  D500 2012-09-19  9418.8
9739  D500 2012-09-20  9365.7
9740  D500 2012-09-21  9445.9
9741  D500 2012-09-22  9497.9
9742  D500 2012-09-23  9545.3

[9743 rows x 3 columns]
The 3 Core Properties of Time Series Data (Example: m4_daily dataset)
We can see that the m4_daily dataset has:
Time Series Index: The date column
Value Column(s): The value column
Group Column(s): The id column
Missing any of the 3 Core Properties of Time Series Data
If your data is not formatted properly for pytimetk (i.e., it is missing datetime, numeric value, or grouping columns), this will limit your ability to use pytimetk for time series analysis.
No Pandas Index, No Problem
Timetk standardizes on using a date column. This reduces friction when converting to other package formats like polars, which doesn't use an index (each row is indexed by its integer position).
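If your data carries its timestamps in a pandas DatetimeIndex rather than a column, moving them into a date column is one reset_index() away. A minimal sketch (the variable names here are illustrative, not from pytimetk):

```python
import pandas as pd

# Data with a DatetimeIndex rather than a date column
df_indexed = pd.DataFrame(
    {'value': [1.0, 2.0, 3.0]},
    index = pd.date_range('2023-01-01', periods = 3, freq = 'D'),
)
df_indexed.index.name = 'date'

# Convert to the column-based format pytimetk expects
df_flat = df_indexed.reset_index()
print(df_flat.columns.tolist())  # ['date', 'value']
```

After this, the timestamps live in an ordinary datetime64 column and every row is indexed by its integer position, matching the convention above.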
2 The 2 Ways that Timetk Makes Time Series Analysis Easier
2 Types of Time Series Functions
Pandas DataFrame Operations
Pandas Series Operations
Timetk contains a number of functions designed to make time series analysis operations easier. In general, these operations come in 2 types of time series functions:
Pandas DataFrame Operations: These functions work on pd.DataFrame objects and derivatives such as groupby() objects for Grouped Time Series Analysis. You will see data as the first parameter in these functions.
Pandas Series Operations: These functions work on pd.Series objects and come in 2 flavors:
Time Series Index Operations: Designed for time series indexes. You will see idx as the first parameter of these functions. These functions also work with datetime64 values (e.g., those produced when you parse dates via pd.read_csv() or create time series with pd.date_range()).
Numeric Operations: Designed for numeric values. You will see x as the first parameter of these functions.
Let’s take a look at how to use the different types of Time Series Analysis functions in pytimetk. We’ll start with Type 1: Pandas DataFrame Operations.
2.1 Type 1: Pandas DataFrame Operations
Before we start using pytimetk, let’s make sure our data is set up properly.
Timetk Data Format Compliance
3 Core Properties Must Be Upheld
A pytimetk-Compliant Pandas DataFrame must have:
Time Series Index: A Time Stamp column containing datetime64 values
Value Column(s): The value column(s) containing float or int values
Group Column(s): Optionally, for grouped time series analysis, one or more columns containing str or categorical values (shown as object dtype)
If these are NOT upheld, this will impact your ability to use pytimetk DataFrame operations.
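A quick programmatic compliance check can be sketched with pandas' dtype helpers. The check_compliance() helper below is hypothetical (it is not a pytimetk API), but it shows the dtype conditions the 3 core properties require:

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

def check_compliance(df, date_column, value_columns):
    """Hypothetical helper (not a pytimetk API): report violations
    of the core dtype requirements for the given columns."""
    problems = []
    if not is_datetime64_any_dtype(df[date_column]):
        problems.append(f"'{date_column}' is not datetime64")
    for col in value_columns:
        if not is_numeric_dtype(df[col]):
            problems.append(f"'{col}' is not numeric")
    return problems

# A tiny compliant frame mirroring m4_daily's layout
df = pd.DataFrame({
    'date':  pd.to_datetime(['2014-07-03', '2014-07-04']),
    'value': [2076.2, 2073.4],
    'id':    ['D10', 'D10'],
})
print(check_compliance(df, 'date', ['value']))  # no problems reported
```

If the date column were left as strings (a common result of reading a CSV without parse_dates), the helper would flag it, which is exactly the kind of issue that breaks pytimetk DataFrame operations.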
Inspect the DataFrame
Use the tk.glimpse() method to check compliance.
Using pytimetk's glimpse() method, we can see that we have a compliant DataFrame with a date column containing datetime64 values and a value column containing float64 values. For grouped analysis, we have the id column containing object dtype.
Code
# Tip: Inspect for compliance with glimpse()
m4_daily_df.glimpse()
Grouped Time Series Analysis with Summarize By Time
First, inspect how the summarize_by_time function works by calling help().
Code
# Review the summarize_by_time documentation (output not shown)
help(tk.summarize_by_time)
Help Doc Info: summarize_by_time()
The first parameter is data, indicating this is a DataFrame operation.
The Examples section shows different use cases for applying the function to a DataFrame.
Let’s test the summarize_by_time() DataFrame operation using the grouped approach with method chaining. DataFrame operations can be used as pandas methods with method chaining, which lets us apply time series operations more succinctly.
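Under the hood, a grouped summarize_by_time() is equivalent to a pandas groupby-plus-resample. A pandas-only sketch of the same idea (the toy data, quarterly frequency, and aggregation list here are illustrative, not the exact call from this guide):

```python
import pandas as pd
import numpy as np

# Toy stand-in for m4_daily: two ids, 120 daily observations each
df = pd.DataFrame({
    'id':    ['D10'] * 120 + ['D500'] * 120,
    'date':  list(pd.date_range('2014-01-01', periods = 120, freq = 'D')) * 2,
    'value': np.arange(240, dtype = float),
})

# Pandas equivalent of a grouped summarize_by_time() at Quarter Start
summary = (
    df
        .set_index('date')
        .groupby('id')
        .resample('QS')['value']
        .agg(['mean', 'median', 'min', 'max'])
        .reset_index()
)
print(summary)
```

The result has one row per (id, quarter) combination, which mirrors the pytimetk output described below; summarize_by_time() wraps this pattern in one chainable call.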
Key Takeaways: summarize_by_time()
The data must comply with the 3 core properties (date column, value column(s), and group column(s))
The aggregation functions were applied by combination of group (id) and resample (Quarter Start)
The result was a pandas DataFrame with the group column, a resampled date column, and summary values (mean, median, min, 25th quantile, etc.)
Another DataFrame Example: Creating 29 Engineered Features
Let’s examine another DataFrame function, tk.augment_timeseries_signature(). Feel free to inspect the documentation with help(tk.augment_timeseries_signature).
Code
# Creating 29 engineered features from the date column
# Not run: help(tk.augment_timeseries_signature)
df_augmented = (
    m4_daily_df
        .augment_timeseries_signature(date_column = 'date')
)
df_augmented.head()
    id       date   value  date_index_num  date_year  date_year_iso  date_yearstart  date_yearend  date_leapyear  date_half  ...  date_mday  date_qday  date_yday  date_weekend  date_hour  date_minute  date_second  date_msecond  date_nsecond  date_am_pm
0  D10 2014-07-03  2076.2      1404345600       2014           2014               0             0              0          2  ...          3          3        184             0          0            0            0             0             0          am
1  D10 2014-07-04  2073.4      1404432000       2014           2014               0             0              0          2  ...          4          4        185             0          0            0            0             0             0          am
2  D10 2014-07-05  2048.7      1404518400       2014           2014               0             0              0          2  ...          5          5        186             0          0            0            0             0             0          am
3  D10 2014-07-06  2048.9      1404604800       2014           2014               0             0              0          2  ...          6          6        187             1          0            0            0             0             0          am
4  D10 2014-07-07  2006.4      1404691200       2014           2014               0             0              0          2  ...          7          7        188             0          0            0            0             0             0          am

[5 rows x 32 columns]
Key Takeaways: augment_timeseries_signature()
The data must comply with 1 of the 3 core properties (the date column)
The result was a pandas DataFrame with 29 time series features that can be used for Machine Learning and Forecasting
Making Future Dates with Future Frame
A common time series task before forecasting with machine learning models is to make a future DataFrame some length_out into the future. You can do this with tk.future_frame(). Here’s how.
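Conceptually, future_frame() appends length_out new rows per group, extending each group's date sequence at its observed frequency and leaving the value column empty. A pandas-only sketch of that idea (the toy data and 3-day horizon are illustrative):

```python
import pandas as pd

# Toy stand-in for m4_daily: two ids with a couple of daily observations
df = pd.DataFrame({
    'id':    ['D10', 'D10', 'D500', 'D500'],
    'date':  pd.to_datetime(['2014-07-03', '2014-07-04', '2012-09-19', '2012-09-20']),
    'value': [2076.2, 2073.4, 9418.8, 9365.7],
})

# For each group, append 3 future daily rows with no value (NaN)
pieces = []
for key, g in df.groupby('id'):
    future_dates = pd.date_range(
        g['date'].max() + pd.Timedelta(days = 1), periods = 3, freq = 'D'
    )
    future = pd.DataFrame({'id': key, 'date': future_dates})
    pieces.append(pd.concat([g, future], ignore_index = True))

full = pd.concat(pieces, ignore_index = True)
print(full)
```

The appended rows have NaN in the value column, which is how the future observations are identified later when we filter for them.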
Code
# Preparing a time series data set for Machine Learning Forecasting
full_augmented_df = (
    m4_daily_df
        .groupby('id')
        .future_frame('date', length_out = 365)
        .augment_timeseries_signature('date')
)
full_augmented_df
         id       date   value  date_index_num  date_year  date_year_iso  date_yearstart  date_yearend  date_leapyear  date_half  ...  date_mday  date_qday  date_yday  date_weekend  date_hour  date_minute  date_second  date_msecond  date_nsecond  date_am_pm
0       D10 2014-07-03  2076.2      1404345600       2014           2014               0             0              0          2  ...          3          3        184             0          0            0            0             0             0          am
1       D10 2014-07-04  2073.4      1404432000       2014           2014               0             0              0          2  ...          4          4        185             0          0            0            0             0             0          am
2       D10 2014-07-05  2048.7      1404518400       2014           2014               0             0              0          2  ...          5          5        186             0          0            0            0             0             0          am
3       D10 2014-07-06  2048.9      1404604800       2014           2014               0             0              0          2  ...          6          6        187             1          0            0            0             0             0          am
4       D10 2014-07-07  2006.4      1404691200       2014           2014               0             0              0          2  ...          7          7        188             0          0            0            0             0             0          am
...     ...        ...     ...             ...        ...            ...             ...           ...            ...        ...  ...        ...        ...        ...           ...        ...          ...          ...           ...           ...         ...
11198  D500 2013-09-19     NaN      1379548800       2013           2013               0             0              0          2  ...         19         81        262             0          0            0            0             0             0          am
11199  D500 2013-09-20     NaN      1379635200       2013           2013               0             0              0          2  ...         20         82        263             0          0            0            0             0             0          am
11200  D500 2013-09-21     NaN      1379721600       2013           2013               0             0              0          2  ...         21         83        264             0          0            0            0             0             0          am
11201  D500 2013-09-22     NaN      1379808000       2013           2013               0             0              0          2  ...         22         84        265             1          0            0            0             0             0          am
11202  D500 2013-09-23     NaN      1379894400       2013           2013               0             0              0          2  ...         23         85        266             0          0            0            0             0             0          am

[11203 rows x 32 columns]
We can then isolate the future data by filtering for rows where the value column is missing (np.nan).
Code
# Get the future data (just the observations that haven't happened yet)
future_df = (
    full_augmented_df
        .query('value.isna()')
)
future_df
         id       date  value  date_index_num  date_year  date_year_iso  date_yearstart  date_yearend  date_leapyear  date_half  ...  date_mday  date_qday  date_yday  date_weekend  date_hour  date_minute  date_second  date_msecond  date_nsecond  date_am_pm
9743    D10 2016-05-07    NaN      1462579200       2016           2016               0             0              1          1  ...          7         37        128             0          0            0            0             0             0          am
9744    D10 2016-05-08    NaN      1462665600       2016           2016               0             0              1          1  ...          8         38        129             1          0            0            0             0             0          am
9745    D10 2016-05-09    NaN      1462752000       2016           2016               0             0              1          1  ...          9         39        130             0          0            0            0             0             0          am
9746    D10 2016-05-10    NaN      1462838400       2016           2016               0             0              1          1  ...         10         40        131             0          0            0            0             0             0          am
9747    D10 2016-05-11    NaN      1462924800       2016           2016               0             0              1          1  ...         11         41        132             0          0            0            0             0             0          am
...     ...        ...    ...             ...        ...            ...             ...           ...            ...        ...  ...        ...        ...        ...           ...        ...          ...          ...           ...           ...         ...
11198  D500 2013-09-19    NaN      1379548800       2013           2013               0             0              0          2  ...         19         81        262             0          0            0            0             0             0          am
11199  D500 2013-09-20    NaN      1379635200       2013           2013               0             0              0          2  ...         20         82        263             0          0            0            0             0             0          am
11200  D500 2013-09-21    NaN      1379721600       2013           2013               0             0              0          2  ...         21         83        264             0          0            0            0             0             0          am
11201  D500 2013-09-22    NaN      1379808000       2013           2013               0             0              0          2  ...         22         84        265             1          0            0            0             0             0          am
11202  D500 2013-09-23    NaN      1379894400       2013           2013               0             0              0          2  ...         23         85        266             0          0            0            0             0             0          am

[1460 rows x 32 columns]
2.2 Type 2: Pandas Series Operations
The main difference between a DataFrame operation and a Series operation is that we are operating on an array of values, typically one of the following dtypes:
Timestamps (datetime64)
Numeric (float64 or int64)
The first argument of Series operations that operate on Timestamps will always be idx.
Let’s take a look at one, shall we? We’ll start with a common action: making a future time series from an existing time series with a regular frequency.
The Make Future Time Series Function
Say we have a monthly sequence of timestamps. What if we want to create a forecast where we predict 12 months into the future? Well, we will need to create 12 future timestamps. Here’s how.
First, create a pd.date_range() with dates starting at the beginning of each month.
Code
# Make a monthly date range
dates_dt = pd.date_range("2023-01", "2024-01", freq = "MS")
dates_dt
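The step that produces future_dt (used in the next code block) is pytimetk's make future time series function, tk.make_future_timeseries(). Since that call isn't shown above, here is a pandas-only sketch of the same idea: infer the index's frequency and extend it 12 periods past its last timestamp (variable names mirror the document's):

```python
import pandas as pd

# The monthly date range from above
dates_dt = pd.date_range("2023-01", "2024-01", freq = "MS")

# Infer the frequency and extend 12 periods past the last timestamp
freq = pd.infer_freq(dates_dt)
future_dt = pd.date_range(
    start   = dates_dt[-1] + pd.tseries.frequencies.to_offset(freq),
    periods = 12,
    freq    = freq,
)
print(future_dt)
```

The result is 12 future month-start timestamps (2024-02-01 through 2025-01-01), ready to be combined with the actuals below.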
We can combine the actual and future timestamps into one combined time series.
Code
# Combining the 2 series and resetting the index
combined_timeseries = (
    pd.concat(
        [pd.Series(dates_dt), pd.Series(future_dt)],
        axis = 0
    )
    .reset_index(drop = True)
)
combined_timeseries
Next, we’ll take a look at how to go from an irregular time series to a regular time series.
Flooring Dates
An example is tk.floor_date, which is used to round down dates. See help(tk.floor_date).
Flooring dates is often part of a strategy for going from an irregular time series to a regular one, by combining the flooring with an aggregation. Often summarize_by_time() is used (I’ll share why shortly), but conceptually, date flooring is the secret.
This “date flooring” operation can be useful for creating date groupings.
Code
# Adding a date group with floor_date()
dates_grouped_by_month = (
    m4_daily_df
        .assign(date_group = lambda x: x['date'].floor_date("M"))
)
dates_grouped_by_month
        id       date   value date_group
0      D10 2014-07-03  2076.2 2014-07-01
1      D10 2014-07-03  2073.4 2014-07-01
2      D10 2014-07-03  2048.7 2014-07-01
3      D10 2014-07-03  2048.9 2014-07-01
4      D10 2014-07-03  2006.4 2014-07-01
...    ...        ...     ...        ...
9738  D500 2014-07-03  9418.8 2014-07-01
9739  D500 2014-07-03  9365.7 2014-07-01
9740  D500 2014-07-03  9445.9 2014-07-01
9741  D500 2014-07-03  9497.9 2014-07-01
9742  D500 2014-07-03  9545.3 2014-07-01

[9743 rows x 4 columns]
We can then do grouped operations.
Code
# Example of a grouped operation with floored dates
summary_df = (
    dates_grouped_by_month
        .drop('date', axis = 1)
        .groupby(['id', 'date_group'])
        .mean()
        .reset_index()
)
summary_df
     id date_group        value
0   D10 2014-07-01  2261.606825
1  D160 2014-07-01  9243.155254
2  D410 2014-07-01  8259.786346
3  D500 2014-07-01  8287.728789
Of course, we can do this operation faster with summarize_by_time() (and it’s much more flexible).
Code
# Summarize by time is less code and more flexible
(
    m4_daily_df
        .groupby('id')
        .summarize_by_time(
            'date', 'value',
            freq     = "MS",
            agg_func = ['mean', 'median', 'min', 'max']
        )
)
     id       date   value_mean  value_median  value_min  value_max
0   D10 2014-07-01  2261.606825       2302.30    1781.60    2649.30
1  D160 2014-07-01  9243.155254      10097.30    1734.90   19432.50
2  D410 2014-07-01  8259.786346       8382.81    6309.38    9540.62
3  D500 2014-07-01  8287.728789       7662.10    4172.10   14954.10
And that’s the core idea behind pytimetk, writing less code and getting more.
Next, let’s do one more function. The brother of augment_timeseries_signature()…
The Get Time Series Signature Function
This function takes a pandas Series or DateTimeIndex and returns a DataFrame containing the 29 engineered features.
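To give a feel for what the signature contains, here is a pandas-only sketch that builds a handful of the 29 features with the .dt accessor. The column names mirror the augment output shown earlier, but the weekend flag uses a conventional Saturday/Sunday definition, which may differ from pytimetk's own:

```python
import pandas as pd

# A small daily timestamp series, as you might pass to the function
dates = pd.Series(pd.date_range('2014-07-03', periods = 5, freq = 'D'))

# A few of the signature features, built with the pandas .dt accessor
signature = pd.DataFrame({
    'date':         dates,
    'date_year':    dates.dt.year,
    'date_half':    (dates.dt.quarter + 1) // 2,   # 1 = Jan-Jun, 2 = Jul-Dec
    'date_mday':    dates.dt.day,
    'date_yday':    dates.dt.dayofyear,
    'date_weekend': (dates.dt.dayofweek >= 5).astype(int),  # Sat/Sun
})
print(signature)
```

The full pytimetk signature extends this same idea to 29 calendar and clock features (year, half, quarter, month, day, hour, minute, second, and so on) computed from a single timestamp column.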