PyTimeTK Basics

PyTimeTK has one mission: To make time series analysis simpler, easier, and faster in Python. This goal requires some opinionated ways of treating time series in Python. We will conceptually lay out how pytimetk can help.

How this guide benefits you

This guide covers how to use pytimetk conceptually. Once you understand key concepts, you can go from basic to advanced time series analysis very fast.

Let’s first start with how to think about time series data conceptually. Time series data has 3 core properties.

1 The 3 Core Properties of Time Series Data

Every time series DataFrame should have the following properties:

  1. Time Series Index: A column containing ‘datetime64’ time stamps.
  2. Value Columns: One or more columns containing numeric data that can be aggregated and visualized by time.
  3. Group Columns (Optional): One or more categorical or str columns that the data can be grouped by, so each time series can be evaluated by group.
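To make the 3 properties concrete, here is a minimal compliant DataFrame built by hand (the data itself is hypothetical, purely for illustration):

```python
import pandas as pd

# A tiny frame that satisfies all 3 core properties
df = pd.DataFrame({
    "id":    ["A", "A", "B", "B"],                          # group column (str)
    "date":  pd.to_datetime(["2023-01-01", "2023-01-02",
                             "2023-01-01", "2023-01-02"]),  # time series index (datetime64)
    "value": [10.0, 11.5, 20.0, 19.0],                      # value column (float)
})

print(df.dtypes)
```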

In practice, here’s what this looks like using the “m4_daily” dataset:

Code
# Import packages
import pytimetk as tk
import pandas as pd
import numpy as np

# Import a Time Series Data Set
m4_daily_df = tk.load_dataset("m4_daily", parse_dates = ['date'])
m4_daily_df
id date value
0 D10 2014-07-03 2076.2
1 D10 2014-07-04 2073.4
2 D10 2014-07-05 2048.7
3 D10 2014-07-06 2048.9
4 D10 2014-07-07 2006.4
... ... ... ...
9738 D500 2012-09-19 9418.8
9739 D500 2012-09-20 9365.7
9740 D500 2012-09-21 9445.9
9741 D500 2012-09-22 9497.9
9742 D500 2012-09-23 9545.3

9743 rows × 3 columns


We can see that the m4_daily dataset has:

  1. Time Series Index: The date column
  2. Value Column(s): The value column
  3. Group Column(s): The id column
Missing any of the 3 Core Properties of Time Series Data

If your data is not formatted properly for pytimetk, meaning it’s missing columns containing datetime values, numeric values, or groups, this can impact your ability to use pytimetk for time series analysis.

No Pandas Index, No Problem

Timetk standardizes on using a date column. This reduces friction when converting to other package formats like polars, which don’t use an index (each row is indexed by its integer position).
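If your data currently stores its timestamps in a pandas DatetimeIndex (a common pandas pattern), you can move the index into a regular date column before using pytimetk. A minimal sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame that keeps its timestamps in the index
indexed_df = pd.DataFrame(
    {"value": np.arange(5.0)},
    index=pd.date_range("2023-01-01", periods=5, freq="D"),
)

# Move the index into a regular 'date' column so the frame is pytimetk-friendly
tidy_df = indexed_df.rename_axis("date").reset_index()
print(tidy_df.head())
```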

2 The 2 Ways that Timetk Makes Time Series Analysis Easier

2 Types of Time Series Functions
  1. Pandas DataFrame Operations
  2. Pandas Series Operations

Timetk contains a number of functions designed to make time series analysis operations easier. In general, these operations come in 2 types of time series functions:

  1. Pandas DataFrame Operations: These functions work on pd.DataFrame objects and derivatives such as groupby() objects for Grouped Time Series Analysis. You will see data as the first parameter in these functions.

  2. Pandas Series Operations: These functions work on pd.Series objects.

    • Time Series Index Operations: These are designed for the time series index. You will see idx as the first parameter of these functions. These functions also work with datetime64 values (e.g., those produced when you parse_dates via pd.read_csv() or create time series with pd.date_range()).

    • Numeric Operations: These are designed for numeric values. You will see x as the first parameter for these functions.

Let’s take a look at how to use the different types of Time Series Analysis functions in pytimetk. We’ll start with Type 1: Pandas DataFrame Operations.

2.1 Type 1: Pandas DataFrame Operations

Before we start using pytimetk, let’s make sure our data is set up properly.

Timetk Data Format Compliance

3 Core Properties Must Be Upheld

A pytimetk-Compliant Pandas DataFrame must have:

  1. Time Series Index: A Time Stamp column containing datetime64 values
  2. Value Column(s): The value column(s) containing float or int values
  3. Group Column(s): Optionally, for grouped time series analysis, one or more columns containing str or categorical values (shown as object dtype)

If these are NOT upheld, this will impact your ability to use pytimetk DataFrame operations.
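If you want to verify compliance programmatically, the dtype checks can be sketched with pandas alone. The helper below is illustrative (not part of pytimetk), using hypothetical example data:

```python
import pandas as pd

def check_timetk_compliance(df, date_column, value_columns):
    """Rough compliance check for the 3 core properties (illustrative helper)."""
    assert pd.api.types.is_datetime64_any_dtype(df[date_column]), \
        f"'{date_column}' must be datetime64 -- try pd.to_datetime()"
    for col in value_columns:
        assert pd.api.types.is_numeric_dtype(df[col]), \
            f"'{col}' must be numeric (float or int)"
    return True

# Example on a tiny frame
df = pd.DataFrame({
    "date":  pd.to_datetime(["2023-01-01", "2023-01-02"]),
    "value": [1.0, 2.0],
})
print(check_timetk_compliance(df, "date", ["value"]))
```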

Inspect the DataFrame

Use the tk.glimpse() method to check compliance.

Using pytimetk’s glimpse() method, we can see that we have a compliant data frame with a date column containing datetime64 and a value column containing float64. For grouped analysis we have the id column containing object dtype.

Code
# Tip: Inspect for compliance with glimpse()
m4_daily_df.glimpse()
<class 'pandas.core.frame.DataFrame'>: 9743 rows of 3 columns
id:     object            ['D10', 'D10', 'D10', 'D10', 'D10', 'D10', 'D1 ...
date:   datetime64[ns]    [Timestamp('2014-07-03 00:00:00'), Timestamp(' ...
value:  float64           [2076.2, 2073.4, 2048.7, 2048.9, 2006.4, 2017. ...

Grouped Time Series Analysis with Summarize By Time

First, inspect how the summarize_by_time function works by calling help().

Code
# Review the summarize_by_time documentation (output not shown)
help(tk.summarize_by_time)
Help Doc Info: summarize_by_time()
  • The first parameter is data, indicating this is a DataFrame operation.
  • The Examples show different use cases for how to apply the function on a DataFrame

Let’s test the summarize_by_time() DataFrame operation out using the grouped approach with method chaining. DataFrame operations can be used as Pandas methods with method-chaining, which allows us to more succinctly apply time series operations.

Code
# Grouped Summarize By Time with Method Chaining
df_summarized = (
    m4_daily_df
        .groupby('id')
        .summarize_by_time(
            date_column  = 'date',
            value_column = 'value',
            freq         = 'QS', # QS = Quarter Start
            agg_func     = [
                'mean', 
                'median', 
                'min',
                ('q25', lambda x: np.quantile(x, 0.25)),
                ('q75', lambda x: np.quantile(x, 0.75)),
                'max',
                ('range', lambda x: x.max() - x.min()),
            ],
        )
)

df_summarized
id date value_mean value_median value_min value_q25 value_q75 value_max value_range
0 D10 2014-07-01 1960.078889 1979.90 1781.6 1915.225 2002.575 2076.2 294.6
1 D10 2014-10-01 2184.586957 2154.05 2022.8 2125.075 2274.150 2344.9 322.1
2 D10 2015-01-01 2309.830000 2312.30 2209.6 2284.575 2342.150 2392.4 182.8
3 D10 2015-04-01 2344.481319 2333.00 2185.1 2301.750 2391.000 2499.8 314.7
4 D10 2015-07-01 2156.754348 2186.70 1856.6 1997.250 2289.425 2368.1 511.5
... ... ... ... ... ... ... ... ... ...
105 D500 2011-07-01 9727.321739 9745.55 8964.5 9534.125 10003.900 10463.9 1499.4
106 D500 2011-10-01 8175.565217 7897.00 6755.0 7669.875 8592.575 9860.0 3105.0
107 D500 2012-01-01 8291.317582 8412.60 7471.5 7814.800 8677.850 8980.7 1509.2
108 D500 2012-04-01 8654.020879 8471.10 8245.6 8389.850 9017.250 9349.2 1103.6
109 D500 2012-07-01 8770.502353 8690.50 8348.1 8604.400 8846.000 9545.3 1197.2

110 rows × 9 columns

Key Takeaways: summarize_by_time()
  • The data must comply with the 3 core properties (date column, value column(s), and group column(s))
  • The aggregation functions were applied by a combination of group (id) and resampling period (Quarter Start)
  • The result was a pandas DataFrame with the group column, resampled date column, and summary values (mean, median, min, 25th quantile, etc.)
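For intuition, this kind of grouped time-based summary can be approximated with plain pandas by combining groupby with resample. A rough sketch on hypothetical daily data (not an exact reimplementation of summarize_by_time):

```python
import numpy as np
import pandas as pd

# Hypothetical daily data for two groups
rng = pd.date_range("2023-01-01", periods=180, freq="D")
df = pd.DataFrame({
    "id":    np.repeat(["A", "B"], 180),
    "date":  np.tile(rng, 2),
    "value": np.arange(360, dtype=float),
})

# Group, resample to quarter start, then aggregate the value column
summary = (
    df
        .set_index("date")
        .groupby("id")
        .resample("QS")["value"]
        .agg(["mean", "median", "min", "max"])
        .reset_index()
)
print(summary)
```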

Another DataFrame Example: Creating 29 Engineered Features

Let’s examine another DataFrame function, tk.augment_timeseries_signature(). Feel free to inspect the documentation with help(tk.augment_timeseries_signature).

Code
# Creating 29 engineered features from the date column
# Not run: help(tk.augment_timeseries_signature)
df_augmented = (
    m4_daily_df
        .augment_timeseries_signature(date_column = 'date')
)

df_augmented.head()
id date value date_index_num date_year date_year_iso date_yearstart date_yearend date_leapyear date_half ... date_mday date_qday date_yday date_weekend date_hour date_minute date_second date_msecond date_nsecond date_am_pm
0 D10 2014-07-03 2076.2 1404345600 2014 2014 0 0 0 2 ... 3 3 184 0 0 0 0 0 0 am
1 D10 2014-07-04 2073.4 1404432000 2014 2014 0 0 0 2 ... 4 4 185 0 0 0 0 0 0 am
2 D10 2014-07-05 2048.7 1404518400 2014 2014 0 0 0 2 ... 5 5 186 0 0 0 0 0 0 am
3 D10 2014-07-06 2048.9 1404604800 2014 2014 0 0 0 2 ... 6 6 187 1 0 0 0 0 0 am
4 D10 2014-07-07 2006.4 1404691200 2014 2014 0 0 0 2 ... 7 7 188 0 0 0 0 0 0 am

5 rows × 32 columns

Key Takeaways: augment_timeseries_signature()
  • The data must comply with 1 of the 3 core properties (the date column)
  • The result was a pandas DataFrame with 29 time series features that can be used for Machine Learning and Forecasting
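A handful of the same kinds of calendar features can be built by hand with the pandas .dt accessor, which helps clarify what the signature columns contain (a small hand-rolled sketch, not the pytimetk implementation):

```python
import pandas as pd

dates = pd.Series(pd.date_range("2023-01-01", periods=7, freq="D"))

# A few calendar features derived directly from the timestamps
features = pd.DataFrame({
    "date":         dates,
    "date_year":    dates.dt.year,
    "date_month":   dates.dt.month,
    "date_mday":    dates.dt.day,
    "date_yday":    dates.dt.dayofyear,
    "date_weekend": (dates.dt.dayofweek >= 5).astype(int),  # Sat/Sun -> 1
})
print(features)
```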

Making Future Dates with Future Frame

A common time series task before forecasting with machine learning models is to make a future DataFrame some length_out into the future. You can do this with tk.future_frame(). Here’s how.

Code
# Preparing a time series data set for Machine Learning Forecasting
full_augmented_df = (
    m4_daily_df 
        .groupby('id')
        .future_frame('date', length_out = 365)
        .augment_timeseries_signature('date')
)
full_augmented_df
id date value date_index_num date_year date_year_iso date_yearstart date_yearend date_leapyear date_half ... date_mday date_qday date_yday date_weekend date_hour date_minute date_second date_msecond date_nsecond date_am_pm
0 D10 2014-07-03 2076.2 1404345600 2014 2014 0 0 0 2 ... 3 3 184 0 0 0 0 0 0 am
1 D10 2014-07-04 2073.4 1404432000 2014 2014 0 0 0 2 ... 4 4 185 0 0 0 0 0 0 am
2 D10 2014-07-05 2048.7 1404518400 2014 2014 0 0 0 2 ... 5 5 186 0 0 0 0 0 0 am
3 D10 2014-07-06 2048.9 1404604800 2014 2014 0 0 0 2 ... 6 6 187 1 0 0 0 0 0 am
4 D10 2014-07-07 2006.4 1404691200 2014 2014 0 0 0 2 ... 7 7 188 0 0 0 0 0 0 am
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11198 D500 2013-09-19 NaN 1379548800 2013 2013 0 0 0 2 ... 19 81 262 0 0 0 0 0 0 am
11199 D500 2013-09-20 NaN 1379635200 2013 2013 0 0 0 2 ... 20 82 263 0 0 0 0 0 0 am
11200 D500 2013-09-21 NaN 1379721600 2013 2013 0 0 0 2 ... 21 83 264 0 0 0 0 0 0 am
11201 D500 2013-09-22 NaN 1379808000 2013 2013 0 0 0 2 ... 22 84 265 1 0 0 0 0 0 am
11202 D500 2013-09-23 NaN 1379894400 2013 2013 0 0 0 2 ... 23 85 266 0 0 0 0 0 0 am

11203 rows × 32 columns

We can then get the future data by filtering for rows where the value column is missing (np.nan).

Code
# Get the future data (just the observations that haven't happened yet)
future_df = (
    full_augmented_df
        .query('value.isna()')
)
future_df
id date value date_index_num date_year date_year_iso date_yearstart date_yearend date_leapyear date_half ... date_mday date_qday date_yday date_weekend date_hour date_minute date_second date_msecond date_nsecond date_am_pm
9743 D10 2016-05-07 NaN 1462579200 2016 2016 0 0 1 1 ... 7 37 128 0 0 0 0 0 0 am
9744 D10 2016-05-08 NaN 1462665600 2016 2016 0 0 1 1 ... 8 38 129 1 0 0 0 0 0 am
9745 D10 2016-05-09 NaN 1462752000 2016 2016 0 0 1 1 ... 9 39 130 0 0 0 0 0 0 am
9746 D10 2016-05-10 NaN 1462838400 2016 2016 0 0 1 1 ... 10 40 131 0 0 0 0 0 0 am
9747 D10 2016-05-11 NaN 1462924800 2016 2016 0 0 1 1 ... 11 41 132 0 0 0 0 0 0 am
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11198 D500 2013-09-19 NaN 1379548800 2013 2013 0 0 0 2 ... 19 81 262 0 0 0 0 0 0 am
11199 D500 2013-09-20 NaN 1379635200 2013 2013 0 0 0 2 ... 20 82 263 0 0 0 0 0 0 am
11200 D500 2013-09-21 NaN 1379721600 2013 2013 0 0 0 2 ... 21 83 264 0 0 0 0 0 0 am
11201 D500 2013-09-22 NaN 1379808000 2013 2013 0 0 0 2 ... 22 84 265 1 0 0 0 0 0 am
11202 D500 2013-09-23 NaN 1379894400 2013 2013 0 0 0 2 ... 23 85 266 0 0 0 0 0 0 am

1460 rows × 32 columns
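Conceptually, future_frame() appends the next N timestamps to each group, leaving the value column empty. A rough pandas-only sketch of that idea, assuming a daily frequency and hypothetical data:

```python
import pandas as pd

# Hypothetical grouped daily data
df = pd.DataFrame({
    "id":    ["A"] * 3 + ["B"] * 3,
    "date":  pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"] * 2),
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

def extend_group(g, length_out=2, freq="D"):
    # Build future rows starting one step after the group's last date
    future_dates = pd.date_range(
        g["date"].max() + pd.tseries.frequencies.to_offset(freq),
        periods=length_out, freq=freq,
    )
    future = pd.DataFrame({"id": g["id"].iloc[0], "date": future_dates})
    return pd.concat([g, future], ignore_index=True)

full_df = (
    df
        .groupby("id", group_keys=False)
        .apply(extend_group)
        .reset_index(drop=True)
)
print(full_df)  # value is NaN on the appended future rows
```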

2.2 Type 2: Pandas Series Operations

The main difference between a DataFrame operation and a Series operation is that we are operating on an array of values, typically from one of the following dtypes:

  1. Timestamps (datetime64)
  2. Numeric (float64 or int64)

The first argument of Series operations that operate on Timestamps will always be idx.

Let’s take a look at one. We’ll start with a common action: making a future time series from an existing time series with a regular frequency.

The Make Future Time Series Function

Say we have a monthly sequence of timestamps. What if we want to create a forecast where we predict 12 months into the future? Well, we will need to create 12 future timestamps. Here’s how.

First create a pd.date_range() with dates starting at the beginning of each month.

Code
# Make a monthly date range
dates_dt = pd.date_range("2023-01", "2024-01", freq="MS")
dates_dt
DatetimeIndex(['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01',
               '2023-05-01', '2023-06-01', '2023-07-01', '2023-08-01',
               '2023-09-01', '2023-10-01', '2023-11-01', '2023-12-01',
               '2024-01-01'],
              dtype='datetime64[ns]', freq='MS')

Next, use tk.make_future_timeseries() to create the next 12 timestamps in the sequence.

Code
# Pandas Series: Future Dates
future_series = pd.Series(dates_dt).make_future_timeseries(12)
future_series
0    2024-02-01
1    2024-03-01
2    2024-04-01
3    2024-05-01
4    2024-06-01
5    2024-07-01
6    2024-08-01
7    2024-09-01
8    2024-10-01
9    2024-11-01
10   2024-12-01
11   2025-01-01
dtype: datetime64[ns]
Code
# DateTimeIndex: Future Dates
future_dt = tk.make_future_timeseries(
    idx        = dates_dt,
    length_out = 12
)
future_dt
0    2024-02-01
1    2024-03-01
2    2024-04-01
3    2024-05-01
4    2024-06-01
5    2024-07-01
6    2024-08-01
7    2024-09-01
8    2024-10-01
9    2024-11-01
10   2024-12-01
11   2025-01-01
dtype: datetime64[ns]

We can combine the actual and future timestamps into one combined time series.

Code
# Combining the 2 series and resetting the index
combined_timeseries = (
    pd.concat(
        [pd.Series(dates_dt), pd.Series(future_dt)],
        axis=0
    )
        .reset_index(drop = True)
)

combined_timeseries
0    2023-01-01
1    2023-02-01
2    2023-03-01
3    2023-04-01
4    2023-05-01
5    2023-06-01
6    2023-07-01
7    2023-08-01
8    2023-09-01
9    2023-10-01
10   2023-11-01
11   2023-12-01
12   2024-01-01
13   2024-02-01
14   2024-03-01
15   2024-04-01
16   2024-05-01
17   2024-06-01
18   2024-07-01
19   2024-08-01
20   2024-09-01
21   2024-10-01
22   2024-11-01
23   2024-12-01
24   2025-01-01
dtype: datetime64[ns]
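Under the hood, this kind of extension can be approximated with pandas alone by inferring the frequency of the existing sequence and continuing it. A minimal sketch (assuming the sequence has a regular, inferable frequency):

```python
import pandas as pd

dates_dt = pd.date_range("2023-01", "2024-01", freq="MS")

# Infer the frequency from the existing sequence, then continue it
freq = pd.infer_freq(dates_dt)  # 'MS' for this month-start sequence
next_dates = pd.date_range(
    dates_dt[-1] + pd.tseries.frequencies.to_offset(freq),
    periods=12,
    freq=freq,
)
print(next_dates[0], next_dates[-1])
```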

Next, we’ll take a look at how to go from an irregular time series to a regular time series.

Flooring Dates

An example is tk.floor_date, which is used to round down dates. See help(tk.floor_date).

Flooring dates is often used as part of a strategy to go from an irregular time series to regular by combining with an aggregation. Often summarize_by_time() is used (I’ll share why shortly). But conceptually, date flooring is the secret.
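For intuition, month flooring can also be expressed in plain pandas by round-tripping through a monthly period (a small sketch, not the pytimetk implementation):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2014-07-03", "2014-07-18", "2012-09-19"]))

# Month flooring: convert to a monthly period, then back to a timestamp
floored = dates.dt.to_period("M").dt.to_timestamp()
print(floored)
```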

Code
# Monthly flooring rounds dates down to 1st of the month
m4_daily_df['date'].floor_date(unit = "M")
0      2014-07-01
1      2014-07-01
2      2014-07-01
3      2014-07-01
4      2014-07-01
          ...    
9738   2012-09-01
9739   2012-09-01
9740   2012-09-01
9741   2012-09-01
9742   2012-09-01
Name: date, Length: 9743, dtype: datetime64[ns]
Code
# Before Flooring
m4_daily_df['date']
0      2014-07-03
1      2014-07-04
2      2014-07-05
3      2014-07-06
4      2014-07-07
          ...    
9738   2012-09-19
9739   2012-09-20
9740   2012-09-21
9741   2012-09-22
9742   2012-09-23
Name: date, Length: 9743, dtype: datetime64[ns]

This “date flooring” operation can be useful for creating date groupings.

Code
# Adding a date group with floor_date()
dates_grouped_by_month = (
    m4_daily_df
        .assign(date_group = lambda x: x['date'].floor_date("M"))
)

dates_grouped_by_month
id date value date_group
0 D10 2014-07-03 2076.2 2014-07-01
1 D10 2014-07-04 2073.4 2014-07-01
2 D10 2014-07-05 2048.7 2014-07-01
3 D10 2014-07-06 2048.9 2014-07-01
4 D10 2014-07-07 2006.4 2014-07-01
... ... ... ... ...
9738 D500 2012-09-19 9418.8 2012-09-01
9739 D500 2012-09-20 9365.7 2012-09-01
9740 D500 2012-09-21 9445.9 2012-09-01
9741 D500 2012-09-22 9497.9 2012-09-01
9742 D500 2012-09-23 9545.3 2012-09-01

9743 rows × 4 columns

We can then do grouped operations.

Code
# Example of a grouped operation with floored dates
summary_df = (
    dates_grouped_by_month
        .drop('date', axis=1)
        .groupby(['id', 'date_group'])
        .mean()
        .reset_index()
)

summary_df
id date_group value
0 D10 2014-07-01 2261.606825
1 D160 2014-07-01 9243.155254
2 D410 2014-07-01 8259.786346
3 D500 2014-07-01 8287.728789

Of course, we can do this operation faster with summarize_by_time() (and it’s much more flexible).

Code
# Summarize by time is less code and more flexible
(
    m4_daily_df 
        .groupby('id')
        .summarize_by_time(
            'date', 'value', 
            freq = "MS",
            agg_func = ['mean', 'median', 'min', 'max']
        )
)
id date value_mean value_median value_min value_max
0 D10 2014-07-01 2261.606825 2302.30 1781.60 2649.30
1 D160 2014-07-01 9243.155254 10097.30 1734.90 19432.50
2 D410 2014-07-01 8259.786346 8382.81 6309.38 9540.62
3 D500 2014-07-01 8287.728789 7662.10 4172.10 14954.10

And that’s the core idea behind pytimetk, writing less code and getting more.

Next, let’s do one more function: the brother of augment_timeseries_signature().

The Get Time Series Signature Function

This function takes a pandas Series or DateTimeIndex and returns a DataFrame containing the 29 engineered features.

Start with either a DateTimeIndex…

Code
timestamps_dt = pd.date_range("2023", "2024", freq = "D")
timestamps_dt
DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08',
               '2023-01-09', '2023-01-10',
               ...
               '2023-12-23', '2023-12-24', '2023-12-25', '2023-12-26',
               '2023-12-27', '2023-12-28', '2023-12-29', '2023-12-30',
               '2023-12-31', '2024-01-01'],
              dtype='datetime64[ns]', length=366, freq='D')

… Or a Pandas Series.

Code
timestamps_series = pd.Series(timestamps_dt)
timestamps_series
0     2023-01-01
1     2023-01-02
2     2023-01-03
3     2023-01-04
4     2023-01-05
         ...    
361   2023-12-28
362   2023-12-29
363   2023-12-30
364   2023-12-31
365   2024-01-01
Length: 366, dtype: datetime64[ns]

And you can use the pandas Series function tk.get_timeseries_signature() to create 29 features from the date sequence.

Code
# Pandas series: get_timeseries_signature
timestamps_series.get_timeseries_signature()
idx idx_index_num idx_year idx_year_iso idx_yearstart idx_yearend idx_leapyear idx_half idx_quarter idx_quarteryear ... idx_mday idx_qday idx_yday idx_weekend idx_hour idx_minute idx_second idx_msecond idx_nsecond idx_am_pm
0 2023-01-01 1672531200 2023 2022 1 0 0 1 1 2023Q1 ... 1 1 1 1 0 0 0 0 0 am
1 2023-01-02 1672617600 2023 2023 0 0 0 1 1 2023Q1 ... 2 2 2 0 0 0 0 0 0 am
2 2023-01-03 1672704000 2023 2023 0 0 0 1 1 2023Q1 ... 3 3 3 0 0 0 0 0 0 am
3 2023-01-04 1672790400 2023 2023 0 0 0 1 1 2023Q1 ... 4 4 4 0 0 0 0 0 0 am
4 2023-01-05 1672876800 2023 2023 0 0 0 1 1 2023Q1 ... 5 5 5 0 0 0 0 0 0 am
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
361 2023-12-28 1703721600 2023 2023 0 0 0 2 4 2023Q4 ... 28 89 362 0 0 0 0 0 0 am
362 2023-12-29 1703808000 2023 2023 0 0 0 2 4 2023Q4 ... 29 90 363 0 0 0 0 0 0 am
363 2023-12-30 1703894400 2023 2023 0 0 0 2 4 2023Q4 ... 30 91 364 0 0 0 0 0 0 am
364 2023-12-31 1703980800 2023 2023 0 1 0 2 4 2023Q4 ... 31 92 365 1 0 0 0 0 0 am
365 2024-01-01 1704067200 2024 2024 1 0 1 1 1 2024Q1 ... 1 1 1 0 0 0 0 0 0 am

366 rows × 30 columns

Code
# DateTimeIndex: get_timeseries_signature
tk.get_timeseries_signature(timestamps_dt)
idx idx_index_num idx_year idx_year_iso idx_yearstart idx_yearend idx_leapyear idx_half idx_quarter idx_quarteryear ... idx_mday idx_qday idx_yday idx_weekend idx_hour idx_minute idx_second idx_msecond idx_nsecond idx_am_pm
0 2023-01-01 1672531200 2023 2022 1 0 0 1 1 2023Q1 ... 1 1 1 1 0 0 0 0 0 am
1 2023-01-02 1672617600 2023 2023 0 0 0 1 1 2023Q1 ... 2 2 2 0 0 0 0 0 0 am
2 2023-01-03 1672704000 2023 2023 0 0 0 1 1 2023Q1 ... 3 3 3 0 0 0 0 0 0 am
3 2023-01-04 1672790400 2023 2023 0 0 0 1 1 2023Q1 ... 4 4 4 0 0 0 0 0 0 am
4 2023-01-05 1672876800 2023 2023 0 0 0 1 1 2023Q1 ... 5 5 5 0 0 0 0 0 0 am
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
361 2023-12-28 1703721600 2023 2023 0 0 0 2 4 2023Q4 ... 28 89 362 0 0 0 0 0 0 am
362 2023-12-29 1703808000 2023 2023 0 0 0 2 4 2023Q4 ... 29 90 363 0 0 0 0 0 0 am
363 2023-12-30 1703894400 2023 2023 0 0 0 2 4 2023Q4 ... 30 91 364 0 0 0 0 0 0 am
364 2023-12-31 1703980800 2023 2023 0 1 0 2 4 2023Q4 ... 31 92 365 1 0 0 0 0 0 am
365 2024-01-01 1704067200 2024 2024 1 0 1 1 1 2024Q1 ... 1 1 1 0 0 0 0 0 0 am

366 rows × 30 columns

3 Next steps

Check out the Pandas Frequency Guide next.

4 More Coming Soon…

We are in the early stages of development, but the potential for pytimetk in Python is already clear. 🐍