PyTimeTK has one mission: To make time series analysis simpler, easier, and faster in Python. This goal requires some opinionated ways of treating time series in Python. We will conceptually lay out how pytimetk can help.
How this guide benefits you
This guide covers how to use pytimetk conceptually. Once you understand key concepts, you can go from basic to advanced time series analysis very fast.
Let’s first start with how to think about time series data conceptually. Time series data has 3 core properties.
1 The 3 Core Properties of Time Series Data
Every time series DataFrame should have the following properties:
Time Series Index: A column containing ‘datetime64’ time stamps.
Value Columns: One or more columns containing numeric data that can be aggregated and visualized by time
Group Columns (Optional): One or more categorical or str columns that can be grouped by, so the time series can be evaluated by group.
In practice here’s what this looks like using the “m4_daily” dataset:
Code
# Import packages
import pytimetk as tk
import pandas as pd
import numpy as np

# Import a Time Series Data Set
m4_daily_df = tk.load_dataset("m4_daily", parse_dates = ['date'])
m4_daily_df
        id       date   value
0      D10 2014-07-03  2076.2
1      D10 2014-07-04  2073.4
2      D10 2014-07-05  2048.7
3      D10 2014-07-06  2048.9
4      D10 2014-07-07  2006.4
...    ...        ...     ...
9738  D500 2012-09-19  9418.8
9739  D500 2012-09-20  9365.7
9740  D500 2012-09-21  9445.9
9741  D500 2012-09-22  9497.9
9742  D500 2012-09-23  9545.3

[9743 rows x 3 columns]
The 3 Core Properties of Time Series Data (Example: m4_daily dataset)
We can see that the m4_daily dataset has:
Time Series Index: The date column
Value Column(s): The value column
Group Column(s): The id column
Missing any of the 3 Core Properties of Time Series Data
If your data is not formatted properly for pytimetk (i.e., it is missing datetime, numeric value, or grouping columns), this will limit your ability to use pytimetk for time series analysis.
No Pandas Index, No Problem
Timetk standardizes on using a date column. This reduces friction when converting to other package formats like polars, which doesn't use an index (each row is indexed by its integer position).
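If your data carries its timestamps in a pandas DatetimeIndex rather than a column, moving them into a date column is one reset_index() away. A minimal sketch (the variable names here are illustrative, not from pytimetk):

```python
import pandas as pd

# Data with a DatetimeIndex rather than a date column
df_indexed = pd.DataFrame(
    {'value': [1.0, 2.0, 3.0]},
    index = pd.date_range('2023-01-01', periods = 3, freq = 'D'),
)
df_indexed.index.name = 'date'

# Convert to the column-based format pytimetk expects
df_flat = df_indexed.reset_index()
print(df_flat.columns.tolist())  # ['date', 'value']
```

After this, the timestamps live in an ordinary datetime64 column and every row is indexed by its integer position, matching the convention above.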
2 The 2 Ways that Timetk Makes Time Series Analysis Easier
2 Types of Time Series Functions
Pandas DataFrame Operations
Pandas Series Operations
Timetk contains a number of functions designed to make time series analysis operations easier. In general, these operations come in 2 types of time series functions:
Pandas DataFrame Operations: These functions work on pd.DataFrame objects and derivatives such as groupby() objects for Grouped Time Series Analysis. You will see data as the first parameter in these functions.
Pandas Series Operations: These functions work on pd.Series objects and come in 2 flavors:
Time Series Index Operations: Designed for time series indexes. You will see idx as the first parameter of these functions. These functions also work with datetime64 values (e.g., those produced when you parse dates via pd.read_csv() or create time series with pd.date_range()).
Numeric Operations: Designed for numeric values. You will see x as the first parameter of these functions.
Let’s take a look at how to use the different types of Time Series Analysis functions in pytimetk. We’ll start with Type 1: Pandas DataFrame Operations.
2.1 Type 1: Pandas DataFrame Operations
Before we start using pytimetk, let’s make sure our data is set up properly.
Timetk Data Format Compliance
3 Core Properties Must Be Upheld
A pytimetk-Compliant Pandas DataFrame must have:
Time Series Index: A Time Stamp column containing datetime64 values
Value Column(s): The value column(s) containing float or int values
Group Column(s): Optionally, for grouped time series analysis, one or more columns containing str or categorical values (shown as object dtype)
If these are NOT upheld, this will impact your ability to use pytimetk DataFrame operations.
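A quick programmatic compliance check can be sketched with pandas' dtype helpers. The check_compliance() helper below is hypothetical (it is not a pytimetk API), but it shows the dtype conditions the 3 core properties require:

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

def check_compliance(df, date_column, value_columns):
    """Hypothetical helper (not a pytimetk API): report violations
    of the core dtype requirements for the given columns."""
    problems = []
    if not is_datetime64_any_dtype(df[date_column]):
        problems.append(f"'{date_column}' is not datetime64")
    for col in value_columns:
        if not is_numeric_dtype(df[col]):
            problems.append(f"'{col}' is not numeric")
    return problems

# A tiny compliant frame mirroring m4_daily's layout
df = pd.DataFrame({
    'date':  pd.to_datetime(['2014-07-03', '2014-07-04']),
    'value': [2076.2, 2073.4],
    'id':    ['D10', 'D10'],
})
print(check_compliance(df, 'date', ['value']))  # no problems reported
```

If the date column were left as strings (a common result of reading a CSV without parse_dates), the helper would flag it, which is exactly the kind of issue that breaks pytimetk DataFrame operations.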
Inspect the DataFrame
Use the tk.glimpse() method to check compliance.
Using pytimetk's glimpse() method, we can see that we have a compliant DataFrame with a date column containing datetime64 values and a value column containing float64 values. For grouped analysis, we have the id column containing object dtype.
Code
# Tip: Inspect for compliance with glimpse()
m4_daily_df.glimpse()
Grouped Time Series Analysis with Summarize By Time
First, inspect how the summarize_by_time function works by calling help().
Code
# Review the summarize_by_time documentation (output not shown)
help(tk.summarize_by_time)
Help Doc Info: summarize_by_time()
The first parameter is data, indicating this is a DataFrame operation.
The Examples section shows different use cases for applying the function to a DataFrame.
Let’s test the summarize_by_time() DataFrame operation using the grouped approach with method chaining. DataFrame operations can be used as pandas methods with method chaining, which lets us apply time series operations more succinctly.
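Under the hood, a grouped summarize_by_time() is equivalent to a pandas groupby-plus-resample. A pandas-only sketch of the same idea (the toy data, quarterly frequency, and aggregation list here are illustrative, not the exact call from this guide):

```python
import pandas as pd
import numpy as np

# Toy stand-in for m4_daily: two ids, 120 daily observations each
df = pd.DataFrame({
    'id':    ['D10'] * 120 + ['D500'] * 120,
    'date':  list(pd.date_range('2014-01-01', periods = 120, freq = 'D')) * 2,
    'value': np.arange(240, dtype = float),
})

# Pandas equivalent of a grouped summarize_by_time() at Quarter Start
summary = (
    df
        .set_index('date')
        .groupby('id')
        .resample('QS')['value']
        .agg(['mean', 'median', 'min', 'max'])
        .reset_index()
)
print(summary)
```

The result has one row per (id, quarter) combination, which mirrors the pytimetk output described below; summarize_by_time() wraps this pattern in one chainable call.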
Key Takeaways: summarize_by_time()
The data must comply with the 3 core properties (date column, value column(s), and group column(s))
The aggregation functions were applied by combination of group (id) and resample (Quarter Start)
The result was a pandas DataFrame with the group column, a resampled date column, and summary values (mean, median, min, 25th quantile, etc.)
Another DataFrame Example: Creating 29 Engineered Features
Let’s examine another DataFrame function, tk.augment_timeseries_signature(). Feel free to inspect the documentation with help(tk.augment_timeseries_signature).
Code
# Creating 29 engineered features from the date column
# Not run: help(tk.augment_timeseries_signature)
df_augmented = (
    m4_daily_df
        .augment_timeseries_signature(date_column = 'date')
)
df_augmented.head()
    id       date   value  date_index_num  date_year  date_year_iso  date_yearstart  date_yearend  date_leapyear  date_half  ...  date_mday  date_qday  date_yday  date_weekend  date_hour  date_minute  date_second  date_msecond  date_nsecond  date_am_pm
0  D10 2014-07-03  2076.2      1404345600       2014           2014               0             0              0          2  ...          3          3        184             0          0            0            0             0             0          am
1  D10 2014-07-04  2073.4      1404432000       2014           2014               0             0              0          2  ...          4          4        185             0          0            0            0             0             0          am
2  D10 2014-07-05  2048.7      1404518400       2014           2014               0             0              0          2  ...          5          5        186             0          0            0            0             0             0          am
3  D10 2014-07-06  2048.9      1404604800       2014           2014               0             0              0          2  ...          6          6        187             1          0            0            0             0             0          am
4  D10 2014-07-07  2006.4      1404691200       2014           2014               0             0              0          2  ...          7          7        188             0          0            0            0             0             0          am

[5 rows x 32 columns]
Key Takeaways: augment_timeseries_signature()
The data must comply with 1 of the 3 core properties (the date column)
The result was a pandas DataFrame with 29 time series features that can be used for Machine Learning and Forecasting
Making Future Dates with Future Frame
A common time series task before forecasting with machine learning models is to make a future DataFrame some length_out into the future. You can do this with tk.future_frame(). Here’s how.
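Conceptually, future_frame() appends length_out new rows per group, extending each group's date sequence at its observed frequency and leaving the value column empty. A pandas-only sketch of that idea (the toy data and 3-day horizon are illustrative):

```python
import pandas as pd

# Toy stand-in for m4_daily: two ids with a couple of daily observations
df = pd.DataFrame({
    'id':    ['D10', 'D10', 'D500', 'D500'],
    'date':  pd.to_datetime(['2014-07-03', '2014-07-04', '2012-09-19', '2012-09-20']),
    'value': [2076.2, 2073.4, 9418.8, 9365.7],
})

# For each group, append 3 future daily rows with no value (NaN)
pieces = []
for key, g in df.groupby('id'):
    future_dates = pd.date_range(
        g['date'].max() + pd.Timedelta(days = 1), periods = 3, freq = 'D'
    )
    future = pd.DataFrame({'id': key, 'date': future_dates})
    pieces.append(pd.concat([g, future], ignore_index = True))

full = pd.concat(pieces, ignore_index = True)
print(full)
```

The appended rows have NaN in the value column, which is how the future observations are identified later when we filter for them.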
Code
# Preparing a time series data set for Machine Learning Forecasting
full_augmented_df = (
    m4_daily_df
        .groupby('id')
        .future_frame('date', length_out = 365)
        .augment_timeseries_signature('date')
)
full_augmented_df
         id       date   value  date_index_num  date_year  date_year_iso  date_yearstart  date_yearend  date_leapyear  date_half  ...  date_mday  date_qday  date_yday  date_weekend  date_hour  date_minute  date_second  date_msecond  date_nsecond  date_am_pm
0       D10 2014-07-03  2076.2      1404345600       2014           2014               0             0              0          2  ...          3          3        184             0          0            0            0             0             0          am
1       D10 2014-07-04  2073.4      1404432000       2014           2014               0             0              0          2  ...          4          4        185             0          0            0            0             0             0          am
2       D10 2014-07-05  2048.7      1404518400       2014           2014               0             0              0          2  ...          5          5        186             0          0            0            0             0             0          am
3       D10 2014-07-06  2048.9      1404604800       2014           2014               0             0              0          2  ...          6          6        187             1          0            0            0             0             0          am
4       D10 2014-07-07  2006.4      1404691200       2014           2014               0             0              0          2  ...          7          7        188             0          0            0            0             0             0          am
...     ...        ...     ...             ...        ...            ...             ...           ...            ...        ...  ...        ...        ...        ...           ...        ...          ...          ...           ...           ...         ...
11198  D500 2013-09-19     NaN      1379548800       2013           2013               0             0              0          2  ...         19         81        262             0          0            0            0             0             0          am
11199  D500 2013-09-20     NaN      1379635200       2013           2013               0             0              0          2  ...         20         82        263             0          0            0            0             0             0          am
11200  D500 2013-09-21     NaN      1379721600       2013           2013               0             0              0          2  ...         21         83        264             0          0            0            0             0             0          am
11201  D500 2013-09-22     NaN      1379808000       2013           2013               0             0              0          2  ...         22         84        265             1          0            0            0             0             0          am
11202  D500 2013-09-23     NaN      1379894400       2013           2013               0             0              0          2  ...         23         85        266             0          0            0            0             0             0          am

[11203 rows x 32 columns]
We can then isolate the future data by filtering for rows where the value column is missing (np.nan).
Code
# Get the future data (just the observations that haven't happened yet)
future_df = (
    full_augmented_df
        .query('value.isna()')
)
future_df
         id       date  value  date_index_num  date_year  date_year_iso  date_yearstart  date_yearend  date_leapyear  date_half  ...  date_mday  date_qday  date_yday  date_weekend  date_hour  date_minute  date_second  date_msecond  date_nsecond  date_am_pm
9743    D10 2016-05-07    NaN      1462579200       2016           2016               0             0              1          1  ...          7         37        128             0          0            0            0             0             0          am
9744    D10 2016-05-08    NaN      1462665600       2016           2016               0             0              1          1  ...          8         38        129             1          0            0            0             0             0          am
9745    D10 2016-05-09    NaN      1462752000       2016           2016               0             0              1          1  ...          9         39        130             0          0            0            0             0             0          am
9746    D10 2016-05-10    NaN      1462838400       2016           2016               0             0              1          1  ...         10         40        131             0          0            0            0             0             0          am
9747    D10 2016-05-11    NaN      1462924800       2016           2016               0             0              1          1  ...         11         41        132             0          0            0            0             0             0          am
...     ...        ...    ...             ...        ...            ...             ...           ...            ...        ...  ...        ...        ...        ...           ...        ...          ...          ...           ...           ...         ...
11198  D500 2013-09-19    NaN      1379548800       2013           2013               0             0              0          2  ...         19         81        262             0          0            0            0             0             0          am
11199  D500 2013-09-20    NaN      1379635200       2013           2013               0             0              0          2  ...         20         82        263             0          0            0            0             0             0          am
11200  D500 2013-09-21    NaN      1379721600       2013           2013               0             0              0          2  ...         21         83        264             0          0            0            0             0             0          am
11201  D500 2013-09-22    NaN      1379808000       2013           2013               0             0              0          2  ...         22         84        265             1          0            0            0             0             0          am
11202  D500 2013-09-23    NaN      1379894400       2013           2013               0             0              0          2  ...         23         85        266             0          0            0            0             0             0          am

[1460 rows x 32 columns]
2.2 Type 2: Pandas Series Operations
The main difference between a DataFrame operation and a Series operation is that we are operating on an array of values, typically one of the following dtypes:
Timestamps (datetime64)
Numeric (float64 or int64)
The first argument of Series operations that operate on Timestamps will always be idx.
Let’s take a look at one, shall we? We’ll start with a common action: making a future time series from an existing time series with a regular frequency.
The Make Future Time Series Function
Say we have a monthly sequence of timestamps. What if we want to create a forecast where we predict 12 months into the future? Well, we will need to create 12 future timestamps. Here’s how.
First, create a pd.date_range() with dates starting at the beginning of each month.
Code
# Make a monthly date range
dates_dt = pd.date_range("2023-01", "2024-01", freq = "MS")
dates_dt
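The step that produces future_dt (used in the next code block) is pytimetk's make future time series function, tk.make_future_timeseries(). Since that call isn't shown above, here is a pandas-only sketch of the same idea: infer the index's frequency and extend it 12 periods past its last timestamp (variable names mirror the document's):

```python
import pandas as pd

# The monthly date range from above
dates_dt = pd.date_range("2023-01", "2024-01", freq = "MS")

# Infer the frequency and extend 12 periods past the last timestamp
freq = pd.infer_freq(dates_dt)
future_dt = pd.date_range(
    start   = dates_dt[-1] + pd.tseries.frequencies.to_offset(freq),
    periods = 12,
    freq    = freq,
)
print(future_dt)
```

The result is 12 future month-start timestamps (2024-02-01 through 2025-01-01), ready to be combined with the actuals below.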
We can combine the actual and future timestamps into one combined time series.
Code
# Combining the 2 series and resetting the index
combined_timeseries = (
    pd.concat(
        [pd.Series(dates_dt), pd.Series(future_dt)],
        axis = 0
    )
    .reset_index(drop = True)
)
combined_timeseries
Next, we’ll take a look at how to go from an irregular time series to a regular time series.
Flooring Dates
An example is tk.floor_date, which is used to round down dates. See help(tk.floor_date).
Flooring dates is often part of a strategy for going from an irregular time series to a regular one, by combining the flooring with an aggregation. Often summarize_by_time() is used (I’ll share why shortly), but conceptually, date flooring is the secret.
This “date flooring” operation can be useful for creating date groupings.
Code
# Adding a date group with floor_date()
dates_grouped_by_month = (
    m4_daily_df
        .assign(date_group = lambda x: x['date'].floor_date("M"))
)
dates_grouped_by_month
        id       date   value date_group
0      D10 2014-07-03  2076.2 2014-07-01
1      D10 2014-07-03  2073.4 2014-07-01
2      D10 2014-07-03  2048.7 2014-07-01
3      D10 2014-07-03  2048.9 2014-07-01
4      D10 2014-07-03  2006.4 2014-07-01
...    ...        ...     ...        ...
9738  D500 2014-07-03  9418.8 2014-07-01
9739  D500 2014-07-03  9365.7 2014-07-01
9740  D500 2014-07-03  9445.9 2014-07-01
9741  D500 2014-07-03  9497.9 2014-07-01
9742  D500 2014-07-03  9545.3 2014-07-01

[9743 rows x 4 columns]
We can then do grouped operations.
Code
# Example of a grouped operation with floored dates
summary_df = (
    dates_grouped_by_month
        .drop('date', axis = 1)
        .groupby(['id', 'date_group'])
        .mean()
        .reset_index()
)
summary_df
     id date_group        value
0   D10 2014-07-01  2261.606825
1  D160 2014-07-01  9243.155254
2  D410 2014-07-01  8259.786346
3  D500 2014-07-01  8287.728789
Of course, we can do this operation faster with summarize_by_time() (and it’s much more flexible).
Code
# Summarize by time is less code and more flexible
(
    m4_daily_df
        .groupby('id')
        .summarize_by_time(
            'date', 'value',
            freq     = "MS",
            agg_func = ['mean', 'median', 'min', 'max']
        )
)
     id       date   value_mean  value_median  value_min  value_max
0   D10 2014-07-01  2261.606825       2302.30    1781.60    2649.30
1  D160 2014-07-01  9243.155254      10097.30    1734.90   19432.50
2  D410 2014-07-01  8259.786346       8382.81    6309.38    9540.62
3  D500 2014-07-01  8287.728789       7662.10    4172.10   14954.10
And that’s the core idea behind pytimetk, writing less code and getting more.
Next, let’s do one more function. The brother of augment_timeseries_signature()…
The Get Time Series Signature Function
This function takes a pandas Series or DateTimeIndex and returns a DataFrame containing the 29 engineered features.
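To give a feel for what the signature contains, here is a pandas-only sketch that builds a handful of the 29 features with the .dt accessor. The column names mirror the augment output shown earlier, but the weekend flag uses a conventional Saturday/Sunday definition, which may differ from pytimetk's own:

```python
import pandas as pd

# A small daily timestamp series, as you might pass to the function
dates = pd.Series(pd.date_range('2014-07-03', periods = 5, freq = 'D'))

# A few of the signature features, built with the pandas .dt accessor
signature = pd.DataFrame({
    'date':         dates,
    'date_year':    dates.dt.year,
    'date_half':    (dates.dt.quarter + 1) // 2,   # 1 = Jan-Jun, 2 = Jul-Dec
    'date_mday':    dates.dt.day,
    'date_yday':    dates.dt.dayofyear,
    'date_weekend': (dates.dt.dayofweek >= 5).astype(int),  # Sat/Sun
})
print(signature)
```

The full pytimetk signature extends this same idea to 29 calendar and clock features (year, half, quarter, month, day, hour, minute, second, and so on) computed from a single timestamp column.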