1 Introducing pytimetk: Simplifying Time Series Analysis for Everyone
Time series analysis is fundamental in many fields, from business forecasting to scientific research. While the Python ecosystem offers tools like pandas, they sometimes can be verbose and not optimized for all operations, especially for complex time-based aggregations and visualizations.
Enter pytimetk. Crafted with a blend of ease-of-use and computational efficiency, pytimetk significantly simplifies the process of time series manipulation and visualization. By leveraging the polars backend, you can experience speed improvements ranging from 3X to a whopping 3500X. Let's dive into a comparative analysis.
| Features/Properties | pytimetk | pandas (+matplotlib) |
|---|---|---|
| Speed | 3X to 500X faster | Standard |
| Code simplicity | Concise, readable syntax | Often verbose |
| plot_timeseries() | 2 lines, no customization | 16 lines, customization needed |
| summarize_by_time() | 2 lines, 13.4X faster | 6 lines, 2 for-loops |
| pad_by_time() | 2 lines, fills gaps in timeseries | No equivalent |
| anomalize() | 2 lines, detects and corrects anomalies | No equivalent |
| augment_timeseries_signature() | 1 line, all calendar features | 30 lines of dt extractors |
| augment_rolling() | 10X to 3500X faster | Slow rolling operations |
As is evident from the table, pytimetk is not just about speed; it also simplifies your codebase. For example, summarize_by_time() converts a 6-line, double for-loop routine in pandas into a concise 2-line operation. And with the polars engine, you get results 13.4X faster than pandas!
Similarly, plot_timeseries() dramatically streamlines the plotting process, encapsulating what would typically require 16 lines of matplotlib code into a mere 2-line command in pytimetk, without sacrificing customization or quality. And with plotly and plotnine engines, you can create interactive plots and beautiful static visualizations with just a few lines of code.
For calendar features, pytimetk offers augment_timeseries_signature() which cuts down on over 30 lines of pandas dt extractions. For rolling features, pytimetk offers augment_rolling(), which is 10X to 3500X faster than pandas. It also offers pad_by_time() to fill gaps in your time series data, and anomalize() to detect and correct anomalies in your time series data.
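To give a sense of what the calendar-feature boilerplate looks like, here is a minimal plain-pandas sketch of a handful of dt extractors. The sample dates and the output column names are illustrative assumptions and are not claimed to match pytimetk's output; per the text above, augment_timeseries_signature() is the one-call replacement for a much longer list of this kind.

```python
import pandas as pd

# Illustrative dates only (assumption): any datetime column would do.
df = pd.DataFrame(
    {"order_date": pd.to_datetime(["2011-01-07", "2011-01-10", "2011-12-31"])}
)

# Each calendar feature needs its own extractor when written by hand.
dates = df["order_date"].dt
features = df.assign(
    order_date_year=dates.year,
    order_date_quarter=dates.quarter,
    order_date_month=dates.month,
    order_date_day=dates.day,
    order_date_wday=dates.dayofweek,  # Monday is 0
    order_date_is_month_end=dates.is_month_end,
)

print(features)
```

Every extra feature (week number, day of year, leap year flag, and so on) adds another line, which is where the line count in the comparison comes from.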
Join the revolution in time series analysis. Reduce your code complexity, increase your productivity, and harness the speed that pytimetk brings to your workflows.
First, import pytimetk as tk. This gets you access to the most important functions. Use tk.load_dataset() to load the "bike_sales_sample" dataset.
About the Bike Sales Sample Dataset
This dataset contains "orderlines" for orders received. The order_date column contains timestamps. We can use this column to perform sales aggregations (e.g. total revenue).
    import pytimetk as tk
    import pandas as pd

    df = tk.load_dataset('bike_sales_sample')
    df['order_date'] = pd.to_datetime(df['order_date'])
    df
| | order_id | order_line | order_date | quantity | price | total_price | model | category_1 | category_2 | frame_material | bikeshop_name | city | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2011-01-07 | 1 | 6070 | 6070 | Jekyll Carbon 2 | Mountain | Over Mountain | Carbon | Ithaca Mountain Climbers | Ithaca | NY |
| 1 | 1 | 2 | 2011-01-07 | 1 | 5970 | 5970 | Trigger Carbon 2 | Mountain | Over Mountain | Carbon | Ithaca Mountain Climbers | Ithaca | NY |
| 2 | 2 | 1 | 2011-01-10 | 1 | 2770 | 2770 | Beast of the East 1 | Mountain | Trail | Aluminum | Kansas City 29ers | Kansas City | KS |
| 3 | 2 | 2 | 2011-01-10 | 1 | 5970 | 5970 | Trigger Carbon 2 | Mountain | Over Mountain | Carbon | Kansas City 29ers | Kansas City | KS |
| 4 | 3 | 1 | 2011-01-10 | 1 | 10660 | 10660 | Supersix Evo Hi-Mod Team | Road | Elite Road | Carbon | Louisville Race Equipment | Louisville | KY |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2461 | 321 | 3 | 2011-12-22 | 1 | 1410 | 1410 | CAAD8 105 | Road | Elite Road | Aluminum | Miami Race Equipment | Miami | FL |
| 2462 | 322 | 1 | 2011-12-28 | 1 | 1250 | 1250 | Synapse Disc Tiagra | Road | Endurance Road | Aluminum | Phoenix Bi-peds | Phoenix | AZ |
| 2463 | 322 | 2 | 2011-12-28 | 1 | 2660 | 2660 | Bad Habit 2 | Mountain | Trail | Aluminum | Phoenix Bi-peds | Phoenix | AZ |
| 2464 | 322 | 3 | 2011-12-28 | 1 | 2340 | 2340 | F-Si 1 | Mountain | Cross Country Race | Aluminum | Phoenix Bi-peds | Phoenix | AZ |
| 2465 | 322 | 4 | 2011-12-28 | 1 | 5860 | 5860 | Synapse Hi-Mod Dura Ace | Road | Endurance Road | Carbon | Phoenix Bi-peds | Phoenix | AZ |

2466 rows × 13 columns
Using summarize_by_time() for a Sales Analysis
Your company might be interested in sales patterns for various categories of bicycles. We can obtain a grouped monthly sales aggregation by category_1 in two lines of code:
First, use pandas's groupby() method to group the DataFrame on category_1.
Next, use pytimetk's summarize_by_time() method to apply the sum function by month start ("MS"), and use wide_format = False to return the DataFrame in long format (note that long format is the default). The default engine is "pandas"; selecting engine = "polars" improves the speed of the function.
The result is the total revenue for Mountain and Road bikes by month.
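For comparison, the same kind of grouped monthly aggregation can be sketched in plain pandas with pd.Grouper. This is only a rough equivalent under stated assumptions, not pytimetk's implementation: the ten orderlines are copied from the head and tail of the dataset preview above, so the totals cover just those rows rather than the full dataset.

```python
import pandas as pd

# Ten orderlines copied from the dataset preview (first five and last five rows).
orders = pd.DataFrame(
    {
        "order_date": pd.to_datetime(
            [
                "2011-01-07", "2011-01-07", "2011-01-10", "2011-01-10", "2011-01-10",
                "2011-12-22", "2011-12-28", "2011-12-28", "2011-12-28", "2011-12-28",
            ]
        ),
        "category_1": [
            "Mountain", "Mountain", "Mountain", "Mountain", "Road",
            "Road", "Road", "Mountain", "Mountain", "Road",
        ],
        "total_price": [6070, 5970, 2770, 5970, 10660, 1410, 1250, 2660, 2340, 5860],
    }
)

# Group by category and by month start ("MS"), then sum revenue.
# reset_index() returns the result in long format: one row per category and month.
monthly = (
    orders
    .groupby(["category_1", pd.Grouper(key="order_date", freq="MS")])["total_price"]
    .sum()
    .reset_index()
)

print(monthly)
```

Each row of the result is labeled with the first day of its month, which is what the "MS" frequency alias means.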
plot_timeseries() is a quick and easy way to visualize time series and make professional time series plots.
With the data summarized by time, we can visualize it with plot_timeseries(). pytimetk functions are groupby()-aware, meaning they detect when your data is grouped and operate on each group separately. This is useful in time series work, where we often deal with hundreds of time series groups.
The "plotnine" engine produces static plots. Setting engine = "plotly" returns an interactive plot.
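To make the idea of working "by group" concrete, here is a small plain-pandas sketch that splits a long-format summary into one series per category, which is roughly the bookkeeping a group-aware function saves you from writing. The monthly values are invented for illustration, and this is not a description of how pytimetk handles groups internally.

```python
import pandas as pd

# Hypothetical monthly totals in long format (values invented for illustration).
summary = pd.DataFrame(
    {
        "category_1": ["Mountain", "Mountain", "Road", "Road"],
        "order_date": pd.to_datetime(
            ["2011-01-01", "2011-02-01", "2011-01-01", "2011-02-01"]
        ),
        "total_price": [100, 150, 80, 120],
    }
)

# Split the long table into one date-indexed series per group.
series_by_group = {
    name: group.set_index("order_date")["total_price"]
    for name, group in summary.groupby("category_1")
}

# Each group can now be plotted or processed on its own.
for name, series in series_by_group.items():
    print(name, series.tolist())
```

With hundreds of groups, writing this loop by hand for every operation gets tedious, which is the case the group-aware functions are meant to cover.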