The summarize_by_time function aggregates one or more numeric value columns of a DataFrame (or GroupBy object) at a specified time frequency, with optional grouping and customizable aggregation functions.
Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `data` | pd.DataFrame or pd.core.groupby.generic.DataFrameGroupBy | A pandas DataFrame or a pandas GroupBy object. This is the data that you want to summarize by time. | required |
| `date_column` | str | The name of the column in the data frame that contains the dates or timestamps to be aggregated by. This column must be of type datetime64. | required |
| `value_column` | str or list | The name of one or more columns in the DataFrame that you want to aggregate. It can be either a string representing a single column name, or a list of strings representing multiple column names. | required |
| `freq` | str | The frequency at which the data should be aggregated, given as a pandas frequency offset such as `"D"` for daily or `"MS"` for month start. Common frequency aliases: `S` (secondly), `min` (minutely), `H` (hourly), `D` (daily), `W` (weekly), `M` (month end), `MS` (month start), `Q` (quarter end), `QS` (quarter start), `Y` (year end), `YS` (year start). | `'D'` |
| `agg_func` | str or list | One or more aggregating functions to apply to the value column(s) during summarization; either a single function or a list of functions. Common aggregating functions: `"sum"`, `"mean"`, `"median"`, `"min"`, `"max"`, `"std"`, `"var"`, `"first"`, `"last"`, `"count"`, `"nunique"`, `"corr"`. Pandas engine only: custom lambda aggregating functions can be used too, supplied as `(name, func)` tuples, e.g. `("q25", lambda x: x.quantile(0.25))` (25th percentile), `("q75", lambda x: x.quantile(0.75))` (75th percentile), `("iqr", lambda x: x.quantile(0.75) - x.quantile(0.25))` (interquartile range), `("range", lambda x: x.max() - x.min())` (range of values). | `'sum'` |
| `wide_format` | bool | Determines whether the output is in wide or long format. If `True`, each group is represented by a separate column; if `False`, each group is represented by a separate row. | `False` |
| `fillna` | int | The value used to fill missing data. Defaults to `0`; pass `np.nan` to keep missing values as NaN. | `0` |
| `engine` | str | The engine used to summarize the data, either `"pandas"` or `"polars"`. When `"polars"`, the function internally uses the polars library, which can be faster than `"pandas"` for large datasets. | `'pandas'` |
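Conceptually, the pandas engine's behavior is a resample-then-aggregate. The sketch below uses made-up data and plain pandas (it is not pytimetk's actual implementation) to illustrate what `freq` and `agg_func` control:

```python
import pandas as pd

# Hypothetical sales data: one row per order line
df = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2011-01-05", "2011-01-20", "2011-02-02", "2011-02-28"]
    ),
    "total_price": [100, 250, 75, 125],
})

# Roughly what summarize_by_time(date_column='order_date',
# value_column='total_price', freq='MS', agg_func=['mean', 'sum'])
# computes with the pandas engine
result = (
    df.set_index("order_date")
      .resample("MS")                       # month-start bins
      .agg({"total_price": ["mean", "sum"]})
)
result.columns = ["_".join(col) for col in result.columns]
result = result.reset_index()
# result has columns: order_date, total_price_mean, total_price_sum
```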
Returns

| Type | Description |
|------|-------------|
| pd.DataFrame | A pandas DataFrame that is summarized by time. |
Examples
```python
import pytimetk as tk
import pandas as pd

df = tk.load_dataset('bike_sales_sample', parse_dates=['order_date'])
df
```
| | order_id | order_line | order_date | quantity | price | total_price | model | category_1 | category_2 | frame_material | bikeshop_name | city | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2011-01-07 | 1 | 6070 | 6070 | Jekyll Carbon 2 | Mountain | Over Mountain | Carbon | Ithaca Mountain Climbers | Ithaca | NY |
| 1 | 1 | 2 | 2011-01-07 | 1 | 5970 | 5970 | Trigger Carbon 2 | Mountain | Over Mountain | Carbon | Ithaca Mountain Climbers | Ithaca | NY |
| 2 | 2 | 1 | 2011-01-10 | 1 | 2770 | 2770 | Beast of the East 1 | Mountain | Trail | Aluminum | Kansas City 29ers | Kansas City | KS |
| 3 | 2 | 2 | 2011-01-10 | 1 | 5970 | 5970 | Trigger Carbon 2 | Mountain | Over Mountain | Carbon | Kansas City 29ers | Kansas City | KS |
| 4 | 3 | 1 | 2011-01-10 | 1 | 10660 | 10660 | Supersix Evo Hi-Mod Team | Road | Elite Road | Carbon | Louisville Race Equipment | Louisville | KY |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2461 | 321 | 3 | 2011-12-22 | 1 | 1410 | 1410 | CAAD8 105 | Road | Elite Road | Aluminum | Miami Race Equipment | Miami | FL |
| 2462 | 322 | 1 | 2011-12-28 | 1 | 1250 | 1250 | Synapse Disc Tiagra | Road | Endurance Road | Aluminum | Phoenix Bi-peds | Phoenix | AZ |
| 2463 | 322 | 2 | 2011-12-28 | 1 | 2660 | 2660 | Bad Habit 2 | Mountain | Trail | Aluminum | Phoenix Bi-peds | Phoenix | AZ |
| 2464 | 322 | 3 | 2011-12-28 | 1 | 2340 | 2340 | F-Si 1 | Mountain | Cross Country Race | Aluminum | Phoenix Bi-peds | Phoenix | AZ |
| 2465 | 322 | 4 | 2011-12-28 | 1 | 5860 | 5860 | Synapse Hi-Mod Dura Ace | Road | Endurance Road | Carbon | Phoenix Bi-peds | Phoenix | AZ |

2466 rows × 13 columns
```python
# Example 1 - Summarize by time with a DataFrame object, pandas engine
(
    df
        .summarize_by_time(
            date_column  = 'order_date',
            value_column = 'total_price',
            freq         = "MS",
            agg_func     = ['mean', 'sum'],
            engine       = 'pandas'
        )
)
```
| | order_date | total_price_mean | total_price_sum |
|---|---|---|---|
| 0 | 2011-01-01 | 4600.142857 | 483015 |
| 1 | 2011-02-01 | 4611.408730 | 1162075 |
| 2 | 2011-03-01 | 5196.653543 | 659975 |
| 3 | 2011-04-01 | 4533.846154 | 1827140 |
| 4 | 2011-05-01 | 4097.912621 | 844170 |
| 5 | 2011-06-01 | 4544.839228 | 1413445 |
| 6 | 2011-07-01 | 4976.791667 | 1194430 |
| 7 | 2011-08-01 | 4961.970803 | 679790 |
| 8 | 2011-09-01 | 4682.298851 | 814720 |
| 9 | 2011-10-01 | 3930.053476 | 734920 |
| 10 | 2011-11-01 | 4768.175355 | 1006085 |
| 11 | 2011-12-01 | 4186.902655 | 473120 |
```python
# Example 2 - Summarize by time with a GroupBy object (Wide Format), polars engine
(
    df
        .groupby(['category_1', 'frame_material'])
        .summarize_by_time(
            date_column  = 'order_date',
            value_column = ['total_price', 'quantity'],
            freq         = 'MS',
            agg_func     = 'sum',
            wide_format  = True,
            engine       = 'polars'
        )
)
```
| | order_date | total_price_sum_Mountain_Aluminum | total_price_sum_Mountain_Carbon | total_price_sum_Road_Aluminum | total_price_sum_Road_Carbon | quantity_sum_Mountain_Aluminum | quantity_sum_Mountain_Carbon | quantity_sum_Road_Aluminum | quantity_sum_Road_Carbon |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 | 66290 | 155200 | 61005 | 200520 | 34 | 23 | 30 | 41 |
| 1 | 2011-02-01 | 245115 | 415440 | 100480 | 401040 | 118 | 68 | 52 | 93 |
| 2 | 2011-03-01 | 82025 | 276830 | 63390 | 237730 | 41 | 46 | 33 | 54 |
| 3 | 2011-04-01 | 340725 | 735250 | 197705 | 553460 | 164 | 130 | 104 | 144 |
| 4 | 2011-05-01 | 160130 | 290310 | 127600 | 266130 | 93 | 53 | 75 | 81 |
| 5 | 2011-06-01 | 183680 | 539360 | 174655 | 515750 | 96 | 91 | 82 | 142 |
| 6 | 2011-07-01 | 186030 | 581710 | 98090 | 328600 | 94 | 91 | 53 | 82 |
| 7 | 2011-08-01 | 119785 | 241470 | 65855 | 252680 | 53 | 34 | 36 | 69 |
| 8 | 2011-09-01 | 100455 | 300670 | 78485 | 335110 | 59 | 47 | 36 | 77 |
| 9 | 2011-10-01 | 105035 | 272300 | 83105 | 274480 | 61 | 43 | 42 | 71 |
| 10 | 2011-11-01 | 102045 | 447300 | 90050 | 366690 | 55 | 79 | 51 | 95 |
| 11 | 2011-12-01 | 111125 | 164930 | 45555 | 151510 | 55 | 27 | 27 | 43 |
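The wide layout above can be reproduced in plain pandas by aggregating in long format and then pivoting the group keys into columns. A sketch with hypothetical data (one way to mirror `wide_format=True`, not pytimetk's internal code):

```python
import pandas as pd

# Hypothetical mini dataset (not the bike_sales_sample data)
df = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2011-01-03", "2011-01-15", "2011-02-10", "2011-02-11"]
    ),
    "category_1": ["Mountain", "Road", "Mountain", "Road"],
    "total_price": [500, 300, 700, 400],
})

# Long format: monthly sums per group
long = (
    df.groupby(["category_1", pd.Grouper(key="order_date", freq="MS")])["total_price"]
      .sum()
      .reset_index()
)

# Pivot the group keys into columns, mirroring wide_format=True
wide = long.pivot(index="order_date", columns="category_1", values="total_price")
wide.columns = [f"total_price_{c}" for c in wide.columns]
wide = wide.reset_index()
# wide has columns: order_date, total_price_Mountain, total_price_Road
```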
```python
# Example 3 - Summarize by time with a GroupBy object (Wide Format)
(
    df
        .groupby('category_1')
        .summarize_by_time(
            date_column  = 'order_date',
            value_column = 'total_price',
            freq         = 'MS',
            agg_func     = 'sum',
            wide_format  = True,
            engine       = 'pandas'
        )
)
```
| | order_date | total_price_Mountain | total_price_Road |
|---|---|---|---|
| 0 | 2011-01-01 | 221490 | 261525 |
| 1 | 2011-02-01 | 660555 | 501520 |
| 2 | 2011-03-01 | 358855 | 301120 |
| 3 | 2011-04-01 | 1075975 | 751165 |
| 4 | 2011-05-01 | 450440 | 393730 |
| 5 | 2011-06-01 | 723040 | 690405 |
| 6 | 2011-07-01 | 767740 | 426690 |
| 7 | 2011-08-01 | 361255 | 318535 |
| 8 | 2011-09-01 | 401125 | 413595 |
| 9 | 2011-10-01 | 377335 | 357585 |
| 10 | 2011-11-01 | 549345 | 456740 |
| 11 | 2011-12-01 | 276055 | 197065 |
```python
# Example 4 - Summarize by time with a GroupBy object and multiple value columns
# and summaries (Long Format)
# Note - This example only works with the pandas engine
(
    df
        .groupby('category_1')
        .summarize_by_time(
            date_column  = 'order_date',
            value_column = ['total_price', 'quantity'],
            freq         = 'MS',
            agg_func     = [
                'sum',
                'mean',
                ('q25', lambda x: x.quantile(0.25)),
                ('q75', lambda x: x.quantile(0.75))
            ],
            wide_format  = False,
            engine       = 'pandas'
        )
)
```
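The `('name', lambda)` tuples above follow pandas' list-of-tuples aggregation convention, where each tuple supplies an output column label and the function to apply. A minimal pandas-only sketch with hypothetical data (illustrating the convention, not pytimetk itself):

```python
import pandas as pd

# Hypothetical grouped data
df = pd.DataFrame({
    "g": ["a", "a", "a", "b", "b"],
    "v": [10, 20, 30, 40, 50],
})

# Each ('label', func) tuple names its output column
out = df.groupby("g")["v"].agg([
    ("q25", lambda x: x.quantile(0.25)),
    ("q75", lambda x: x.quantile(0.75)),
])
# out has one row per group and columns q25, q75
```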