summarize_by_time

summarize_by_time(data, date_column, value_column, freq='D', agg_func='sum', wide_format=False, fillna=0, engine='pandas')

Summarize a DataFrame or GroupBy object by time.

The summarize_by_time function aggregates one or more numeric columns by a specified time period, with optional grouping and control over the aggregating functions used.

Parameters

data : pd.DataFrame or pd.core.groupby.generic.DataFrameGroupBy (required)
    A pandas DataFrame or a pandas GroupBy object. This is the data that you want to summarize by time.

date_column : str (required)
    The name of the column in the data frame that contains the dates or timestamps to aggregate by. This column must be of type datetime64.

value_column : str or list (required)
    The name of one or more columns in the DataFrame that you want to aggregate. It can be either a string representing a single column name, or a list of strings representing multiple column names.

freq : str (default: 'D')
    The frequency at which the data should be aggregated, given as a pandas frequency offset string such as "D" for daily or "MS" for month start. The default is "D", which aggregates the data on a daily basis. Some common frequency aliases include:
    - S: secondly frequency
    - min: minute frequency
    - H: hourly frequency
    - D: daily frequency
    - W: weekly frequency
    - M: month end frequency
    - MS: month start frequency
    - Q: quarter end frequency
    - QS: quarter start frequency
    - Y: year end frequency
    - YS: year start frequency

agg_func : str or list (default: 'sum')
    One or more aggregating functions to apply to the value column(s) during the summarization. It can be a single function or a list of functions. The default is "sum". Some common aggregating functions include:
    - "sum": Sum of values
    - "mean": Mean of values
    - "median": Median of values
    - "min": Minimum of values
    - "max": Maximum of values
    - "std": Standard deviation of values
    - "var": Variance of values
    - "first": First value in group
    - "last": Last value in group
    - "count": Count of values
    - "nunique": Number of unique values
    - "corr": Correlation between values
    Pandas engine only: custom lambda aggregating functions can be used too, passed as (name, function) tuples. Several common examples:
    - ("q25", lambda x: x.quantile(0.25)): 25th percentile of values
    - ("q75", lambda x: x.quantile(0.75)): 75th percentile of values
    - ("iqr", lambda x: x.quantile(0.75) - x.quantile(0.25)): Interquartile range of values
    - ("range", lambda x: x.max() - x.min()): Range of values

wide_format : bool (default: False)
    Determines whether the output is in "wide" or "long" format. If True, each group is represented by a separate column; if False, each group is represented by a separate row.

fillna : int (default: 0)
    The value used to fill missing data. By default it is 0. To keep missing values as NaN, pass np.nan (see the sketch after this parameter list).

engine : str (default: 'pandas')
    The engine used to summarize the data, either "pandas" or "polars". The default is "pandas". When "polars", the function internally uses the polars library, which can be faster than "pandas" for large datasets.
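
As a quick illustration of how freq, agg_func, and fillna work together, here is a minimal sketch using the same bike_sales_sample dataset as the examples below; the weekly frequency and the 'q75' label are illustrative choices, not defaults:

import numpy as np
import pytimetk as tk

df = tk.load_dataset('bike_sales_sample', parse_dates = ['order_date'])

# Weekly sum and 75th percentile of total_price, keeping empty weeks as NaN
# instead of filling them with 0
(
    df
        .summarize_by_time(
            date_column  = 'order_date',
            value_column = 'total_price',
            freq         = 'W',
            agg_func     = ['sum', ('q75', lambda x: x.quantile(0.75))],
            fillna       = np.nan,
            engine       = 'pandas'
        )
)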

Returns

pd.DataFrame
    A pandas DataFrame that is summarized by time.

Examples

import pytimetk as tk
import pandas as pd

df = tk.load_dataset('bike_sales_sample', parse_dates = ['order_date'])

df
order_id order_line order_date quantity price total_price model category_1 category_2 frame_material bikeshop_name city state
0 1 1 2011-01-07 1 6070 6070 Jekyll Carbon 2 Mountain Over Mountain Carbon Ithaca Mountain Climbers Ithaca NY
1 1 2 2011-01-07 1 5970 5970 Trigger Carbon 2 Mountain Over Mountain Carbon Ithaca Mountain Climbers Ithaca NY
2 2 1 2011-01-10 1 2770 2770 Beast of the East 1 Mountain Trail Aluminum Kansas City 29ers Kansas City KS
3 2 2 2011-01-10 1 5970 5970 Trigger Carbon 2 Mountain Over Mountain Carbon Kansas City 29ers Kansas City KS
4 3 1 2011-01-10 1 10660 10660 Supersix Evo Hi-Mod Team Road Elite Road Carbon Louisville Race Equipment Louisville KY
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2461 321 3 2011-12-22 1 1410 1410 CAAD8 105 Road Elite Road Aluminum Miami Race Equipment Miami FL
2462 322 1 2011-12-28 1 1250 1250 Synapse Disc Tiagra Road Endurance Road Aluminum Phoenix Bi-peds Phoenix AZ
2463 322 2 2011-12-28 1 2660 2660 Bad Habit 2 Mountain Trail Aluminum Phoenix Bi-peds Phoenix AZ
2464 322 3 2011-12-28 1 2340 2340 F-Si 1 Mountain Cross Country Race Aluminum Phoenix Bi-peds Phoenix AZ
2465 322 4 2011-12-28 1 5860 5860 Synapse Hi-Mod Dura Ace Road Endurance Road Carbon Phoenix Bi-peds Phoenix AZ

2466 rows × 13 columns

# Example 1 - Summarize by time with a DataFrame object, pandas engine
( 
    df 
        .summarize_by_time(
            date_column  = 'order_date', 
            value_column = 'total_price',
            freq         = "MS",
            agg_func     = ['mean', 'sum'],
            engine       = 'pandas'
        )
)
order_date total_price_mean total_price_sum
0 2011-01-01 4600.142857 483015
1 2011-02-01 4611.408730 1162075
2 2011-03-01 5196.653543 659975
3 2011-04-01 4533.846154 1827140
4 2011-05-01 4097.912621 844170
5 2011-06-01 4544.839228 1413445
6 2011-07-01 4976.791667 1194430
7 2011-08-01 4961.970803 679790
8 2011-09-01 4682.298851 814720
9 2011-10-01 3930.053476 734920
10 2011-11-01 4768.175355 1006085
11 2011-12-01 4186.902655 473120
# Example 2 - Summarize by time with a GroupBy object (Wide Format), polars engine
(
    df 
        .groupby(['category_1', 'frame_material']) 
        .summarize_by_time(
            date_column  = 'order_date', 
            value_column = ['total_price', 'quantity'], 
            freq         = 'MS',
            agg_func     = 'sum',
            wide_format  = True, 
            engine       = 'polars'
        )
)
order_date total_price_sum_Mountain_Aluminum total_price_sum_Mountain_Carbon total_price_sum_Road_Aluminum total_price_sum_Road_Carbon quantity_sum_Mountain_Aluminum quantity_sum_Mountain_Carbon quantity_sum_Road_Aluminum quantity_sum_Road_Carbon
0 2011-01-01 66290 155200 61005 200520 34 23 30 41
1 2011-02-01 245115 415440 100480 401040 118 68 52 93
2 2011-03-01 82025 276830 63390 237730 41 46 33 54
3 2011-04-01 340725 735250 197705 553460 164 130 104 144
4 2011-05-01 160130 290310 127600 266130 93 53 75 81
5 2011-06-01 183680 539360 174655 515750 96 91 82 142
6 2011-07-01 186030 581710 98090 328600 94 91 53 82
7 2011-08-01 119785 241470 65855 252680 53 34 36 69
8 2011-09-01 100455 300670 78485 335110 59 47 36 77
9 2011-10-01 105035 272300 83105 274480 61 43 42 71
10 2011-11-01 102045 447300 90050 366690 55 79 51 95
11 2011-12-01 111125 164930 45555 151510 55 27 27 43
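
In this wide output the column names combine the value column, the aggregating function, and the group values, joined by underscores (for example total_price_sum_Mountain_Aluminum). If you later need the long layout back, a plain pandas melt is enough. A minimal sketch, assuming the wide result above has been assigned to wide_df (a hypothetical name):

# Reshape the wide result of Example 2 back to one row per (date, series) pair
long_df = wide_df.melt(
    id_vars    = 'order_date',
    var_name   = 'series',
    value_name = 'value'
)
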
# Example 3 - Summarize by time with a GroupBy object (Wide Format)
(
    df 
        .groupby('category_1') 
        .summarize_by_time(
            date_column  = 'order_date', 
            value_column = 'total_price', 
            freq         = 'MS',
            agg_func     = 'sum',
            wide_format  = True,
            engine       = 'pandas' 
        )
)
order_date total_price_Mountain total_price_Road
0 2011-01-01 221490 261525
1 2011-02-01 660555 501520
2 2011-03-01 358855 301120
3 2011-04-01 1075975 751165
4 2011-05-01 450440 393730
5 2011-06-01 723040 690405
6 2011-07-01 767740 426690
7 2011-08-01 361255 318535
8 2011-09-01 401125 413595
9 2011-10-01 377335 357585
10 2011-11-01 549345 456740
11 2011-12-01 276055 197065
# Example 4 - Summarize by time with a GroupBy object and multiple value columns and summaries (Long Format)
# Note - This example only works with the pandas engine
(
    df 
        .groupby('category_1') 
        .summarize_by_time(
            date_column  = 'order_date', 
            value_column = ['total_price', 'quantity'], 
            freq         = 'MS',
            agg_func     = [
                'sum', 
                'mean', 
                ('q25', lambda x: x.quantile(0.25)), 
                ('q75', lambda x: x.quantile(0.75))
            ],
            wide_format  = False,
            engine       = 'pandas' 
        )
)
category_1 order_date total_price_sum total_price_mean total_price_q25 total_price_q75 quantity_sum quantity_mean quantity_q25 quantity_q75
0 Mountain 2011-01-01 221490 4922.000000 2060.0 6070.0 57 1.266667 1.0 1.0
1 Mountain 2011-02-01 660555 4374.536424 2060.0 5330.0 186 1.231788 1.0 1.0
2 Mountain 2011-03-01 358855 5882.868852 2130.0 6390.0 87 1.426230 1.0 1.0
3 Mountain 2011-04-01 1075975 4890.795455 2060.0 5970.0 294 1.336364 1.0 1.0
4 Mountain 2011-05-01 450440 4549.898990 2010.0 6020.0 146 1.474747 1.0 1.0
5 Mountain 2011-06-01 723040 5021.111111 1950.0 5647.5 187 1.298611 1.0 1.0
6 Mountain 2011-07-01 767740 5444.964539 2130.0 6400.0 185 1.312057 1.0 1.0
7 Mountain 2011-08-01 361255 5734.206349 2235.0 6400.0 87 1.380952 1.0 2.0
8 Mountain 2011-09-01 401125 5077.531646 1620.0 6390.0 106 1.341772 1.0 1.0
9 Mountain 2011-10-01 377335 4439.235294 2160.0 6070.0 104 1.223529 1.0 1.0
10 Mountain 2011-11-01 549345 5282.163462 2340.0 7460.0 134 1.288462 1.0 1.0
11 Mountain 2011-12-01 276055 5208.584906 2060.0 6400.0 82 1.547170 1.0 1.0
12 Road 2011-01-01 261525 4358.750000 1950.0 5605.0 71 1.183333 1.0 1.0
13 Road 2011-02-01 501520 4965.544554 1950.0 5860.0 145 1.435644 1.0 1.0
14 Road 2011-03-01 301120 4562.424242 2240.0 5875.0 87 1.318182 1.0 1.0
15 Road 2011-04-01 751165 4104.726776 1950.0 4800.0 248 1.355191 1.0 1.0
16 Road 2011-05-01 393730 3679.719626 1570.0 3500.0 156 1.457944 1.0 1.0
17 Road 2011-06-01 690405 4134.161677 1840.0 4500.0 224 1.341317 1.0 1.0
18 Road 2011-07-01 426690 4310.000000 1895.0 5330.0 135 1.363636 1.0 1.0
19 Road 2011-08-01 318535 4304.527027 1950.0 4987.5 105 1.418919 1.0 1.0
20 Road 2011-09-01 413595 4353.631579 1950.0 5330.0 113 1.189474 1.0 1.0
21 Road 2011-10-01 357585 3505.735294 1750.0 4260.0 113 1.107843 1.0 1.0
22 Road 2011-11-01 456740 4268.598131 1950.0 4370.0 146 1.364486 1.0 1.0
23 Road 2011-12-01 197065 3284.416667 1652.5 3200.0 70 1.166667 1.0 1.0
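
Because this result is in long format, it can be reshaped to a wide layout with ordinary pandas when needed. A minimal sketch, assuming the result above has been assigned to long_df (a hypothetical name):

# Pivot the long result of Example 4 so each category_1 value gets its own
# set of columns, then flatten the resulting MultiIndex column names
wide_df = long_df.pivot(index = 'order_date', columns = 'category_1')
wide_df.columns = ['_'.join(col) for col in wide_df.columns]
wide_df = wide_df.reset_index()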