The summarize_by_time function aggregates one or more numeric value columns of a DataFrame (or GroupBy object) at a specified time frequency, with optional grouping and customizable aggregation functions.
Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `data` | pd.DataFrame or pd.core.groupby.generic.DataFrameGroupBy | A pandas DataFrame or a pandas GroupBy object. This is the data that you want to summarize by time. | required |
| `date_column` | str | The name of the column in the data frame that contains the dates or timestamps to be aggregated by. This column must be of type datetime64. | required |
| `value_column` | str or list | The name of one or more columns in the DataFrame that you want to aggregate. It can be either a string representing a single column name, or a list of strings representing multiple column names. | required |
| `freq` | str | The frequency at which the data should be aggregated, given as a pandas frequency offset such as `"D"` for daily or `"MS"` for month start. Common frequency aliases: `S` (secondly), `min` (minutely), `H` (hourly), `D` (daily), `W` (weekly), `M` (month end), `MS` (month start), `Q` (quarter end), `QS` (quarter start), `Y` (year end), `YS` (year start). | `'D'` |
| `agg_func` | str or list | One or more aggregating functions to apply to the value column(s) during summarization; either a single function or a list of functions. Common aggregating functions: `"sum"`, `"mean"`, `"median"`, `"min"`, `"max"`, `"std"`, `"var"`, `"first"`, `"last"`, `"count"`, `"nunique"`, `"corr"`. Pandas engine only: custom lambda aggregating functions can be used too, supplied as `(name, func)` tuples, e.g. `("q25", lambda x: x.quantile(0.25))` (25th percentile), `("q75", lambda x: x.quantile(0.75))` (75th percentile), `("iqr", lambda x: x.quantile(0.75) - x.quantile(0.25))` (interquartile range), `("range", lambda x: x.max() - x.min())` (range of values). | `'sum'` |
| `wide_format` | bool | Determines whether the output is in wide or long format. If `True`, each group is represented by a separate column; if `False`, each group is represented by a separate row. | `False` |
| `fillna` | int | The value used to fill missing data. Defaults to `0`; pass `np.nan` to keep missing values as NaN. | `0` |
| `engine` | str | The engine used to summarize the data, either `"pandas"` or `"polars"`. When `"polars"`, the function internally uses the polars library, which can be faster than `"pandas"` for large datasets. | `'pandas'` |
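Conceptually, the pandas engine's behavior is a resample-then-aggregate. The sketch below uses made-up data and plain pandas (it is not pytimetk's actual implementation) to illustrate what `freq` and `agg_func` control:

```python
import pandas as pd

# Hypothetical sales data: one row per order line
df = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2011-01-05", "2011-01-20", "2011-02-02", "2011-02-28"]
    ),
    "total_price": [100, 250, 75, 125],
})

# Roughly what summarize_by_time(date_column='order_date',
# value_column='total_price', freq='MS', agg_func=['mean', 'sum'])
# computes with the pandas engine
result = (
    df.set_index("order_date")
      .resample("MS")                       # month-start bins
      .agg({"total_price": ["mean", "sum"]})
)
result.columns = ["_".join(col) for col in result.columns]
result = result.reset_index()
# result has columns: order_date, total_price_mean, total_price_sum
```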
Returns

| Type | Description |
|------|-------------|
| pd.DataFrame | A pandas DataFrame that is summarized by time. |
Examples
```python
import pytimetk as tk
import pandas as pd

df = tk.load_dataset('bike_sales_sample', parse_dates=['order_date'])
df
```
| | order_id | order_line | order_date | quantity | price | total_price | model | category_1 | category_2 | frame_material | bikeshop_name | city | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2011-01-07 | 1 | 6070 | 6070 | Jekyll Carbon 2 | Mountain | Over Mountain | Carbon | Ithaca Mountain Climbers | Ithaca | NY |
| 1 | 1 | 2 | 2011-01-07 | 1 | 5970 | 5970 | Trigger Carbon 2 | Mountain | Over Mountain | Carbon | Ithaca Mountain Climbers | Ithaca | NY |
| 2 | 2 | 1 | 2011-01-10 | 1 | 2770 | 2770 | Beast of the East 1 | Mountain | Trail | Aluminum | Kansas City 29ers | Kansas City | KS |
| 3 | 2 | 2 | 2011-01-10 | 1 | 5970 | 5970 | Trigger Carbon 2 | Mountain | Over Mountain | Carbon | Kansas City 29ers | Kansas City | KS |
| 4 | 3 | 1 | 2011-01-10 | 1 | 10660 | 10660 | Supersix Evo Hi-Mod Team | Road | Elite Road | Carbon | Louisville Race Equipment | Louisville | KY |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2461 | 321 | 3 | 2011-12-22 | 1 | 1410 | 1410 | CAAD8 105 | Road | Elite Road | Aluminum | Miami Race Equipment | Miami | FL |
| 2462 | 322 | 1 | 2011-12-28 | 1 | 1250 | 1250 | Synapse Disc Tiagra | Road | Endurance Road | Aluminum | Phoenix Bi-peds | Phoenix | AZ |
| 2463 | 322 | 2 | 2011-12-28 | 1 | 2660 | 2660 | Bad Habit 2 | Mountain | Trail | Aluminum | Phoenix Bi-peds | Phoenix | AZ |
| 2464 | 322 | 3 | 2011-12-28 | 1 | 2340 | 2340 | F-Si 1 | Mountain | Cross Country Race | Aluminum | Phoenix Bi-peds | Phoenix | AZ |
| 2465 | 322 | 4 | 2011-12-28 | 1 | 5860 | 5860 | Synapse Hi-Mod Dura Ace | Road | Endurance Road | Carbon | Phoenix Bi-peds | Phoenix | AZ |

2466 rows × 13 columns
```python
# Example 1 - Summarize by time with a DataFrame object, pandas engine
(
    df
        .summarize_by_time(
            date_column  = 'order_date',
            value_column = 'total_price',
            freq         = "MS",
            agg_func     = ['mean', 'sum'],
            engine       = 'pandas'
        )
)
```
| | order_date | total_price_mean | total_price_sum |
|---|---|---|---|
| 0 | 2011-01-01 | 4600.142857 | 483015 |
| 1 | 2011-02-01 | 4611.408730 | 1162075 |
| 2 | 2011-03-01 | 5196.653543 | 659975 |
| 3 | 2011-04-01 | 4533.846154 | 1827140 |
| 4 | 2011-05-01 | 4097.912621 | 844170 |
| 5 | 2011-06-01 | 4544.839228 | 1413445 |
| 6 | 2011-07-01 | 4976.791667 | 1194430 |
| 7 | 2011-08-01 | 4961.970803 | 679790 |
| 8 | 2011-09-01 | 4682.298851 | 814720 |
| 9 | 2011-10-01 | 3930.053476 | 734920 |
| 10 | 2011-11-01 | 4768.175355 | 1006085 |
| 11 | 2011-12-01 | 4186.902655 | 473120 |
```python
# Example 2 - Summarize by time with a GroupBy object (Wide Format), polars engine
(
    df
        .groupby(['category_1', 'frame_material'])
        .summarize_by_time(
            date_column  = 'order_date',
            value_column = ['total_price', 'quantity'],
            freq         = 'MS',
            agg_func     = 'sum',
            wide_format  = True,
            engine       = 'polars'
        )
)
```
| | order_date | total_price_sum_Mountain_Aluminum | total_price_sum_Mountain_Carbon | total_price_sum_Road_Aluminum | total_price_sum_Road_Carbon | quantity_sum_Mountain_Aluminum | quantity_sum_Mountain_Carbon | quantity_sum_Road_Aluminum | quantity_sum_Road_Carbon |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 | 66290 | 155200 | 61005 | 200520 | 34 | 23 | 30 | 41 |
| 1 | 2011-02-01 | 245115 | 415440 | 100480 | 401040 | 118 | 68 | 52 | 93 |
| 2 | 2011-03-01 | 82025 | 276830 | 63390 | 237730 | 41 | 46 | 33 | 54 |
| 3 | 2011-04-01 | 340725 | 735250 | 197705 | 553460 | 164 | 130 | 104 | 144 |
| 4 | 2011-05-01 | 160130 | 290310 | 127600 | 266130 | 93 | 53 | 75 | 81 |
| 5 | 2011-06-01 | 183680 | 539360 | 174655 | 515750 | 96 | 91 | 82 | 142 |
| 6 | 2011-07-01 | 186030 | 581710 | 98090 | 328600 | 94 | 91 | 53 | 82 |
| 7 | 2011-08-01 | 119785 | 241470 | 65855 | 252680 | 53 | 34 | 36 | 69 |
| 8 | 2011-09-01 | 100455 | 300670 | 78485 | 335110 | 59 | 47 | 36 | 77 |
| 9 | 2011-10-01 | 105035 | 272300 | 83105 | 274480 | 61 | 43 | 42 | 71 |
| 10 | 2011-11-01 | 102045 | 447300 | 90050 | 366690 | 55 | 79 | 51 | 95 |
| 11 | 2011-12-01 | 111125 | 164930 | 45555 | 151510 | 55 | 27 | 27 | 43 |
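The wide layout above can be reproduced in plain pandas by aggregating in long format and then pivoting the group keys into columns. A sketch with hypothetical data (one way to mirror `wide_format=True`, not pytimetk's internal code):

```python
import pandas as pd

# Hypothetical mini dataset (not the bike_sales_sample data)
df = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2011-01-03", "2011-01-15", "2011-02-10", "2011-02-11"]
    ),
    "category_1": ["Mountain", "Road", "Mountain", "Road"],
    "total_price": [500, 300, 700, 400],
})

# Long format: monthly sums per group
long = (
    df.groupby(["category_1", pd.Grouper(key="order_date", freq="MS")])["total_price"]
      .sum()
      .reset_index()
)

# Pivot the group keys into columns, mirroring wide_format=True
wide = long.pivot(index="order_date", columns="category_1", values="total_price")
wide.columns = [f"total_price_{c}" for c in wide.columns]
wide = wide.reset_index()
# wide has columns: order_date, total_price_Mountain, total_price_Road
```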
```python
# Example 3 - Summarize by time with a GroupBy object (Wide Format)
(
    df
        .groupby('category_1')
        .summarize_by_time(
            date_column  = 'order_date',
            value_column = 'total_price',
            freq         = 'MS',
            agg_func     = 'sum',
            wide_format  = True,
            engine       = 'pandas'
        )
)
```
| | order_date | total_price_Mountain | total_price_Road |
|---|---|---|---|
| 0 | 2011-01-01 | 221490 | 261525 |
| 1 | 2011-02-01 | 660555 | 501520 |
| 2 | 2011-03-01 | 358855 | 301120 |
| 3 | 2011-04-01 | 1075975 | 751165 |
| 4 | 2011-05-01 | 450440 | 393730 |
| 5 | 2011-06-01 | 723040 | 690405 |
| 6 | 2011-07-01 | 767740 | 426690 |
| 7 | 2011-08-01 | 361255 | 318535 |
| 8 | 2011-09-01 | 401125 | 413595 |
| 9 | 2011-10-01 | 377335 | 357585 |
| 10 | 2011-11-01 | 549345 | 456740 |
| 11 | 2011-12-01 | 276055 | 197065 |
```python
# Example 4 - Summarize by time with a GroupBy object and multiple value columns
# and summaries (Long Format)
# Note - This example only works with the pandas engine
(
    df
        .groupby('category_1')
        .summarize_by_time(
            date_column  = 'order_date',
            value_column = ['total_price', 'quantity'],
            freq         = 'MS',
            agg_func     = [
                'sum',
                'mean',
                ('q25', lambda x: x.quantile(0.25)),
                ('q75', lambda x: x.quantile(0.75))
            ],
            wide_format  = False,
            engine       = 'pandas'
        )
)
```
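The `('name', lambda)` tuples above follow pandas' list-of-tuples aggregation convention, where each tuple supplies an output column label and the function to apply. A minimal pandas-only sketch with hypothetical data (illustrating the convention, not pytimetk itself):

```python
import pandas as pd

# Hypothetical grouped data
df = pd.DataFrame({
    "g": ["a", "a", "a", "b", "b"],
    "v": [10, 20, 30, 40, 50],
})

# Each ('label', func) tuple names its output column
out = df.groupby("g")["v"].agg([
    ("q25", lambda x: x.quantile(0.25)),
    ("q75", lambda x: x.quantile(0.75)),
])
# out has one row per group and columns q25, q75
```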