pad_by_time

pad_by_time(data, date_column, freq='D', start_date=None, end_date=None)

Make irregular time series regular by padding with missing dates.

The pad_by_time function inserts missing dates into a Pandas DataFrame or DataFrameGroupBy object, through the process making an irregularly spaced time series regularly spaced.

Parameters

Name	Type	Description	Default
`data`	pd.DataFrame or pd.core.groupby.generic.DataFrameGroupBy	The `data` parameter can be either a Pandas DataFrame or a Pandas DataFrameGroupBy object. It represents the data that you want to pad with missing dates.	required
`date_column`	str	The `date_column` parameter is a string that specifies the name of the column in the DataFrame that contains the dates. This column will be used to determine the minimum and maximum dates in theDataFrame, and to generate the regular date range for padding.	required
`freq`	str	The `freq` parameter specifies the frequency at which the missing timestamps should be generated. It accepts a string representing a pandas frequency alias. Some common frequency aliases include: - S: secondly frequency - min: minute frequency - H: hourly frequency - B: business day frequency - D: daily frequency - W: weekly frequency - M: month end frequency - MS: month start frequency - BMS: Business month start - Q: quarter end frequency - QS: quarter start frequency - Y: year end frequency - YS: year start frequency	`'D'`
`start_date`	str	Specifies the start of the padded series. If NULL, it will use the lowest value of the input variable. In the case of groups, it will use the lowest value by group.	`None`
`end_date`	str	Specifies the end of the padded series. If NULL, it will use the highest value of the input variable. In the case of groups, it will use the highest value by group.	`None`

Returns

Type	Description
pd.DataFrame	The function `pad_by_time` returns a Pandas DataFrame that has been extended with future dates.

Notes

Performance

This function uses a number of techniques to speed up computation for large datasets with many time series groups.

We use a vectorized approach to generate the Cartesian product of all unique group values and all dates in the date range.
We then merge this Cartesian product with the original data to introduce NaN values for missing rows. This approach is much faster than looping through each group and applying a function to each group.

Note: There is no parallel processing since the vectorized approach is almost always faster.

Examples

import pandas as pd
import pytimetk as tk

df = tk.load_dataset('stocks_daily', parse_dates = ['date'])
df

	symbol	date	open	high	low	close	volume	adjusted
0	META	2013-01-02	27.440001	28.180000	27.420000	28.000000	69846400	28.000000
1	META	2013-01-03	27.879999	28.469999	27.590000	27.770000	63140600	27.770000
2	META	2013-01-04	28.010000	28.930000	27.830000	28.760000	72715400	28.760000
3	META	2013-01-07	28.690001	29.790001	28.650000	29.420000	83781800	29.420000
4	META	2013-01-08	29.510000	29.600000	28.860001	29.059999	45871300	29.059999
...	...	...	...	...	...	...	...	...
16189	GOOG	2023-09-15	138.800003	139.360001	137.179993	138.300003	48947600	138.300003
16190	GOOG	2023-09-18	137.630005	139.929993	137.630005	138.960007	16233600	138.960007
16191	GOOG	2023-09-19	138.250000	139.175003	137.500000	138.830002	15479100	138.830002
16192	GOOG	2023-09-20	138.830002	138.839996	134.520004	134.589996	21473500	134.589996
16193	GOOG	2023-09-21	132.389999	133.190002	131.089996	131.360001	22042700	131.360001

16194 rows × 8 columns

# Pad Single Time Series: Fill missing dates
padded_df = (
    df
        .query('symbol == "AAPL"')
        .pad_by_time(
            date_column = 'date',
            freq        = 'D'
        )
)
padded_df

	date	symbol	open	high	low	close	volume	adjusted
0	2013-01-02	AAPL	19.779285	19.821428	19.343929	19.608213	560518000.0	16.791180
1	2013-01-03	AAPL	19.567142	19.631071	19.321428	19.360714	352965200.0	16.579241
2	2013-01-04	AAPL	19.177500	19.236786	18.779642	18.821428	594333600.0	16.117437
3	2013-01-05	AAPL	NaN	NaN	NaN	NaN	NaN	NaN
4	2013-01-06	AAPL	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...
3910	2023-09-17	AAPL	NaN	NaN	NaN	NaN	NaN	NaN
3911	2023-09-18	AAPL	176.479996	179.380005	176.169998	177.970001	67257600.0	177.970001
3912	2023-09-19	AAPL	177.520004	179.630005	177.130005	179.070007	51826900.0	179.070007
3913	2023-09-20	AAPL	179.259995	179.699997	175.399994	175.490005	58436200.0	175.490005
3914	2023-09-21	AAPL	174.550003	176.300003	173.860001	173.929993	63047900.0	173.929993

3915 rows × 8 columns

# Pad by Group: Pad each group with missing dates
padded_df = (
    df
        .groupby('symbol')
        .pad_by_time(
            date_column = 'date',
            freq        = 'D'
        )
)
padded_df

	symbol	date	open	high	low	close	volume	adjusted
0	AAPL	2013-01-02	19.779285	19.821428	19.343929	19.608213	560518000.0	16.791180
1	AAPL	2013-01-03	19.567142	19.631071	19.321428	19.360714	352965200.0	16.579241
2	AAPL	2013-01-04	19.177500	19.236786	18.779642	18.821428	594333600.0	16.117437
3	AAPL	2013-01-05	NaN	NaN	NaN	NaN	NaN	NaN
4	AAPL	2013-01-06	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...
23485	NVDA	2023-09-17	NaN	NaN	NaN	NaN	NaN	NaN
23486	NVDA	2023-09-18	427.480011	442.420013	420.000000	439.660004	50027100.0	439.660004
23487	NVDA	2023-09-19	438.329987	439.660004	430.019989	435.200012	37306400.0	435.200012
23488	NVDA	2023-09-20	436.000000	439.029999	422.230011	422.390015	36710800.0	422.390015
23489	NVDA	2023-09-21	415.829987	421.000000	409.799988	410.170013	44893000.0	410.170013

23490 rows × 8 columns

# Pad with end dates specified
padded_df = (
    df
        .groupby('symbol')
        .pad_by_time(
            date_column = 'date',
            freq        = 'D',
            start_date  = '2013-01-01',
            end_date    = '2023-09-22'
        )
)
padded_df.query('symbol == "AAPL"')

	symbol	date	open	high	low	close	volume	adjusted
0	AAPL	2013-01-01	NaN	NaN	NaN	NaN	NaN	NaN
1	AAPL	2013-01-02	19.779285	19.821428	19.343929	19.608213	560518000.0	16.791180
2	AAPL	2013-01-03	19.567142	19.631071	19.321428	19.360714	352965200.0	16.579241
3	AAPL	2013-01-04	19.177500	19.236786	18.779642	18.821428	594333600.0	16.117437
4	AAPL	2013-01-05	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...
3912	AAPL	2023-09-18	176.479996	179.380005	176.169998	177.970001	67257600.0	177.970001
3913	AAPL	2023-09-19	177.520004	179.630005	177.130005	179.070007	51826900.0	179.070007
3914	AAPL	2023-09-20	179.259995	179.699997	175.399994	175.490005	58436200.0	175.490005
3915	AAPL	2023-09-21	174.550003	176.300003	173.860001	173.929993	63047900.0	173.929993
3916	AAPL	2023-09-22	NaN	NaN	NaN	NaN	NaN	NaN

3917 rows × 8 columns