pad_by_time

pad_by_time(data, date_column, freq='D', start_date=None, end_date=None)

Make irregular time series regular by padding with missing dates.

The pad_by_time function inserts missing dates into a Pandas DataFrame or DataFrameGroupBy object, through the process making an irregularly spaced time series regularly spaced.

Parameters

Name Type Description Default
data pd.DataFrame or pd.core.groupby.generic.DataFrameGroupBy The data parameter can be either a Pandas DataFrame or a Pandas DataFrameGroupBy object. It represents the data that you want to pad with missing dates. required
date_column str The date_column parameter is a string that specifies the name of the column in the DataFrame that contains the dates. This column will be used to determine the minimum and maximum dates in theDataFrame, and to generate the regular date range for padding. required
freq str The freq parameter specifies the frequency at which the missing timestamps should be generated. It accepts a string representing a pandas frequency alias. Some common frequency aliases include: - S: secondly frequency - min: minute frequency - H: hourly frequency - B: business day frequency - D: daily frequency - W: weekly frequency - M: month end frequency - MS: month start frequency - BMS: Business month start - Q: quarter end frequency - QS: quarter start frequency - Y: year end frequency - YS: year start frequency 'D'
start_date str Specifies the start of the padded series. If NULL, it will use the lowest value of the input variable. In the case of groups, it will use the lowest value by group. None
end_date str Specifies the end of the padded series. If NULL, it will use the highest value of the input variable. In the case of groups, it will use the highest value by group. None

Returns

Type Description
pd.DataFrame The function pad_by_time returns a Pandas DataFrame that has been extended with future dates.

Notes

Performance

This function uses a number of techniques to speed up computation for large datasets with many time series groups.

  • We use a vectorized approach to generate the Cartesian product of all unique group values and all dates in the date range.
  • We then merge this Cartesian product with the original data to introduce NaN values for missing rows. This approach is much faster than looping through each group and applying a function to each group.

Note: There is no parallel processing since the vectorized approach is almost always faster.

Examples

import pandas as pd
import pytimetk as tk

df = tk.load_dataset('stocks_daily', parse_dates = ['date'])
df
symbol date open high low close volume adjusted
0 META 2013-01-02 27.440001 28.180000 27.420000 28.000000 69846400 28.000000
1 META 2013-01-03 27.879999 28.469999 27.590000 27.770000 63140600 27.770000
2 META 2013-01-04 28.010000 28.930000 27.830000 28.760000 72715400 28.760000
3 META 2013-01-07 28.690001 29.790001 28.650000 29.420000 83781800 29.420000
4 META 2013-01-08 29.510000 29.600000 28.860001 29.059999 45871300 29.059999
... ... ... ... ... ... ... ... ...
16189 GOOG 2023-09-15 138.800003 139.360001 137.179993 138.300003 48947600 138.300003
16190 GOOG 2023-09-18 137.630005 139.929993 137.630005 138.960007 16233600 138.960007
16191 GOOG 2023-09-19 138.250000 139.175003 137.500000 138.830002 15479100 138.830002
16192 GOOG 2023-09-20 138.830002 138.839996 134.520004 134.589996 21473500 134.589996
16193 GOOG 2023-09-21 132.389999 133.190002 131.089996 131.360001 22042700 131.360001

16194 rows Γ— 8 columns

# Pad Single Time Series: Fill missing dates
padded_df = (
    df
        .query('symbol == "AAPL"')
        .pad_by_time(
            date_column = 'date',
            freq        = 'D'
        )
)
padded_df 
date symbol open high low close volume adjusted
0 2013-01-02 AAPL 19.779285 19.821428 19.343929 19.608213 560518000.0 16.791180
1 2013-01-03 AAPL 19.567142 19.631071 19.321428 19.360714 352965200.0 16.579241
2 2013-01-04 AAPL 19.177500 19.236786 18.779642 18.821428 594333600.0 16.117437
3 2013-01-05 AAPL NaN NaN NaN NaN NaN NaN
4 2013-01-06 AAPL NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ...
3910 2023-09-17 AAPL NaN NaN NaN NaN NaN NaN
3911 2023-09-18 AAPL 176.479996 179.380005 176.169998 177.970001 67257600.0 177.970001
3912 2023-09-19 AAPL 177.520004 179.630005 177.130005 179.070007 51826900.0 179.070007
3913 2023-09-20 AAPL 179.259995 179.699997 175.399994 175.490005 58436200.0 175.490005
3914 2023-09-21 AAPL 174.550003 176.300003 173.860001 173.929993 63047900.0 173.929993

3915 rows Γ— 8 columns

# Pad by Group: Pad each group with missing dates
padded_df = (
    df
        .groupby('symbol')
        .pad_by_time(
            date_column = 'date',
            freq        = 'D'
        )
)
padded_df
symbol date open high low close volume adjusted
0 AAPL 2013-01-02 19.779285 19.821428 19.343929 19.608213 560518000.0 16.791180
1 AAPL 2013-01-03 19.567142 19.631071 19.321428 19.360714 352965200.0 16.579241
2 AAPL 2013-01-04 19.177500 19.236786 18.779642 18.821428 594333600.0 16.117437
3 AAPL 2013-01-05 NaN NaN NaN NaN NaN NaN
4 AAPL 2013-01-06 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ...
23485 NVDA 2023-09-17 NaN NaN NaN NaN NaN NaN
23486 NVDA 2023-09-18 427.480011 442.420013 420.000000 439.660004 50027100.0 439.660004
23487 NVDA 2023-09-19 438.329987 439.660004 430.019989 435.200012 37306400.0 435.200012
23488 NVDA 2023-09-20 436.000000 439.029999 422.230011 422.390015 36710800.0 422.390015
23489 NVDA 2023-09-21 415.829987 421.000000 409.799988 410.170013 44893000.0 410.170013

23490 rows Γ— 8 columns

# Pad with end dates specified
padded_df = (
    df
        .groupby('symbol')
        .pad_by_time(
            date_column = 'date',
            freq        = 'D',
            start_date  = '2013-01-01',
            end_date    = '2023-09-22'
        )
)
padded_df.query('symbol == "AAPL"')
symbol date open high low close volume adjusted
0 AAPL 2013-01-01 NaN NaN NaN NaN NaN NaN
1 AAPL 2013-01-02 19.779285 19.821428 19.343929 19.608213 560518000.0 16.791180
2 AAPL 2013-01-03 19.567142 19.631071 19.321428 19.360714 352965200.0 16.579241
3 AAPL 2013-01-04 19.177500 19.236786 18.779642 18.821428 594333600.0 16.117437
4 AAPL 2013-01-05 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ...
3912 AAPL 2023-09-18 176.479996 179.380005 176.169998 177.970001 67257600.0 177.970001
3913 AAPL 2023-09-19 177.520004 179.630005 177.130005 179.070007 51826900.0 179.070007
3914 AAPL 2023-09-20 179.259995 179.699997 175.399994 175.490005 58436200.0 175.490005
3915 AAPL 2023-09-21 174.550003 176.300003 173.860001 173.929993 63047900.0 173.929993
3916 AAPL 2023-09-22 NaN NaN NaN NaN NaN NaN

3917 rows Γ— 8 columns