Make irregular time series regular by padding with missing dates.
The pad_by_time function inserts missing dates into a Pandas DataFrame or DataFrameGroupBy object, through the process making an irregularly spaced time series regularly spaced.
Parameters
Name
Type
Description
Default
data
pd.DataFrame or pd.core.groupby.generic.DataFrameGroupBy
The data parameter can be either a Pandas DataFrame or a Pandas DataFrameGroupBy object. It represents the data that you want to pad with missing dates.
required
date_column
str
The date_column parameter is a string that specifies the name of the column in the DataFrame that contains the dates. This column will be used to determine the minimum and maximum dates in theDataFrame, and to generate the regular date range for padding.
required
freq
str
The freq parameter specifies the frequency at which the missing timestamps should be generated. It accepts a string representing a pandas frequency alias. Some common frequency aliases include: - S: secondly frequency - min: minute frequency - H: hourly frequency - B: business day frequency - D: daily frequency - W: weekly frequency - M: month end frequency - MS: month start frequency - BMS: Business month start - Q: quarter end frequency - QS: quarter start frequency - Y: year end frequency - YS: year start frequency
'D'
start_date
str
Specifies the start of the padded series. If NULL, it will use the lowest value of the input variable. In the case of groups, it will use the lowest value by group.
None
end_date
str
Specifies the end of the padded series. If NULL, it will use the highest value of the input variable. In the case of groups, it will use the highest value by group.
None
Returns
Type
Description
pd.DataFrame
The function pad_by_time returns a Pandas DataFrame that has been extended with future dates.
Notes
Performance
This function uses a number of techniques to speed up computation for large datasets with many time series groups.
We use a vectorized approach to generate the Cartesian product of all unique group values and all dates in the date range.
We then merge this Cartesian product with the original data to introduce NaN values for missing rows. This approach is much faster than looping through each group and applying a function to each group.
Note: There is no parallel processing since the vectorized approach is almost always faster.
Examples
import pandas as pdimport pytimetk as tkdf = tk.load_dataset('stocks_daily', parse_dates = ['date'])df
symbol
date
open
high
low
close
volume
adjusted
0
META
2013-01-02
27.440001
28.180000
27.420000
28.000000
69846400
28.000000
1
META
2013-01-03
27.879999
28.469999
27.590000
27.770000
63140600
27.770000
2
META
2013-01-04
28.010000
28.930000
27.830000
28.760000
72715400
28.760000
3
META
2013-01-07
28.690001
29.790001
28.650000
29.420000
83781800
29.420000
4
META
2013-01-08
29.510000
29.600000
28.860001
29.059999
45871300
29.059999
...
...
...
...
...
...
...
...
...
16189
GOOG
2023-09-15
138.800003
139.360001
137.179993
138.300003
48947600
138.300003
16190
GOOG
2023-09-18
137.630005
139.929993
137.630005
138.960007
16233600
138.960007
16191
GOOG
2023-09-19
138.250000
139.175003
137.500000
138.830002
15479100
138.830002
16192
GOOG
2023-09-20
138.830002
138.839996
134.520004
134.589996
21473500
134.589996
16193
GOOG
2023-09-21
132.389999
133.190002
131.089996
131.360001
22042700
131.360001
16194 rows Γ 8 columns
# Pad Single Time Series: Fill missing datespadded_df = ( df .query('symbol == "AAPL"') .pad_by_time( date_column ='date', freq ='D' ))padded_df
date
symbol
open
high
low
close
volume
adjusted
0
2013-01-02
AAPL
19.779285
19.821428
19.343929
19.608213
560518000.0
16.791180
1
2013-01-03
AAPL
19.567142
19.631071
19.321428
19.360714
352965200.0
16.579241
2
2013-01-04
AAPL
19.177500
19.236786
18.779642
18.821428
594333600.0
16.117437
3
2013-01-05
AAPL
NaN
NaN
NaN
NaN
NaN
NaN
4
2013-01-06
AAPL
NaN
NaN
NaN
NaN
NaN
NaN
...
...
...
...
...
...
...
...
...
3910
2023-09-17
AAPL
NaN
NaN
NaN
NaN
NaN
NaN
3911
2023-09-18
AAPL
176.479996
179.380005
176.169998
177.970001
67257600.0
177.970001
3912
2023-09-19
AAPL
177.520004
179.630005
177.130005
179.070007
51826900.0
179.070007
3913
2023-09-20
AAPL
179.259995
179.699997
175.399994
175.490005
58436200.0
175.490005
3914
2023-09-21
AAPL
174.550003
176.300003
173.860001
173.929993
63047900.0
173.929993
3915 rows Γ 8 columns
# Pad by Group: Pad each group with missing datespadded_df = ( df .groupby('symbol') .pad_by_time( date_column ='date', freq ='D' ))padded_df
symbol
date
open
high
low
close
volume
adjusted
0
AAPL
2013-01-02
19.779285
19.821428
19.343929
19.608213
560518000.0
16.791180
1
AAPL
2013-01-03
19.567142
19.631071
19.321428
19.360714
352965200.0
16.579241
2
AAPL
2013-01-04
19.177500
19.236786
18.779642
18.821428
594333600.0
16.117437
3
AAPL
2013-01-05
NaN
NaN
NaN
NaN
NaN
NaN
4
AAPL
2013-01-06
NaN
NaN
NaN
NaN
NaN
NaN
...
...
...
...
...
...
...
...
...
23485
NVDA
2023-09-17
NaN
NaN
NaN
NaN
NaN
NaN
23486
NVDA
2023-09-18
427.480011
442.420013
420.000000
439.660004
50027100.0
439.660004
23487
NVDA
2023-09-19
438.329987
439.660004
430.019989
435.200012
37306400.0
435.200012
23488
NVDA
2023-09-20
436.000000
439.029999
422.230011
422.390015
36710800.0
422.390015
23489
NVDA
2023-09-21
415.829987
421.000000
409.799988
410.170013
44893000.0
410.170013
23490 rows Γ 8 columns
# Pad with end dates specifiedpadded_df = ( df .groupby('symbol') .pad_by_time( date_column ='date', freq ='D', start_date ='2013-01-01', end_date ='2023-09-22' ))padded_df.query('symbol == "AAPL"')