Calendar Features
Matt Dancho
2022-08-18
Source: vignettes/TK01_Working_With_Time_Series_Index.Rmd
This vignette covers making and working with Calendar Features, which are derived from a time series index, or the sequence of date/datetime stamps that accompany time series data.
Introduction
The time series index, which consists of a collection of time-based values that define when each observation occurred, is the most important part of a time series object.
The index gives the user a lot of information in a simple timestamp. Consider the datetime “2016-01-01 00:00:00”.
From this timestamp, we can decompose the date and time information to get the signature, which consists of the year, quarter, month, day, day of year, day of month, hour, minute, and second of the occurrence of a single observation. Further, the difference between two or more observations is the frequency from which we can obtain even more information such as the periodicity of the data and whether or not these observations are on a regular interval. This information is critical as it provides the basis for performance over time in finance, decay rates in biology, growth rates in economics, and so on.
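As a rough illustration of this decomposition, the sketch below uses only base R (not the timetk functions covered later) to pull the calendar components out of the example timestamp and to inspect the spacing between a few observations. The object names (`ts`, `signature`, `obs`, `gaps`, `is_regular`) and the three daily stamps are made up for this example.

```r
# Decompose a single timestamp into calendar components using base R only.
ts <- as.POSIXct("2016-01-01 00:00:00", tz = "UTC")
lt <- as.POSIXlt(ts)

signature <- list(
  year    = lt$year + 1900L,     # POSIXlt stores years since 1900
  quarter = lt$mon %/% 3L + 1L,  # POSIXlt months are 0-based
  month   = lt$mon + 1L,
  mday    = lt$mday,             # day of month
  yday    = lt$yday + 1L,        # POSIXlt day of year is 0-based
  hour    = lt$hour,
  minute  = lt$min,
  second  = lt$sec
)
str(signature)

# The spacing between observations hints at the frequency.
# These three daily stamps are illustrative, not taken from the data set.
obs  <- as.POSIXct(c("2016-01-01", "2016-01-02", "2016-01-03"), tz = "UTC")
gaps <- diff(as.numeric(obs))            # seconds between consecutive stamps
is_regular <- length(unique(gaps)) == 1  # TRUE when every gap is identical
```

Here every gap is 86,400 seconds, so this toy index is daily and regular.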
In this vignette the user will be exposed to:
 Time Series Index
 Time Series Signature
 Time Series Summary
Data
We’ll use the Facebook stock prices from the FANG data set from tidyquant. These are the historical stock prices (open, high, low, close, volume, and adjusted) for the “FB” stock from 2013 through 2016.
## # A tibble: 1,008 × 8
##    symbol date        open  high   low close    volume adjusted
##    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
##  1 FB     2013-01-02  27.4  28.2  27.4  28    69846400     28
##  2 FB     2013-01-03  27.9  28.5  27.6  27.8  63140600     27.8
##  3 FB     2013-01-04  28.0  28.9  27.8  28.8  72715400     28.8
##  4 FB     2013-01-07  28.7  29.8  28.6  29.4  83781800     29.4
##  5 FB     2013-01-08  29.5  29.6  28.9  29.1  45871300     29.1
##  6 FB     2013-01-09  29.7  30.6  29.5  30.6 104787700     30.6
##  7 FB     2013-01-10  30.6  31.5  30.3  31.3  95316400     31.3
##  8 FB     2013-01-11  31.3  32.0  31.1  31.7  89598000     31.7
##  9 FB     2013-01-14  32.1  32.2  30.6  31.0  98892800     31.0
## 10 FB     2013-01-15  30.6  31.7  29.9  30.1 173242600     30.1
## # … with 998 more rows
## # ℹ Use `print(n = ...)` to see more rows
To simplify the tutorial, we will select only the “date” and “volume” columns. For the FB_vol_date data frame, we can see from the “date” column that the observations are daily beginning on the second day of 2013.
## # A tibble: 1,008 × 2
##    date          volume
##    <date>         <dbl>
##  1 2013-01-02  69846400
##  2 2013-01-03  63140600
##  3 2013-01-04  72715400
##  4 2013-01-07  83781800
##  5 2013-01-08  45871300
##  6 2013-01-09 104787700
##  7 2013-01-10  95316400
##  8 2013-01-11  89598000
##  9 2013-01-14  98892800
## 10 2013-01-15 173242600
## # … with 998 more rows
## # ℹ Use `print(n = ...)` to see more rows
Time Series Index
Before we can analyze an index, we need to extract it from the object. The function tk_index() extracts the index from any time series object including data frame (or tbl), xts, zoo, etc. The index is always returned in the native date, datetime, yearmon, or yearqtr format. Note that the index must be in one of these time-based classes for extraction to work:
 datetimes: Must inherit POSIXt
 dates: Must inherit Date
 yearmon: Must inherit yearmon from the zoo package
 yearqtr: Must inherit yearqtr from the zoo package
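One way to check these class requirements before extracting an index is base R's `inherits()`. The sketch below is only an illustration: `is_time_based()` is a hypothetical helper written for this example, not a timetk function, and the three sample values are made up.

```r
# Hypothetical helper: TRUE when x carries one of the time-based classes
# listed above. This is not part of timetk.
is_time_based <- function(x) {
  inherits(x, c("POSIXt", "Date", "yearmon", "yearqtr"))
}

idx_as_date     <- as.Date("2013-01-02")
idx_as_datetime <- as.POSIXct("2013-01-02 09:30:00", tz = "UTC")
idx_as_char     <- "2013-01-02"   # a plain string is not time-based

is_time_based(idx_as_date)      # TRUE
is_time_based(idx_as_datetime)  # TRUE (POSIXct inherits POSIXt)
is_time_based(idx_as_char)      # FALSE: convert with as.Date() first
```

A character column that merely looks like dates fails the check, so it would need converting (for example with `as.Date()`) before the index can be extracted.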
Extract the index using tk_index(). The structure is shown to see the output format, which is a vector of dates.
## Date[1:1008], format: "2013-01-02" "2013-01-03" "2013-01-04" "2013-01-07" "2013-01-08" ...
Time Series Signature
The index can be decomposed into a signature. The time series signature is a unique set of properties of the time series values that describe the time series.
Get Functions - Turning an Index into Information
The function tk_get_timeseries_signature() can be used to convert the index to a tibble containing the following values (columns):
 index: The index value that was decomposed
 index.num: The numeric value of the index in seconds. The base is “1970-01-01 00:00:00” (Execute "1970-01-01 00:00:00" %>% ymd_hms() %>% as.numeric() to see the value returned is zero). Every time series value after this date can be converted to a numeric value in seconds.
 diff: The difference in seconds from the previous numeric index value.
 year: The year component of the index.
 year.iso: The ISO year number of the year (Monday start).
 half: The half component of the index.
 quarter: The quarter component of the index.
 month: The month component of the index with base 1.
 month.xts: The month component of the index with base 0, which is what xts implements.
 month.lbl: The month label as an ordered factor beginning with January and ending with December.
 day: The day component of the index.
 hour: The hour component of the index.
 minute: The minute component of the index.
 second: The second component of the index.
 hour12: The hour component on a 12 hour scale.
 am.pm: Morning (AM) = 1, Afternoon (PM) = 2.
 wday: The day of the week with base 1. Sunday = 1 and Saturday = 7.
 wday.xts: The day of the week with base 0, which is what xts implements. Sunday = 0 and Saturday = 6.
 wday.lbl: The day of the week label as an ordered factor beginning with Sunday and ending with Saturday.
 mday: The day of the month.
 qday: The day of the quarter.
 yday: The day of the year.
 mweek: The week of the month.
 week: The week number of the year (Sunday start).
 week.iso: The ISO week number of the year (Monday start).
 week2: The modulus for biweekly frequency.
 week3: The modulus for triweekly frequency.
 week4: The modulus for quadweekly frequency.
 mday7: The integer division of day of the month by seven, which returns the first, second, third, … instance the day has appeared in the month. Values begin at 1. For example, the first Saturday in the month has mday7 = 1. The second has mday7 = 2.
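To make a few of these definitions concrete, the following base R sketch rebuilds a handful of the columns by hand for the first five dates of the FB data. It is an approximation written from the descriptions above, not timetk's implementation, and it covers only a subset of the columns. The names `epoch_num`, `idx`, `lt`, and `sig` are chosen for this example.

```r
# The epoch "1970-01-01 00:00:00" (UTC) maps to zero seconds.
epoch_num <- as.numeric(as.POSIXct("1970-01-01 00:00:00", tz = "UTC"))

# First five dates from the FB data shown earlier.
idx <- as.Date(c("2013-01-02", "2013-01-03", "2013-01-04",
                 "2013-01-07", "2013-01-08"))
lt  <- as.POSIXlt(idx)

sig <- data.frame(
  index     = idx,
  index.num = as.numeric(idx) * 86400,  # days since epoch -> seconds
  year      = lt$year + 1900L,          # POSIXlt stores years since 1900
  month     = lt$mon + 1L,              # base 1
  month.xts = lt$mon,                   # base 0
  wday      = lt$wday + 1L,             # base 1, Sunday = 1
  wday.xts  = lt$wday,                  # base 0, Sunday = 0
  mday      = lt$mday,
  yday      = lt$yday + 1L
)

# Seconds since the previous observation; the first row has no predecessor.
sig$diff <- c(NA, diff(sig$index.num))
sig
```

The jump from Friday 2013-01-04 to Monday 2013-01-07 shows up as a `diff` of 259,200 seconds (three days), while the other gaps are 86,400 seconds.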
# idx_date signature
tk_get_timeseries_signature(idx_date)
## # A tibble: 1,008 × 29
##    index      index.num   diff  year year.…¹  half quarter month month…² month…³
##    <date>         <dbl>  <dbl> <int>   <int> <int>   <int> <int>   <int> <ord>
##  1 2013-01-02    1.36e9     NA  2013    2013     1       1     1       0 January
##  2 2013-01-03    1.36e9  86400  2013    2013     1       1     1       0 January
##  3 2013-01-04    1.36e9  86400  2013    2013     1       1     1       0 January
##  4 2013-01-07    1.36e9 259200  2013    2013     1       1     1       0 January
##  5 2013-01-08    1.36e9  86400  2013    2013     1       1     1       0 January
##  6 2013-01-09    1.36e9  86400  2013    2013     1       1     1       0 January
##  7 2013-01-10    1.36e9  86400  2013    2013     1       1     1       0 January
##  8 2013-01-11    1.36e9  86400  2013    2013     1       1     1       0 January
##  9 2013-01-14    1.36e9 259200  2013    2013     1       1     1       0 January
## 10 2013-01-15    1.36e9  86400  2013    2013     1       1     1       0 January
## # … with 998 more rows, 19 more variables: day <int>, hour <int>, minute <int>,
## # second <int>, hour12 <int>, am.pm <int>, wday <int>, wday.xts <int>,
## # wday.lbl <ord>, mday <int>, qday <int>, yday <int>, mweek <int>,
## # week <int>, week.iso <int>, week2 <int>, week3 <int>, week4 <int>,
## # mday7 <int>, and abbreviated variable names ¹year.iso, ²month.xts,
## # ³month.lbl
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Augment Functions (Adding Many Features to a Data Frame)
It’s usually important to keep the index signature with the values (e.g. volume in our example). We can use an expedited approach with tk_augment_timeseries_signature(), which adds the signature to the end of the time series object.
# Augmenting a data frame
FB_vol_date_signature <- FB_vol_date %>% tk_augment_timeseries_signature(.date_var = date)
FB_vol_date_signature
## # A tibble: 1,008 × 30
##    date          volume index…¹   diff  year year.…²  half quarter month month…³
##    <date>         <dbl>   <dbl>  <dbl> <int>   <int> <int>   <int> <int>   <int>
##  1 2013-01-02  69846400  1.36e9     NA  2013    2013     1       1     1       0
##  2 2013-01-03  63140600  1.36e9  86400  2013    2013     1       1     1       0
##  3 2013-01-04  72715400  1.36e9  86400  2013    2013     1       1     1       0
##  4 2013-01-07  83781800  1.36e9 259200  2013    2013     1       1     1       0
##  5 2013-01-08  45871300  1.36e9  86400  2013    2013     1       1     1       0
##  6 2013-01-09 104787700  1.36e9  86400  2013    2013     1       1     1       0
##  7 2013-01-10  95316400  1.36e9  86400  2013    2013     1       1     1       0
##  8 2013-01-11  89598000  1.36e9  86400  2013    2013     1       1     1       0
##  9 2013-01-14  98892800  1.36e9 259200  2013    2013     1       1     1       0
## 10 2013-01-15 173242600  1.36e9  86400  2013    2013     1       1     1       0
## # … with 998 more rows, 20 more variables: month.lbl <ord>, day <int>,
## # hour <int>, minute <int>, second <int>, hour12 <int>, am.pm <int>,
## # wday <int>, wday.xts <int>, wday.lbl <ord>, mday <int>, qday <int>,
## # yday <int>, mweek <int>, week <int>, week.iso <int>, week2 <int>,
## # week3 <int>, week4 <int>, mday7 <int>, and abbreviated variable names
## # ¹index.num, ²year.iso, ³month.xts
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Modeling is now much easier. As an example, we can fit a linear regression model with the lm() function, using the year and month as predictors of volume.
# Example Benefit 2: Modeling is easier
fit <- lm(volume ~ year + month.lbl, data = FB_vol_date_signature)
summary(fit)
##
## Call:
## lm(formula = volume ~ year + month.lbl, data = FB_vol_date_signature)
##
## Residuals:
## Min 1Q Median 3Q Max
## 51042223 13528407 4588594 8296073 304011277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.494e+10 1.414e+09 17.633 < 2e-16 ***
## year 1.236e+07 7.021e+05 17.604 < 2e-16 ***
## month.lbl.L 9.589e+06 2.740e+06 3.499 0.000488 ***
## month.lbl.Q 7.348e+06 2.725e+06 2.697 0.007122 **
## month.lbl.C 9.773e+06 2.711e+06 3.605 0.000328 ***
## month.lbl^4 2.885e+06 2.720e+06 1.060 0.289176
## month.lbl^5 2.994e+06 2.749e+06 1.089 0.276428
## month.lbl^6 3.169e+06 2.753e+06 1.151 0.249851
## month.lbl^7 6.000e+05 2.721e+06 0.221 0.825514
## month.lbl^8 8.281e+03 2.702e+06 0.003 0.997555
## month.lbl^9 9.504e+06 2.704e+06 3.515 0.000459 ***
## month.lbl^10 5.911e+06 2.701e+06 2.188 0.028888 *
## month.lbl^11 4.738e+06 2.696e+06 1.757 0.079181 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24910000 on 995 degrees of freedom
## Multiple R-squared: 0.2714, Adjusted R-squared: 0.2626
## F-statistic: 30.89 on 12 and 995 DF, p-value: < 2.2e-16
Time Series Summary
The next index analysis tool is the summary metrics, which can be retrieved using the tk_get_timeseries_summary() function. The summary reports the following attributes as a single-row tibble.
General Summary:
The first six columns are general summary information.
 n.obs: The total number of observations
 start: The start in the appropriate time class
 end: The end in the appropriate time class
 units: A label that describes the unit of the index value that is independent of frequency (i.e. a date class will always be “days” whereas a datetime class will always be “seconds”). Values can be days, hours, mins, secs.
 scale: A label that describes the median difference (frequency) between observations. Values can be quarter, month, day, hour, minute, second.
 tzone: The timezone of the index.
# idx_date: First six columns, general summary
tk_get_timeseries_summary(idx_date)[,1:6]
## # A tibble: 1 × 6
##   n.obs start      end        units scale tzone
##   <int> <date>     <date>     <chr> <chr> <chr>
## 1  1008 2013-01-02 2016-12-30 days  day   UTC
Differences Summary:
The next group of values are the differences summary (i.e. summary of frequency). All values are in seconds:
 diff.minimum: The minimum difference between index values.
 diff.q1: The first quartile of the index differences.
 diff.median: The median difference between index values (i.e. most common frequency).
 diff.mean: The average difference between index values.
 diff.q3: The third quartile of the index differences.
 diff.maximum: The maximum difference between index values.
# idx_date: Last six columns, difference summary
tk_get_timeseries_summary(idx_date)[,7:12]
## # A tibble: 1 × 6
## diff.minimum diff.q1 diff.median diff.mean diff.q3 diff.maximum
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 86400 86400 86400 125096. 86400 345600
The differences provide information about the regularity of the frequency. Generally speaking, if all difference values are equal, the index is regular. Scales beyond “day”, however, are never strictly regular, since months, quarters, and years do not all contain the same number of seconds. Conceptually, monthly, quarterly, and yearly data can still be thought of as regular if the index contains consecutive months, quarters, or years, respectively. The difference attributes are therefore most meaningful for daily and lower time scales, where the difference summary directly indicates the level of regularity.
From the second group (differences summary), we immediately recognize that the mean differs from the median, and therefore the index is irregular (meaning certain days are missing). Further, we can see that the maximum difference is 345,600 seconds, which is 4 days (345,600 seconds / 86,400 seconds per day).
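The same reasoning can be reproduced by hand. The base R sketch below computes a differences summary for just the first ten dates of the FB data, so its numbers differ from the full-data output above. It is not how tk_get_timeseries_summary() is implemented, and the names `idx`, `gaps_sec`, `diff_summary`, `is_regular`, and `max_gap_days` are chosen for this example.

```r
# First ten dates from the FB data shown earlier.
idx <- as.Date(c("2013-01-02", "2013-01-03", "2013-01-04", "2013-01-07",
                 "2013-01-08", "2013-01-09", "2013-01-10", "2013-01-11",
                 "2013-01-14", "2013-01-15"))

# Gaps between consecutive observations, converted from days to seconds.
gaps_sec <- diff(as.numeric(idx)) * 86400

diff_summary <- c(
  diff.minimum = min(gaps_sec),
  diff.q1      = unname(quantile(gaps_sec, 0.25)),
  diff.median  = median(gaps_sec),
  diff.mean    = mean(gaps_sec),
  diff.q3      = unname(quantile(gaps_sec, 0.75)),
  diff.maximum = max(gaps_sec)
)
diff_summary

# A regular index has a single distinct gap; weekends break that here.
is_regular <- length(unique(gaps_sec)) == 1

# Convert the largest gap back to days, as done in the text for 345,600 s.
max_gap_days <- max(gaps_sec) / 86400
345600 / 86400   # the full-data maximum: 4 days
```

In this ten-row subset the mean gap (124,800 seconds) exceeds the median (86,400 seconds) because of the two weekend gaps, which is the same signal of irregularity described above.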
Learning More
My Talk on High-Performance Time Series Forecasting
Time series is changing. Businesses now need 10,000+ time series forecasts every day.
High-Performance Forecasting Systems will save companies MILLIONS of dollars. Imagine what will happen to your career if you can provide your organization a “High-Performance Time Series Forecasting System” (HPTSF System).
I teach how to build an HPTSF System in my High-Performance Time Series Forecasting Course. If you are interested in learning Scalable High-Performance Forecasting Strategies, then take my course. You will learn:
 Time Series Machine Learning (cutting-edge) with Modeltime - 30+ Models (Prophet, ARIMA, XGBoost, Random Forest, & many more)
 NEW - Deep Learning with GluonTS (Competition Winners)
 Time Series Preprocessing, Noise Reduction, & Anomaly Detection
 Feature engineering using lagged variables & external regressors
 Hyperparameter Tuning
 Time series cross-validation
 Ensembling Multiple Machine Learning & Univariate Modeling Techniques (Competition Winner)
 Scalable Forecasting - Forecast 1000+ time series in parallel
 and more.