
Clustering is an important part of time series analysis that allows us to organize time series into groups by combining “tsfeatures” (summary matrices) with unsupervised techniques such as K-Means Clustering. In this short tutorial, we will cover the tk_tsfeatures() function, which computes a time series feature matrix of summarized information on one or more time series.

Libraries

To get started, load the following libraries.
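
The code in this tutorial calls functions from dplyr (group_by(), select(), bind_cols(), right_join()), purrr (pluck()), and timetk (tk_tsfeatures(), plot_time_series()), so a setup along these lines should be enough. This is inferred from the functions used below; the features themselves come from the tsfeatures package, which may need to be installed separately.

```r
# Data wrangling verbs and the %>% pipe
library(dplyr)
# pluck() for extracting the cluster vector from the kmeans() result
library(purrr)
# tk_tsfeatures(), plot_time_series(), and the walmart_sales_weekly dataset
library(timetk)
```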

Data

This tutorial will use the walmart_sales_weekly dataset:

  • Weekly
  • Sales spikes at various events
walmart_sales_weekly
## # A tibble: 1,001 × 17
##    id    Store  Dept Date       Weekly_Sa…¹ IsHol…² Type    Size Tempe…³ Fuel_…⁴
##    <fct> <dbl> <dbl> <date>           <dbl> <lgl>   <chr>  <dbl>   <dbl>   <dbl>
##  1 1_1       1     1 2010-02-05      24924. FALSE   A     151315    42.3    2.57
##  2 1_1       1     1 2010-02-12      46039. TRUE    A     151315    38.5    2.55
##  3 1_1       1     1 2010-02-19      41596. FALSE   A     151315    39.9    2.51
##  4 1_1       1     1 2010-02-26      19404. FALSE   A     151315    46.6    2.56
##  5 1_1       1     1 2010-03-05      21828. FALSE   A     151315    46.5    2.62
##  6 1_1       1     1 2010-03-12      21043. FALSE   A     151315    57.8    2.67
##  7 1_1       1     1 2010-03-19      22137. FALSE   A     151315    54.6    2.72
##  8 1_1       1     1 2010-03-26      26229. FALSE   A     151315    51.4    2.73
##  9 1_1       1     1 2010-04-02      57258. FALSE   A     151315    62.3    2.72
## 10 1_1       1     1 2010-04-09      42961. FALSE   A     151315    65.9    2.77
## # … with 991 more rows, 7 more variables: MarkDown1 <dbl>, MarkDown2 <dbl>,
## #   MarkDown3 <dbl>, MarkDown4 <dbl>, MarkDown5 <dbl>, CPI <dbl>,
## #   Unemployment <dbl>, and abbreviated variable names ¹​Weekly_Sales,
## #   ²​IsHoliday, ³​Temperature, ⁴​Fuel_Price

TS Features

Using the tk_tsfeatures() function, we can quickly get the “tsfeatures” for each of the time series. A few important points:

  • The .features argument takes function names from the tsfeatures R package, e.g. “lumpiness” or “stl_features”.

  • We can supply any function that returns an aggregation (e.g. “mean” will apply the base::mean() function).

  • You can supply a custom function by defining it and passing its name (e.g. my_mean(), defined below).

# Custom Function
my_mean <- function(x, na.rm=TRUE) {
  mean(x, na.rm = na.rm)
}

tsfeature_tbl <- walmart_sales_weekly %>%
    group_by(id) %>%
    tk_tsfeatures(
      .date_var = Date,
      .value    = Weekly_Sales,
      .period   = 52,
      .features = c("frequency", "stl_features", "entropy", "acf_features", "my_mean"),
      .scale    = TRUE,
      .prefix   = "ts_"
    ) %>%
    ungroup()

tsfeature_tbl
## # A tibble: 7 × 22
##   id    ts_fre…¹ ts_np…² ts_se…³ ts_tr…⁴ ts_sp…⁵ ts_li…⁶ ts_cu…⁷ ts_e_…⁸ ts_e_…⁹
##   <fct>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1_1         52       1      52 6.70e-4 2.80e-5 -0.0581   0.112  0.349   0.334 
## 2 1_3         52       1      52 6.14e-2 9.87e-6  0.511    0.496  0.0581  0.0660
## 3 1_8         52       1      52 7.56e-1 1.95e-6  6.41     3.67   0.330   0.358 
## 4 1_13        52       1      52 3.54e-1 4.75e-6  2.74     2.25   0.192   0.321 
## 5 1_38        52       1      52 4.25e-1 1.79e-5 -4.07     2.82   0.0459  0.152 
## 6 1_93        52       1      52 7.91e-1 7.54e-7  6.22    -0.684 -0.0248  0.363 
## 7 1_95        52       1      52 6.39e-1 5.67e-7  3.94    -0.377  0.0247  0.161 
## # … with 12 more variables: ts_seasonal_strength <dbl>, ts_peak <dbl>,
## #   ts_trough <dbl>, ts_entropy <dbl>, ts_x_acf1 <dbl>, ts_x_acf10 <dbl>,
## #   ts_diff1_acf1 <dbl>, ts_diff1_acf10 <dbl>, ts_diff2_acf1 <dbl>,
## #   ts_diff2_acf10 <dbl>, ts_seas_acf1 <dbl>, ts_my_mean <dbl>, and abbreviated
## #   variable names ¹​ts_frequency, ²​ts_nperiods, ³​ts_seasonal_period, ⁴​ts_trend,
## #   ⁵​ts_spike, ⁶​ts_linearity, ⁷​ts_curvature, ⁸​ts_e_acf1, ⁹​ts_e_acf10
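
To make the idea of a feature matrix concrete, here is a small base-R sketch of what happens conceptually: every series is collapsed into one row of summary values, one column per feature function. The toy data and the my_range() helper are made up for illustration, and this is not how tk_tsfeatures() is implemented internally.

```r
# Toy long-format data: two series with made-up values
toy <- data.frame(
  id    = rep(c("a", "b"), each = 4),
  value = c(1, 2, 3, 4, 10, 10, 12, 8)
)

# Feature functions: each takes a numeric vector and returns a single number
my_mean  <- function(x, na.rm = TRUE) mean(x, na.rm = na.rm)
my_range <- function(x, na.rm = TRUE) diff(range(x, na.rm = na.rm))  # hypothetical helper

# Apply every feature function to every series, giving one row per series
feature_matrix <- t(sapply(
  split(toy$value, toy$id),
  function(x) c(my_mean = my_mean(x), my_range = my_range(x))
))

feature_matrix
```

The result has one row per id and one column per feature, which is the same shape as tsfeature_tbl above, only with far simpler features.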

Clustering with K-Means

We can quickly add cluster assignments with the kmeans() function and some tidyverse data wrangling.

set.seed(123)


cluster_tbl <- tibble(
    cluster = tsfeature_tbl %>% 
        select(-id) %>%
        as.matrix() %>%
        kmeans(centers = 3, nstart = 100) %>%
        pluck("cluster")
) %>%
    bind_cols(
        tsfeature_tbl
    )

cluster_tbl
## # A tibble: 7 × 23
##   cluster id    ts_fre…¹ ts_np…² ts_se…³ ts_tr…⁴ ts_sp…⁵ ts_li…⁶ ts_cu…⁷ ts_e_…⁸
##     <int> <fct>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1       2 1_1         52       1      52 6.70e-4 2.80e-5 -0.0581   0.112  0.349 
## 2       2 1_3         52       1      52 6.14e-2 9.87e-6  0.511    0.496  0.0581
## 3       2 1_8         52       1      52 7.56e-1 1.95e-6  6.41     3.67   0.330 
## 4       1 1_13        52       1      52 3.54e-1 4.75e-6  2.74     2.25   0.192 
## 5       3 1_38        52       1      52 4.25e-1 1.79e-5 -4.07     2.82   0.0459
## 6       3 1_93        52       1      52 7.91e-1 7.54e-7  6.22    -0.684 -0.0248
## 7       1 1_95        52       1      52 6.39e-1 5.67e-7  3.94    -0.377  0.0247
## # … with 13 more variables: ts_e_acf10 <dbl>, ts_seasonal_strength <dbl>,
## #   ts_peak <dbl>, ts_trough <dbl>, ts_entropy <dbl>, ts_x_acf1 <dbl>,
## #   ts_x_acf10 <dbl>, ts_diff1_acf1 <dbl>, ts_diff1_acf10 <dbl>,
## #   ts_diff2_acf1 <dbl>, ts_diff2_acf10 <dbl>, ts_seas_acf1 <dbl>,
## #   ts_my_mean <dbl>, and abbreviated variable names ¹​ts_frequency,
## #   ²​ts_nperiods, ³​ts_seasonal_period, ⁴​ts_trend, ⁵​ts_spike, ⁶​ts_linearity,
## #   ⁷​ts_curvature, ⁸​ts_e_acf1
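
Because k-means groups rows by Euclidean distance, features on larger numeric scales can dominate the result, and columns that are identical for every series (such as ts_frequency in the output above) contribute nothing. As far as the tsfeatures documentation describes it, .scale = TRUE standardizes each series before the features are computed, not the feature columns themselves. One common preprocessing option, which is not part of the workflow shown here, is to drop constant columns and standardize the rest before clustering. A minimal base-R sketch on a made-up feature matrix:

```r
set.seed(123)

# Toy feature matrix (made-up values): one row per series, one column per feature
features <- cbind(
  frequency = rep(52, 6),                              # constant across series
  trend     = c(0.05, 0.10, 0.08, 0.80, 0.75, 0.85),
  linearity = c(-0.5, 0.4, 0.1, 6.2, 5.9, 6.4)
)
rownames(features) <- paste0("series_", 1:6)

# Drop zero-variance columns; scale() would turn them into NaN
keep <- apply(features, 2, function(col) sd(col) > 0)

# Standardize the remaining columns to mean 0 and standard deviation 1
features_scaled <- scale(features[, keep, drop = FALSE])

# Cluster on the standardized features
fit <- kmeans(features_scaled, centers = 2, nstart = 25)

fit$cluster
```

In the pipeline above, the equivalent step would sit between as.matrix() and kmeans(). Whether it changes the groups depends on the data, and with only seven series any clustering should be read as illustrative.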

Visualize the Cluster Assignments

Finally, we can visualize the cluster assignments by joining the cluster_tbl with the original walmart_sales_weekly and then plotting with plot_time_series().

cluster_tbl %>%
    select(cluster, id) %>%
    right_join(walmart_sales_weekly, by = "id") %>%
    group_by(id) %>%
    plot_time_series(
      Date, Weekly_Sales, 
      .color_var   = cluster, 
      .facet_ncol  = 2, 
      .interactive = FALSE
    )

Learning More

My Talk on High-Performance Time Series Forecasting

Time series is changing. Businesses now need 10,000+ time series forecasts every day. This is what I call a High-Performance Time Series Forecasting System (HPTSF) - Accurate, Robust, and Scalable Forecasting.

High-Performance Forecasting Systems will save companies MILLIONS of dollars. Imagine what will happen to your career if you can provide your organization a “High-Performance Time Series Forecasting System” (HPTSF System).

I teach how to build an HPTSF System in my High-Performance Time Series Forecasting Course. If you are interested in learning Scalable High-Performance Forecasting Strategies, then take my course. You will learn:

  • Time Series Machine Learning (cutting-edge) with Modeltime - 30+ Models (Prophet, ARIMA, XGBoost, Random Forest, & many more)
  • NEW - Deep Learning with GluonTS (Competition Winners)
  • Time Series Preprocessing, Noise Reduction, & Anomaly Detection
  • Feature engineering using lagged variables & external regressors
  • Hyperparameter Tuning
  • Time series cross-validation
  • Ensembling Multiple Machine Learning & Univariate Modeling Techniques (Competition Winner)
  • Scalable Forecasting - Forecast 1000+ time series in parallel
  • and more.

Unlock the High-Performance Time Series Forecasting Course