Segment a portfolio of time series with a feature-based clustering workflow.
This applied tutorial covers the use of:
tk.ts_summary() to validate that each series is ready for modeling.
tk.ts_features() to engineer clustering-ready signatures.
tk.plot_timeseries() to review the resulting groups.
sklearn.cluster.KMeans to segment similar time series.
Get to a working clustering pipeline in five minutes.
Engineer richer features and run K-Means.
Inspect and communicate the clusters.
Five Minutes to Cluster
Load packages
Code
import pytimetk as tk
import pandas as pd
import numpy as np
from tsfeatures import acf_features, lumpiness, entropy
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import plotly.express as px
Load and glimpse the stock history
We’ll use the built-in stocks_daily dataset (six large-cap tech tickers) so the entire workflow runs locally without extra data wrangling.
Code
stocks = tk.load_dataset("stocks_daily", parse_dates=["date"])
stocks.glimpse()
<class 'pandas.core.frame.DataFrame'>: 16194 rows of 8 columns
symbol: object ['META', 'META', 'META', 'META', 'META', 'M ...
date: datetime64[ns] [Timestamp('2013-01-02 00:00:00'), Timestam ...
open: float64 [27.440000534057617, 27.8799991607666, 28.0 ...
high: float64 [28.18000030517578, 28.469999313354492, 28. ...
low: float64 [27.420000076293945, 27.59000015258789, 27. ...
close: float64 [28.0, 27.770000457763672, 28.7600002288818 ...
volume: int64 [69846400, 63140600, 72715400, 83781800, 45 ...
adjusted: float64 [28.0, 27.770000457763672, 28.7600002288818 ...
Here date serves as the time index, adjusted as the value column, and symbol as the grouping column.
Understand the time structure with ts_summary
Before engineering features, confirm that every series shares a common frequency and coverage. ts_summary() handles this in one line.
Code
ts_profile = (
    stocks
    .groupby("symbol", sort=False)
    .ts_summary(date_column="date")
    .loc[:, ["symbol", "date_start", "date_end", "date_n", "diff_median_seconds", "diff_max_seconds"]]
    .assign(
        trading_years=lambda df_: (pd.to_datetime(df_["date_end"]) - pd.to_datetime(df_["date_start"])).dt.days / 365.25
    )
)
ts_profile
  symbol  date_start    date_end  date_n  diff_median_seconds  diff_max_seconds  trading_years
0   META  2013-01-02  2023-09-21    2699              86400.0          345600.0      10.715948
0   AMZN  2013-01-02  2023-09-21    2699              86400.0          345600.0      10.715948
0   AAPL  2013-01-02  2023-09-21    2699              86400.0          345600.0      10.715948
0   NFLX  2013-01-02  2023-09-21    2699              86400.0          345600.0      10.715948
0   NVDA  2013-01-02  2023-09-21    2699              86400.0          345600.0      10.715948
0   GOOG  2013-01-02  2023-09-21    2699              86400.0          345600.0      10.715948
The stocks_daily dataset gives us over a decade of trading history per ticker with a consistent trading-day cadence (the longest gap between observations is four days), which is exactly what we want for clustering.
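Here the check passes by inspection, since every row reports the same span and observation count. If you adapt this workflow to a larger portfolio, the same validation can be made programmatic with a small sketch that uses only columns already present in ts_profile:
Code
# Sanity checks on the ts_summary profile: every series should share the same
# length, start date, and end date before we engineer features.
assert ts_profile["date_n"].nunique() == 1, "series have different lengths"
assert ts_profile["date_start"].nunique() == 1, "series start on different dates"
assert ts_profile["date_end"].nunique() == 1, "series end on different dates"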
Engineer clustering features
We’ll create two complementary feature sets: performance metrics derived from returns and higher-order time-series signatures from ts_features(). Combining both gives K-Means enough separation power.
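The performance table referenced in the merge step below isn't reproduced in this excerpt. As a minimal sketch of what it could look like, the snippet below derives the four metrics the later steps rely on (avg_daily_return, annualized volatility, average daily volume, and total_return); the avg_volume column name and the exact calculations are assumptions, not the tutorial's verbatim code.
Code
# Sketch of a returns-based performance table (column names beyond those used
# by the plotting code later are assumptions).
performance = (
    stocks
    .sort_values(["symbol", "date"])
    .assign(daily_return=lambda df_: df_.groupby("symbol")["adjusted"].pct_change())
    .groupby("symbol", sort=False)
    .agg(
        avg_daily_return=("daily_return", "mean"),
        volatility=("daily_return", lambda s: s.std() * np.sqrt(252)),   # annualized
        avg_volume=("volume", "mean"),
        total_return=("adjusted", lambda s: s.iloc[-1] / s.iloc[0] - 1),
    )
    .reset_index()
)
performance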
Add automated time-series features with ts_features
Code
ts_feature_matrix = (
    stocks
    .groupby("symbol", sort=False)
    .ts_features(
        date_column="date",
        value_column="adjusted",
        features=[acf_features, lumpiness, entropy],
        freq=5,  # one trading week seasonality hint
        threads=1,
        show_progress=False,
    )
)
ts_feature_matrix
  symbol   entropy  lumpiness    x_acf1   x_acf10  diff1_acf1  diff1_acf10  diff2_acf1  diff2_acf10  seas_acf1
0   AAPL  0.207972   0.000005  0.998696  9.858531   -0.037978     0.013670   -0.505131     0.292853   0.993551
1   AMZN  0.145684   0.000011  0.998821  9.868549   -0.023167     0.009015   -0.518806     0.293242   0.994014
2   GOOG  0.210518   0.000009  0.998404  9.826556   -0.035739     0.011965   -0.513201     0.279758   0.991927
3   META  0.255702   0.000082  0.997744  9.770694   -0.037784     0.012685   -0.526608     0.299610   0.989219
4   NFLX  0.182489   0.000037  0.998426  9.829973   -0.026235     0.007010   -0.521142     0.281312   0.992120
5   NVDA  0.317476   0.000103  0.997048  9.650296   -0.016769     0.016031   -0.498054     0.280397   0.984443
The selected feature trio captures short-term autocorrelation (acf_features), regime shifts (lumpiness), and complexity (entropy) without overwhelming our small sample.
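If you want to see exactly which columns each callable contributes before running the full pipeline, you can call them directly on one series. This assumes the tsfeatures signature of a numpy array plus a frequency; each function returns a dict keyed by feature name.
Code
# Peek at the feature names each function contributes (AAPL used as an example).
aapl_prices = stocks.query("symbol == 'AAPL'")["adjusted"].to_numpy()
print(list(acf_features(aapl_prices, freq=5)))   # x_acf1, x_acf10, diff1_acf1, ...
print(list(lumpiness(aapl_prices, freq=5)))      # lumpiness
print(list(entropy(aapl_prices, freq=5)))        # entropy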
Combine the feature blocks
Code
feature_table = (
    performance
    .merge(ts_feature_matrix, on="symbol")
    .merge(
        ts_profile.loc[:, ["symbol", "trading_years"]],
        on="symbol"
    )
)
feature_table
  symbol  avg_daily_return  volatility    avg_volume  total_return   entropy  lumpiness    x_acf1   x_acf10  diff1_acf1  diff1_acf10  diff2_acf1  diff2_acf10  seas_acf1  trading_years
0   AAPL          0.001030    0.286314  1.639852e+08      9.358414  0.207972   0.000005  0.998696  9.858531   -0.037978     0.013670   -0.505131     0.292853   0.993551      10.715948
1   AMZN          0.001068    0.327350  7.901033e+07      9.052466  0.145684   0.000011  0.998821  9.868549   -0.023167     0.009015   -0.518806     0.293242   0.994014      10.715948
2   GOOG          0.000885    0.274113  3.875715e+07      6.292216  0.210518   0.000009  0.998404  9.826556   -0.035739     0.011965   -0.513201     0.279758   0.991927      10.715948
3   META          0.001170    0.385613  2.951687e+07      9.561786  0.255702   0.000082  0.997744  9.770694   -0.037784     0.012685   -0.526608     0.299610   0.989219      10.715948
4   NFLX          0.001689    0.471206  1.221723e+07     28.225626  0.182489   0.000037  0.998426  9.829973   -0.026235     0.007010   -0.521142     0.281312   0.992120      10.715948
5   NVDA          0.002229    0.449569  4.494016e+07    138.692436  0.317476   0.000103  0.997048  9.650296   -0.016769     0.016031   -0.498054     0.280397   0.984443      10.715948
All features are numeric except for symbol, so they’re ready for scaling and clustering.
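A one-line sanity check makes that explicit before handing the matrix to scikit-learn:
Code
# Only the identifier column should be non-numeric.
feature_table.select_dtypes(exclude="number").columns.tolist()   # expect ['symbol']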
Fit K-Means and visualize clusters
Scale, cluster, and project
Code
numeric_cols = feature_table.select_dtypes(include="number").columns

scaler = StandardScaler()
X_scaled = scaler.fit_transform(feature_table[numeric_cols])

kmeans = KMeans(n_clusters=3, n_init="auto", random_state=123)
feature_table["cluster"] = kmeans.fit_predict(X_scaled)

pca = PCA(n_components=2, random_state=123)
embedding = pca.fit_transform(X_scaled)
feature_table[["pc1", "pc2"]] = embedding
feature_table
  symbol  avg_daily_return  volatility    avg_volume  total_return   entropy  lumpiness    x_acf1   x_acf10  diff1_acf1  diff1_acf10  diff2_acf1  diff2_acf10  seas_acf1  trading_years  cluster       pc1       pc2
0   AAPL          0.001030    0.286314  1.639852e+08      9.358414  0.207972   0.000005  0.998696  9.858531   -0.037978     0.013670   -0.505131     0.292853   0.993551      10.715948        1  2.160118  2.444434
1   AMZN          0.001068    0.327350  7.901033e+07      9.052466  0.145684   0.000011  0.998821  9.868549   -0.023167     0.009015   -0.518806     0.293242   0.994014      10.715948        1  2.426564 -0.709014
2   GOOG          0.000885    0.274113  3.875715e+07      6.292216  0.210518   0.000009  0.998404  9.826556   -0.035739     0.011965   -0.513201     0.279758   0.991927      10.715948        1  1.408552  0.472603
3   META          0.001170    0.385613  2.951687e+07      9.561786  0.255702   0.000082  0.997744  9.770694   -0.037784     0.012685   -0.526608     0.299610   0.989219      10.715948        0 -0.377876 -0.104335
4   NFLX          0.001689    0.471206  1.221723e+07     28.225626  0.182489   0.000037  0.998426  9.829973   -0.026235     0.007010   -0.521142     0.281312   0.992120      10.715948        0  0.233911 -2.723692
5   NVDA          0.002229    0.449569  4.494016e+07    138.692436  0.317476   0.000103  0.997048  9.650296   -0.016769     0.016031   -0.498054     0.280397   0.984443      10.715948        2 -5.851268  0.620004
This small universe ends up with three distinct segments:
Cluster 0: Momentum-heavy names with higher entropy (META, NFLX).
Cluster 1: Steady stalwarts (AAPL, AMZN, GOOG).
Cluster 2: High-volatility outlier (NVDA).
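Before reading too much into the two-dimensional map in the next step, it is worth checking how much of the scaled feature variance the two principal components actually retain. explained_variance_ratio_ is a standard attribute of the fitted scikit-learn PCA object:
Code
print(pca.explained_variance_ratio_)         # share of variance per component
print(pca.explained_variance_ratio_.sum())   # total variance captured by pc1 + pc2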
Explore the embedding with Plotly
Code
scatter_fig = px.scatter(
    feature_table,
    x="pc1",
    y="pc2",
    color=feature_table["cluster"].astype(str),
    text="symbol",
    size="volatility",
    hover_data={
        "avg_daily_return": ":.4f",
        "volatility": ":.2f",
        "total_return": ":.1%",
        "cluster": False,
        "pc1": False,
        "pc2": False,
    },
    title="Stock clusters powered by pytimetk features",
    width=800,
    height=500,
)
scatter_fig.update_traces(textposition="top center")
scatter_fig
The hover tooltips make it easy to validate how volatility and long-run growth drive the separation.
Inspect the original series by cluster
Bring the cluster assignments back to the raw data and lean on plot_timeseries() for an interactive review.
Code
clustered_history = stocks.merge(feature_table[["symbol", "cluster"]], on="symbol")

cluster_fig = (
    clustered_history
    .groupby("cluster")
    .plot_timeseries(
        date_column="date",
        value_column="adjusted",
        color_column="symbol",
        facet_ncol=1,
        facet_scales="free_y",
        smooth=False,
        plotly_dropdown=True,
        width=900,
        height=600,
        engine="plotly",
        title="Clustered price history (toggle clusters with the dropdown)",
    )
)
cluster_fig
Use the Plotly dropdown to focus on a single segment and drill down by ticker.
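To communicate the segments without a chart, a small per-cluster summary also works well. This is a convenience sketch built from columns already in feature_table; swap in whichever metrics matter to your audience.
Code
# Average a few headline metrics per cluster for a quick narrative table.
cluster_summary = (
    feature_table
    .groupby("cluster")[["avg_daily_return", "volatility", "total_return"]]
    .mean()
    .round(4)
)
cluster_summary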
Where to go next
Swap in additional ts_features (e.g., hurst, stability) or engineered ratios to refine the segmentation.
Run KMeans across a range of k values and compare inertia or silhouette scores for a data-driven cluster count (a sketch of that sweep follows this list).
Schedule the workflow: persist feature_table, then monitor how new data shifts cluster memberships over time.
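As a minimal sketch of that model-selection sweep, reusing the X_scaled matrix from above: silhouette scores need at least two clusters and fewer clusters than samples, which caps k at 5 for this six-ticker universe.
Code
from sklearn.metrics import silhouette_score

# Compare inertia and silhouette across candidate cluster counts.
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init="auto", random_state=123)
    labels = km.fit_predict(X_scaled)
    print(f"k={k}  inertia={km.inertia_:.2f}  silhouette={silhouette_score(X_scaled, labels):.3f}")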