Clustering Stocks

Segment a portfolio of time series with a feature-based clustering workflow.

1 This applied tutorial covers the use of:

  • tk.ts_summary() to confirm each series' cadence and coverage before feature engineering.
  • tk.ts_features() to engineer clustering-ready signatures.
  • tk.plot_timeseries() to review the resulting groups.
  • sklearn.cluster.KMeans to segment similar time series.

How to navigate this guide
  1. Get to a working clustering pipeline in five minutes.
  2. Engineer richer features and run K-Means.
  3. Inspect and communicate the clusters.

2 Five Minutes to Cluster

2.1 Load packages

Code
import pytimetk as tk
import pandas as pd
import numpy as np

from tsfeatures import acf_features, lumpiness, entropy

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

import plotly.express as px

2.2 Load and glimpse the stock history

We’ll use the built-in stocks_daily dataset (six large-cap tech tickers) so the entire workflow runs locally without extra data wrangling.

Code
stocks = tk.load_dataset("stocks_daily", parse_dates=["date"])
stocks.glimpse()
<class 'pandas.core.frame.DataFrame'>: 16194 rows of 8 columns
symbol:    object            ['META', 'META', 'META', 'META', 'META', 'M ...
date:      datetime64[ns]    [Timestamp('2013-01-02 00:00:00'), Timestam ...
open:      float64           [27.440000534057617, 27.8799991607666, 28.0 ...
high:      float64           [28.18000030517578, 28.469999313354492, 28. ...
low:       float64           [27.420000076293945, 27.59000015258789, 27. ...
close:     float64           [28.0, 27.770000457763672, 28.7600002288818 ...
volume:    int64             [69846400, 63140600, 72715400, 83781800, 45 ...
adjusted:  float64           [28.0, 27.770000457763672, 28.7600002288818 ...
3 core properties of time series data
  • Time index: date
  • Value column: adjusted
  • Group column: symbol
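
If you are working with your own prices instead of stocks_daily, any long-format DataFrame that exposes these same three columns will slot into the rest of the workflow. A minimal sketch, assuming a hypothetical raw_prices DataFrame with ticker, trade_date, and adj_close columns:

Code
# Hypothetical example: rename your own columns to match the layout used below
# (raw_prices, ticker, trade_date, and adj_close are illustrative names).
my_stocks = (
    raw_prices
        .rename(columns={
            "ticker": "symbol",        # group column
            "trade_date": "date",      # time index
            "adj_close": "adjusted",   # value column
        })
        .assign(date=lambda df_: pd.to_datetime(df_["date"]))
)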

2.3 Understand the time structure with ts_summary

Before engineering features, confirm that every series shares a common frequency and coverage. ts_summary() handles this in one line.

Code
ts_profile = (
    stocks
        .groupby("symbol", sort=False)
        .ts_summary(date_column="date")
        .loc[:, ["symbol", "date_start", "date_end", "date_n", "diff_median_seconds", "diff_max_seconds"]]
        .assign(
            trading_years=lambda df_: (pd.to_datetime(df_["date_end"]) - pd.to_datetime(df_["date_start"])).dt.days / 365.25
        )
)

ts_profile
symbol date_start date_end date_n diff_median_seconds diff_max_seconds trading_years
0 META 2013-01-02 2023-09-21 2699 86400.0 345600.0 10.715948
0 AMZN 2013-01-02 2023-09-21 2699 86400.0 345600.0 10.715948
0 AAPL 2013-01-02 2023-09-21 2699 86400.0 345600.0 10.715948
0 NFLX 2013-01-02 2023-09-21 2699 86400.0 345600.0 10.715948
0 NVDA 2013-01-02 2023-09-21 2699 86400.0 345600.0 10.715948
0 GOOG 2013-01-02 2023-09-21 2699 86400.0 345600.0 10.715948

The stocks_daily dataset gives us over a decade of trading history per ticker at a steady business-day cadence (the median spacing is one day, and the largest gaps are long weekends and holidays), which is exactly what we want for clustering.
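
If this profile feeds an automated pipeline, it is worth turning the visual check into a guard. A minimal sketch, with illustrative thresholds, built on the ts_profile table above:

Code
# Fail fast if a ticker deviates from the expected daily cadence or has
# noticeably less history than the others (thresholds are illustrative).
assert (ts_profile["diff_median_seconds"] == 86400).all(), "unexpected cadence"
assert ts_profile["date_n"].min() >= 0.95 * ts_profile["date_n"].max(), "a series is much shorter than the rest"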

3 Engineer clustering features

We’ll create two complementary feature sets: performance metrics derived from returns and higher-order time-series signatures from ts_features(). Combining both gives K-Means enough separation power.

3.1 Create performance signatures with pandas

Code
stocks = (
    stocks
        .sort_values(["symbol", "date"])
        .assign(
            daily_return=lambda df_: df_.groupby("symbol")["adjusted"].pct_change()
        )
)

performance = (
    stocks
        .groupby("symbol", sort=False)
        .agg(
            avg_daily_return=("daily_return", "mean"),
            volatility=("daily_return", "std"),
            avg_volume=("volume", "mean"),
            total_return=("adjusted", lambda s: s.iloc[-1] / s.iloc[0] - 1),
        )
        .reset_index()
)

performance["volatility"] = performance["volatility"] * np.sqrt(252)

performance
symbol avg_daily_return volatility avg_volume total_return
0 AAPL 0.001030 0.286314 1.639852e+08 9.358414
1 AMZN 0.001068 0.327350 7.901033e+07 9.052466
2 GOOG 0.000885 0.274113 3.875715e+07 6.292216
3 META 0.001170 0.385613 2.951687e+07 9.561786
4 NFLX 0.001689 0.471206 1.221723e+07 28.225626
5 NVDA 0.002229 0.449569 4.494016e+07 138.692436

volatility is annualized with the usual sqrt(252) scaling, total_return measures growth over the full sample, and avg_volume helps distinguish high-activity tickers (e.g., AAPL) from lower-volume names (e.g., NFLX).
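
If you want one more return-based signature, a maximum-drawdown column is a natural companion to volatility. The sketch below is optional and not part of the feature set used in this tutorial:

Code
# Optional extra signature: worst peak-to-trough decline per ticker,
# computed from the running maximum of the adjusted price.
max_dd = (
    stocks
        .groupby("symbol", sort=False)["adjusted"]
        .apply(lambda s: (s / s.cummax() - 1).min())
        .rename("max_drawdown")
        .reset_index()
)
# performance = performance.merge(max_dd, on="symbol")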

3.2 Add automated time-series features with ts_features

Code
ts_feature_matrix = (
    stocks
        .groupby("symbol", sort=False)
        .ts_features(
            date_column="date",
            value_column="adjusted",
            features=[acf_features, lumpiness, entropy],
            freq=5,                 # one trading week seasonality hint
            threads=1,
            show_progress=False,
        )
)

ts_feature_matrix
symbol entropy lumpiness x_acf1 x_acf10 diff1_acf1 diff1_acf10 diff2_acf1 diff2_acf10 seas_acf1
0 AAPL 0.207972 0.000005 0.998696 9.858531 -0.037978 0.013670 -0.505131 0.292853 0.993551
1 AMZN 0.145684 0.000011 0.998821 9.868549 -0.023167 0.009015 -0.518806 0.293242 0.994014
2 GOOG 0.210518 0.000009 0.998404 9.826556 -0.035739 0.011965 -0.513201 0.279758 0.991927
3 META 0.255702 0.000082 0.997744 9.770694 -0.037784 0.012685 -0.526608 0.299610 0.989219
4 NFLX 0.182489 0.000037 0.998426 9.829973 -0.026235 0.007010 -0.521142 0.281312 0.992120
5 NVDA 0.317476 0.000103 0.997048 9.650296 -0.016769 0.016031 -0.498054 0.280397 0.984443

The selected feature trio captures short-term autocorrelation structure (acf_features), shifts in variance across blocks of the series (lumpiness), and overall signal complexity (entropy) without overwhelming our small sample of six tickers.
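
If you want a richer signature, the same ts_features() call accepts a longer feature list. A sketch that adds two more tsfeatures functions (hurst and stability, both mentioned in the closing section):

Code
# Optional: widen the signature with additional tsfeatures functions.
from tsfeatures import hurst, stability

richer_features = (
    stocks
        .groupby("symbol", sort=False)
        .ts_features(
            date_column="date",
            value_column="adjusted",
            features=[acf_features, lumpiness, entropy, hurst, stability],
            freq=5,
            threads=1,
            show_progress=False,
        )
)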

3.3 Combine the feature blocks

Code
feature_table = (
    performance
        .merge(ts_feature_matrix, on="symbol")
        .merge(
            ts_profile.loc[:, ["symbol", "trading_years"]],
            on="symbol"
        )
)

feature_table
symbol avg_daily_return volatility avg_volume total_return entropy lumpiness x_acf1 x_acf10 diff1_acf1 diff1_acf10 diff2_acf1 diff2_acf10 seas_acf1 trading_years
0 AAPL 0.001030 0.286314 1.639852e+08 9.358414 0.207972 0.000005 0.998696 9.858531 -0.037978 0.013670 -0.505131 0.292853 0.993551 10.715948
1 AMZN 0.001068 0.327350 7.901033e+07 9.052466 0.145684 0.000011 0.998821 9.868549 -0.023167 0.009015 -0.518806 0.293242 0.994014 10.715948
2 GOOG 0.000885 0.274113 3.875715e+07 6.292216 0.210518 0.000009 0.998404 9.826556 -0.035739 0.011965 -0.513201 0.279758 0.991927 10.715948
3 META 0.001170 0.385613 2.951687e+07 9.561786 0.255702 0.000082 0.997744 9.770694 -0.037784 0.012685 -0.526608 0.299610 0.989219 10.715948
4 NFLX 0.001689 0.471206 1.221723e+07 28.225626 0.182489 0.000037 0.998426 9.829973 -0.026235 0.007010 -0.521142 0.281312 0.992120 10.715948
5 NVDA 0.002229 0.449569 4.494016e+07 138.692436 0.317476 0.000103 0.997048 9.650296 -0.016769 0.016031 -0.498054 0.280397 0.984443 10.715948

Every column except the symbol identifier is numeric, so the table is ready for scaling and clustering.
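
Before scaling, a quick hygiene check is still worthwhile: K-Means cannot handle missing values, and constant columns (here trading_years, which is identical for all six tickers) contribute nothing to the distances once standardized. A minimal sketch:

Code
# Confirm there are no missing values and flag constant columns, which
# carry no information for clustering (e.g., trading_years in this sample).
numeric = feature_table.select_dtypes(include="number")
print(numeric.isna().sum().sum())                          # expect 0
print(numeric.columns[numeric.nunique() <= 1].tolist())    # e.g., ['trading_years']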

4 Fit K-Means and visualize clusters

4.1 Scale, cluster, and project

Code
numeric_cols = feature_table.select_dtypes(include="number").columns

scaler = StandardScaler()
X_scaled = scaler.fit_transform(feature_table[numeric_cols])

kmeans = KMeans(n_clusters=3, n_init="auto", random_state=123)
feature_table["cluster"] = kmeans.fit_predict(X_scaled)

pca = PCA(n_components=2, random_state=123)
embedding = pca.fit_transform(X_scaled)

feature_table[["pc1", "pc2"]] = embedding

feature_table
symbol avg_daily_return volatility avg_volume total_return entropy lumpiness x_acf1 x_acf10 diff1_acf1 diff1_acf10 diff2_acf1 diff2_acf10 seas_acf1 trading_years cluster pc1 pc2
0 AAPL 0.001030 0.286314 1.639852e+08 9.358414 0.207972 0.000005 0.998696 9.858531 -0.037978 0.013670 -0.505131 0.292853 0.993551 10.715948 1 2.160118 2.444434
1 AMZN 0.001068 0.327350 7.901033e+07 9.052466 0.145684 0.000011 0.998821 9.868549 -0.023167 0.009015 -0.518806 0.293242 0.994014 10.715948 1 2.426564 -0.709014
2 GOOG 0.000885 0.274113 3.875715e+07 6.292216 0.210518 0.000009 0.998404 9.826556 -0.035739 0.011965 -0.513201 0.279758 0.991927 10.715948 1 1.408552 0.472603
3 META 0.001170 0.385613 2.951687e+07 9.561786 0.255702 0.000082 0.997744 9.770694 -0.037784 0.012685 -0.526608 0.299610 0.989219 10.715948 0 -0.377876 -0.104335
4 NFLX 0.001689 0.471206 1.221723e+07 28.225626 0.182489 0.000037 0.998426 9.829973 -0.026235 0.007010 -0.521142 0.281312 0.992120 10.715948 0 0.233911 -2.723692
5 NVDA 0.002229 0.449569 4.494016e+07 138.692436 0.317476 0.000103 0.997048 9.650296 -0.016769 0.016031 -0.498054 0.280397 0.984443 10.715948 2 -5.851268 0.620004

This small universe ends up with three distinct segments:

  • Cluster 0: Higher-volatility growth names (META, NFLX).
  • Cluster 1: Steadier stalwarts (AAPL, AMZN, GOOG).
  • Cluster 2: Extreme-growth outlier (NVDA), whose total return dwarfs the rest (see the quick check below).
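
One quick way to confirm these labels is to average a few key features within each cluster:

Code
# Sanity-check the cluster narrative with per-cluster feature means.
(
    feature_table
        .groupby("cluster")[["volatility", "total_return", "avg_volume", "entropy"]]
        .mean()
        .round(3)
)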

4.2 Explore the embedding with Plotly

Code
scatter_fig = px.scatter(
    feature_table,
    x="pc1",
    y="pc2",
    color=feature_table["cluster"].astype(str),
    text="symbol",
    size="volatility",
    hover_data={
        "avg_daily_return":":.4f",
        "volatility":":.2f",
        "total_return":":.1%",
        "cluster": False,
        "pc1": False,
        "pc2": False,
    },
    title="Stock clusters powered by pytimetk features",
    width=800,
    height=500,
)

scatter_fig.update_traces(textposition="top center")
scatter_fig

The hover tooltips make it easy to validate how volatility and long-run growth drive the separation.
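
Because the scatter is only a two-dimensional projection of the full feature space, it is also worth checking how much variance the two plotted components actually capture, using the fitted pca object from above:

Code
# Share of variance captured by the two components used for plotting.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())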

4.3 Inspect the original series by cluster

Bring the cluster assignments back to the raw data and lean on plot_timeseries() for an interactive review.

Code
clustered_history = stocks.merge(feature_table[["symbol", "cluster"]], on="symbol")

cluster_fig = (
    clustered_history
        .groupby("cluster")
        .plot_timeseries(
            date_column="date",
            value_column="adjusted",
            color_column="symbol",
            facet_ncol=1,
            facet_scales="free_y",
            smooth=False,
            plotly_dropdown=True,
            width=900,
            height=600,
            engine="plotly",
            title="Clustered price history (toggle clusters with the dropdown)",
        )
)

cluster_fig

Use the Plotly dropdown to focus on a single segment and drill down by ticker.
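
If you need to share the review outside a notebook, the plotly engine returns a regular Plotly figure here, so the interactive dropdown survives an export to a self-contained HTML file (a sketch, assuming that return type):

Code
# Export the interactive figure; include_plotlyjs="cdn" keeps the file small.
cluster_fig.write_html("clustered_price_history.html", include_plotlyjs="cdn")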

5 Where to go next

  • Swap in additional ts_features (e.g., hurst, stability) or engineered ratios to refine the segmentation.
  • Run KMeans across a range of k values and compare inertia or silhouette scores for a data-driven cluster count (see the sketch below).
  • Schedule the workflow: persist feature_table, then monitor how new data shifts cluster memberships over time.
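
As a starting point for the second suggestion, here is a minimal sketch of a k sweep using silhouette scores on the scaled feature matrix (illustrative only; with six tickers the usable range of k is small):

Code
# Compare silhouette scores across candidate cluster counts.
from sklearn.metrics import silhouette_score

for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init="auto", random_state=123).fit_predict(X_scaled)
    print(k, round(silhouette_score(X_scaled, labels), 3))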