Clustering Stocks

Segment a portfolio of time series with a feature-based clustering workflow.

1 This applied tutorial covers the use of:

  • tk.ts_summary() to confirm each series' cadence and coverage before feature engineering.
  • tk.ts_features() to engineer clustering-ready signatures.
  • tk.plot_timeseries() to review the resulting groups.
  • sklearn.cluster.KMeans to segment similar time series.

How to navigate this guide
  1. Get to a working clustering pipeline in five minutes.
  2. Engineer richer features and run K-Means.
  3. Inspect and communicate the clusters.

2 Five Minutes to Cluster

2.1 Load packages

Code
import pytimetk as tk
import pandas as pd
import numpy as np

from tsfeatures import acf_features, lumpiness, entropy

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

import plotly.express as px

2.2 Load and glimpse the stock history

We’ll use the built-in stocks_daily dataset (six large-cap tech tickers) so the entire workflow runs locally without extra data wrangling.

Code
stocks = tk.load_dataset("stocks_daily", parse_dates=["date"])
stocks.glimpse()
<class 'pandas.core.frame.DataFrame'>: 16194 rows of 8 columns
symbol:    object            ['META', 'META', 'META', 'META', 'META', 'M ...
date:      datetime64[ns]    [Timestamp('2013-01-02 00:00:00'), Timestam ...
open:      float64           [27.440000534057617, 27.8799991607666, 28.0 ...
high:      float64           [28.18000030517578, 28.469999313354492, 28. ...
low:       float64           [27.420000076293945, 27.59000015258789, 27. ...
close:     float64           [28.0, 27.770000457763672, 28.7600002288818 ...
volume:    int64             [69846400, 63140600, 72715400, 83781800, 45 ...
adjusted:  float64           [28.0, 27.770000457763672, 28.7600002288818 ...
3 core properties of time series data
  • Time index: date
  • Value column: adjusted
  • Group column: symbol
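
If you are working with your own prices instead of stocks_daily, any long-format DataFrame that exposes these same three columns will slot into the rest of the workflow. A minimal sketch, assuming a hypothetical raw_prices DataFrame with ticker, trade_date, and adj_close columns:

Code
# Hypothetical example: rename your own columns to match the layout used below
# (raw_prices, ticker, trade_date, and adj_close are illustrative names).
my_stocks = (
    raw_prices
        .rename(columns={
            "ticker": "symbol",        # group column
            "trade_date": "date",      # time index
            "adj_close": "adjusted",   # value column
        })
        .assign(date=lambda df_: pd.to_datetime(df_["date"]))
)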

2.3 Understand the time structure with ts_summary

Before engineering features, confirm that every series shares a common frequency and coverage. ts_summary() handles this in one line.

Code
ts_profile = (
    stocks
        .groupby("symbol", sort=False)
        .ts_summary(date_column="date")
        .loc[:, ["symbol", "date_start", "date_end", "date_n", "diff_median_seconds", "diff_max_seconds"]]
        .assign(
            trading_years=lambda df_: (pd.to_datetime(df_["date_end"]) - pd.to_datetime(df_["date_start"])).dt.days / 365.25
        )
)

ts_profile
symbol date_start date_end date_n diff_median_seconds diff_max_seconds trading_years
0 META 2013-01-02 2023-09-21 2699 86400.0 345600.0 10.715948
0 AMZN 2013-01-02 2023-09-21 2699 86400.0 345600.0 10.715948
0 AAPL 2013-01-02 2023-09-21 2699 86400.0 345600.0 10.715948
0 NFLX 2013-01-02 2023-09-21 2699 86400.0 345600.0 10.715948
0 NVDA 2013-01-02 2023-09-21 2699 86400.0 345600.0 10.715948
0 GOOG 2013-01-02 2023-09-21 2699 86400.0 345600.0 10.715948

The stocks_daily dataset gives us over a decade of trading history per ticker at a steady business-day cadence (the median spacing is one day, and the largest gaps are long weekends and holidays), which is exactly what we want for clustering.
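
If this profile feeds an automated pipeline, it is worth turning the visual check into a guard. A minimal sketch, with illustrative thresholds, built on the ts_profile table above:

Code
# Fail fast if a ticker deviates from the expected daily cadence or has
# noticeably less history than the others (thresholds are illustrative).
assert (ts_profile["diff_median_seconds"] == 86400).all(), "unexpected cadence"
assert ts_profile["date_n"].min() >= 0.95 * ts_profile["date_n"].max(), "a series is much shorter than the rest"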

3 Engineer clustering features

We’ll create two complementary feature sets: performance metrics derived from returns and higher-order time-series signatures from ts_features(). Combining both gives K-Means enough separation power.

3.1 Create performance signatures with pandas

Code
stocks = (
    stocks
        .sort_values(["symbol", "date"])
        .assign(
            daily_return=lambda df_: df_.groupby("symbol")["adjusted"].pct_change()
        )
)

performance = (
    stocks
        .groupby("symbol", sort=False)
        .agg(
            avg_daily_return=("daily_return", "mean"),
            volatility=("daily_return", "std"),
            avg_volume=("volume", "mean"),
            total_return=("adjusted", lambda s: s.iloc[-1] / s.iloc[0] - 1),
        )
        .reset_index()
)

performance["volatility"] = performance["volatility"] * np.sqrt(252)

performance
symbol avg_daily_return volatility avg_volume total_return
0 AAPL 0.001030 0.286314 1.639852e+08 9.358414
1 AMZN 0.001068 0.327350 7.901033e+07 9.052466
2 GOOG 0.000885 0.274113 3.875715e+07 6.292216
3 META 0.001170 0.385613 2.951687e+07 9.561786
4 NFLX 0.001689 0.471206 1.221723e+07 28.225626
5 NVDA 0.002229 0.449569 4.494016e+07 138.692436

volatility is annualized with the usual sqrt(252) scaling, total_return measures growth over the full sample, and avg_volume helps distinguish high-activity tickers (e.g., AAPL) from lower-volume names (e.g., NFLX).
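
If you want one more return-based signature, a maximum-drawdown column is a natural companion to volatility. The sketch below is optional and not part of the feature set used in this tutorial:

Code
# Optional extra signature: worst peak-to-trough decline per ticker,
# computed from the running maximum of the adjusted price.
max_dd = (
    stocks
        .groupby("symbol", sort=False)["adjusted"]
        .apply(lambda s: (s / s.cummax() - 1).min())
        .rename("max_drawdown")
        .reset_index()
)
# performance = performance.merge(max_dd, on="symbol")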

3.2 Add automated time-series features with ts_features

Code
ts_feature_matrix = (
    stocks
        .groupby("symbol", sort=False)
        .ts_features(
            date_column="date",
            value_column="adjusted",
            features=[acf_features, lumpiness, entropy],
            freq=5,                 # one trading week seasonality hint
            threads=1,
            show_progress=False,
        )
)

ts_feature_matrix
symbol entropy lumpiness x_acf1 x_acf10 diff1_acf1 diff1_acf10 diff2_acf1 diff2_acf10 seas_acf1
0 AAPL 0.207972 0.000005 0.998696 9.858531 -0.037978 0.013670 -0.505131 0.292853 0.993551
1 AMZN 0.145684 0.000011 0.998821 9.868549 -0.023167 0.009015 -0.518806 0.293242 0.994014
2 GOOG 0.210518 0.000009 0.998404 9.826556 -0.035739 0.011965 -0.513201 0.279758 0.991927
3 META 0.255702 0.000082 0.997744 9.770694 -0.037784 0.012685 -0.526608 0.299610 0.989219
4 NFLX 0.182489 0.000037 0.998426 9.829973 -0.026235 0.007010 -0.521142 0.281312 0.992120
5 NVDA 0.317476 0.000103 0.997048 9.650296 -0.016769 0.016031 -0.498054 0.280397 0.984443

The selected feature trio captures short-term autocorrelation structure (acf_features), shifts in variance across blocks of the series (lumpiness), and overall signal complexity (entropy) without overwhelming our small sample of six tickers.
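
If you want a richer signature, the same ts_features() call accepts a longer feature list. A sketch that adds two more tsfeatures functions (hurst and stability, both mentioned in the closing section):

Code
# Optional: widen the signature with additional tsfeatures functions.
from tsfeatures import hurst, stability

richer_features = (
    stocks
        .groupby("symbol", sort=False)
        .ts_features(
            date_column="date",
            value_column="adjusted",
            features=[acf_features, lumpiness, entropy, hurst, stability],
            freq=5,
            threads=1,
            show_progress=False,
        )
)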

3.3 Combine the feature blocks

Code
feature_table = (
    performance
        .merge(ts_feature_matrix, on="symbol")
        .merge(
            ts_profile.loc[:, ["symbol", "trading_years"]],
            on="symbol"
        )
)

feature_table
symbol avg_daily_return volatility avg_volume total_return entropy lumpiness x_acf1 x_acf10 diff1_acf1 diff1_acf10 diff2_acf1 diff2_acf10 seas_acf1 trading_years
0 AAPL 0.001030 0.286314 1.639852e+08 9.358414 0.207972 0.000005 0.998696 9.858531 -0.037978 0.013670 -0.505131 0.292853 0.993551 10.715948
1 AMZN 0.001068 0.327350 7.901033e+07 9.052466 0.145684 0.000011 0.998821 9.868549 -0.023167 0.009015 -0.518806 0.293242 0.994014 10.715948
2 GOOG 0.000885 0.274113 3.875715e+07 6.292216 0.210518 0.000009 0.998404 9.826556 -0.035739 0.011965 -0.513201 0.279758 0.991927 10.715948
3 META 0.001170 0.385613 2.951687e+07 9.561786 0.255702 0.000082 0.997744 9.770694 -0.037784 0.012685 -0.526608 0.299610 0.989219 10.715948
4 NFLX 0.001689 0.471206 1.221723e+07 28.225626 0.182489 0.000037 0.998426 9.829973 -0.026235 0.007010 -0.521142 0.281312 0.992120 10.715948
5 NVDA 0.002229 0.449569 4.494016e+07 138.692436 0.317476 0.000103 0.997048 9.650296 -0.016769 0.016031 -0.498054 0.280397 0.984443 10.715948

Every column except the symbol identifier is numeric, so the table is ready for scaling and clustering.
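
Before scaling, a quick hygiene check is still worthwhile: K-Means cannot handle missing values, and constant columns (here trading_years, which is identical for all six tickers) contribute nothing to the distances once standardized. A minimal sketch:

Code
# Confirm there are no missing values and flag constant columns, which
# carry no information for clustering (e.g., trading_years in this sample).
numeric = feature_table.select_dtypes(include="number")
print(numeric.isna().sum().sum())                          # expect 0
print(numeric.columns[numeric.nunique() <= 1].tolist())    # e.g., ['trading_years']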

4 Fit K-Means and visualize clusters

4.1 Scale, cluster, and project

Code
numeric_cols = feature_table.select_dtypes(include="number").columns

scaler = StandardScaler()
X_scaled = scaler.fit_transform(feature_table[numeric_cols])

kmeans = KMeans(n_clusters=3, n_init="auto", random_state=123)
feature_table["cluster"] = kmeans.fit_predict(X_scaled)

pca = PCA(n_components=2, random_state=123)
embedding = pca.fit_transform(X_scaled)

feature_table[["pc1", "pc2"]] = embedding

feature_table
symbol avg_daily_return volatility avg_volume total_return entropy lumpiness x_acf1 x_acf10 diff1_acf1 diff1_acf10 diff2_acf1 diff2_acf10 seas_acf1 trading_years cluster pc1 pc2
0 AAPL 0.001030 0.286314 1.639852e+08 9.358414 0.207972 0.000005 0.998696 9.858531 -0.037978 0.013670 -0.505131 0.292853 0.993551 10.715948 1 2.160118 2.444434
1 AMZN 0.001068 0.327350 7.901033e+07 9.052466 0.145684 0.000011 0.998821 9.868549 -0.023167 0.009015 -0.518806 0.293242 0.994014 10.715948 1 2.426564 -0.709014
2 GOOG 0.000885 0.274113 3.875715e+07 6.292216 0.210518 0.000009 0.998404 9.826556 -0.035739 0.011965 -0.513201 0.279758 0.991927 10.715948 1 1.408552 0.472603
3 META 0.001170 0.385613 2.951687e+07 9.561786 0.255702 0.000082 0.997744 9.770694 -0.037784 0.012685 -0.526608 0.299610 0.989219 10.715948 0 -0.377876 -0.104335
4 NFLX 0.001689 0.471206 1.221723e+07 28.225626 0.182489 0.000037 0.998426 9.829973 -0.026235 0.007010 -0.521142 0.281312 0.992120 10.715948 0 0.233911 -2.723692
5 NVDA 0.002229 0.449569 4.494016e+07 138.692436 0.317476 0.000103 0.997048 9.650296 -0.016769 0.016031 -0.498054 0.280397 0.984443 10.715948 2 -5.851268 0.620004

This small universe ends up with three distinct segments:

  • Cluster 0: Higher-volatility growth names (META, NFLX).
  • Cluster 1: Steadier stalwarts (AAPL, AMZN, GOOG).
  • Cluster 2: Extreme-growth outlier (NVDA), whose total return dwarfs the rest (see the quick check below).
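
One quick way to confirm these labels is to average a few key features within each cluster:

Code
# Sanity-check the cluster narrative with per-cluster feature means.
(
    feature_table
        .groupby("cluster")[["volatility", "total_return", "avg_volume", "entropy"]]
        .mean()
        .round(3)
)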

4.2 Explore the embedding with Plotly

Code
scatter_fig = px.scatter(
    feature_table,
    x="pc1",
    y="pc2",
    color=feature_table["cluster"].astype(str),
    text="symbol",
    size="volatility",
    hover_data={
        "avg_daily_return":":.4f",
        "volatility":":.2f",
        "total_return":":.1%",
        "cluster": False,
        "pc1": False,
        "pc2": False,
    },
    title="Stock clusters powered by pytimetk features",
    width=800,
    height=500,
)

scatter_fig.update_traces(textposition="top center")
scatter_fig

The hover tooltips make it easy to validate how volatility and long-run growth drive the separation.
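
Because the scatter is only a two-dimensional projection of the full feature space, it is also worth checking how much variance the two plotted components actually capture, using the fitted pca object from above:

Code
# Share of variance captured by the two components used for plotting.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())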

4.3 Inspect the original series by cluster

Bring the cluster assignments back to the raw data and lean on plot_timeseries() for an interactive review.

Code
clustered_history = stocks.merge(feature_table[["symbol", "cluster"]], on="symbol")

cluster_fig = (
    clustered_history
        .groupby("cluster")
        .plot_timeseries(
            date_column="date",
            value_column="adjusted",
            color_column="symbol",
            facet_ncol=1,
            facet_scales="free_y",
            smooth=False,
            plotly_dropdown=True,
            width=900,
            height=600,
            engine="plotly",
            title="Clustered price history (toggle clusters with the dropdown)",
        )
)

cluster_fig

Use the Plotly dropdown to focus on a single segment and drill down by ticker.
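
If you need to share the review outside a notebook, the plotly engine returns a regular Plotly figure here, so the interactive dropdown survives an export to a self-contained HTML file (a sketch, assuming that return type):

Code
# Export the interactive figure; include_plotlyjs="cdn" keeps the file small.
cluster_fig.write_html("clustered_price_history.html", include_plotlyjs="cdn")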

5 Where to go next

  • Swap in additional ts_features (e.g., hurst, stability) or engineered ratios to refine the segmentation.
  • Run KMeans across a range of k values and compare inertia or silhouette scores for a data-driven cluster count (see the sketch below).
  • Schedule the workflow: persist feature_table, then monitor how new data shifts cluster memberships over time.
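
As a starting point for the second suggestion, here is a minimal sketch of a k sweep using silhouette scores on the scaled feature matrix (illustrative only; with six tickers the usable range of k is small):

Code
# Compare silhouette scores across candidate cluster counts.
from sklearn.metrics import silhouette_score

for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init="auto", random_state=123).fit_predict(X_scaled)
    print(k, round(silhouette_score(X_scaled, labels), 3))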