Feature Store & Caching

Persist expensive feature engineering once, reuse it everywhere. The feature store bundled with pytimetk ≥ 2.0 lets you register reusable transforms, materialize results to disk (or any pyarrow-compatible object store), and reload them in downstream notebooks, jobs, or ML pipelines.

1 Why Use the Feature Store?

Teams building time-series models (forecasting, anomaly detection, policy simulation) often compute the same feature sets—calendar signatures, lag stacks, rolling stats—across notebooks, pipelines, and model retrains.

The feature store lets them register those transforms once, materialize them to disk or a shared URI, and reload them instantly later. That avoids re-running expensive calculations, keeps metadata/versioning consistent, and makes it easy to assemble feature matrices across multiple transforms for downstream modeling.

2 Benefits

  • Avoid repeated work – cache signatures, lag stacks, rolling stats, and any custom transform.
  • Share across teams – store artifacts on a shared file system or S3/GCS/Azure using pyarrow.fs.
  • Track metadata automatically – every build records parameters, row counts, hashes, timestamps, and version info.
  • Coordinate writers – optional file locks prevent conflicting writes when multiple jobs run the same pipeline.

3 Quickstart (Pandas)

Code
import pandas as pd
import pytimetk as tk
from pathlib import Path

sales = tk.load_dataset("bike_sales_sample", parse_dates=["order_date"])

feature_store_root = Path("feature-store-demo").resolve()
store = tk.FeatureStore(root_path=feature_store_root)

store.register(
    "sales_signature",
    lambda df: tk.augment_timeseries_signature(
        df,
        date_column="order_date",
        engine="pandas",
    ),
    default_key_columns=("order_id",),
    description="Calendar features for order history.",
)

signature = store.build("sales_signature", sales, return_engine="pandas")
signature.from_cache, signature.metadata.row_count, signature.metadata.column_count
(True, 2466, 42)

On a fresh store the first build runs the transform and reports from_cache=False; the True above means the artifact already existed from an earlier render of this page. Build again with the same data and parameters and the store serves the cached artifact:

Code
signature_cached = store.build("sales_signature", sales, return_engine="pandas")
signature_cached.from_cache
True
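
Caching keys on fingerprints of both the input data and the registered transform (the data_fingerprint and transform_fingerprint fields shown in Section 5 below), so changing either should force a rebuild. A minimal illustrative sketch:

Code
# Different input data produces a different data fingerprint, so the
# store recomputes instead of serving the cache.
signature_subset = store.build(
    "sales_signature", sales.head(1000), return_engine="pandas"
)
signature_subset.from_cache  # expected: False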

4 Reusing Features Later

Cached metadata and artifacts can be loaded in another process by pointing a new store at the same catalog folder:

Code
fresh_store = tk.FeatureStore(root_path=store.root_path)
reloaded = fresh_store.load("sales_signature", return_engine="pandas")
reloaded.metadata.version, reloaded.metadata.storage_path
('1b8b4a5044cb',
 '/Users/mdancho/Desktop/software/pytimetk/docs/guides/feature-store-demo/sales_signature/1b8b4a5044cb/features.parquet')
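
Because artifacts are stored as plain parquet files, any parquet reader can open them directly without going through the store:

Code
import pandas as pd

# Read the cached artifact straight from its storage path.
features_df = pd.read_parquet(reloaded.metadata.storage_path)
features_df.shape  # (2466, 42)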

5 Inspecting the Catalog

Code
store.list_feature_sets()
               name       version                        created_at    key_columns  row_count  column_count                             description
1   daily_signature  b5737af06b34  2025-10-13T00:50:11.564352+00:00  (order_date,)        177            31  Calendar signatures on daily revenue.
0   sales_signature  1b8b4a5044cb  2025-10-13T00:50:11.497343+00:00    (order_id,)       2466            42   Calendar features for order history.
2  seasonal_fourier  09c5058b609f  2025-10-13T00:50:11.571455+00:00  (order_date,)        177             8                                    None

3 rows × 21 columns (all three artifacts are local parquet files; cache keys,
storage paths, fingerprints, transform details, tags, and package versions
are truncated here for readability)

Code
store.describe("sales_signature")
FeatureSetMetadata(
    name='sales_signature',
    version='1b8b4a5044cb',
    cache_key='1b8b4a5044cb0c642be6f3711fecff93dad660c789a3382cbf74c089e5e3be49',
    storage_path='/Users/mdancho/Desktop/software/pytimetk/docs/guides/feature-store-demo/sales_signature/1b8b4a5044cb/features.parquet',
    storage_format='parquet',
    storage_backend='local',
    artifact_uri='/Users/mdancho/Desktop/software/pytimetk/docs/guides/feature-store-demo',
    created_at='2025-10-13T00:50:11.497343+00:00',
    data_fingerprint='58f3182c6e1e814ebb65cdddeed4f694e087fcb429fb86c8ad3cc634d3355f23',
    transform_fingerprint='072694e4aac0e5fb6f354672b1a1125da9bf127a623db084a6a8c9c7f39cb202',
    transform_module='__main__',
    transform_name='<lambda>',
    transform_kwargs={},
    pytimetk_version='1.2.5.9000',
    package_versions={'pandas': '2.2.3', 'polars': '1.32.3'},
    tags=(),
    description='Calendar features for order history.',
    key_columns=('order_id',),
    row_count=2466,
    column_count=42,
    extra_metadata={}
)
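
Because list_feature_sets() returns an ordinary pandas DataFrame, you can filter the catalog like any other frame, for example to look up a single feature set:

Code
# The catalog is a regular pandas DataFrame, so filter it like one.
catalog = store.list_feature_sets()
catalog.loc[
    catalog["name"] == "sales_signature",
    ["version", "row_count", "column_count"],
]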

6 Using the .tk Accessor (Polars)

Code
import polars as pl

pl_sales = pl.from_pandas(sales)

accessor = pl_sales.tk.feature_store(root_path=feature_store_root)

def centered(data: pl.DataFrame) -> pl.DataFrame:
    return data.with_columns(
        (pl.col("total_price") - pl.col("total_price").mean()).alias("total_price_centered"),
    )

accessor.register("sales_centered", centered, default_key_columns=("order_id",))

centered_features = accessor.build("sales_centered")
centered_features.from_cache, centered_features.metadata.storage_backend
(False, 'local')
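
The accessor wraps an ordinary FeatureStore (exposed as accessor.store, used again in Section 11), so features built through it can be reloaded later like any other feature set. A sketch, assuming return_engine="polars" mirrors the "pandas" option shown earlier:

Code
# Reopen the same catalog later and load the Polars-built feature set.
polars_store = tk.FeatureStore(root_path=feature_store_root)
centered_reloaded = polars_store.load("sales_centered", return_engine="polars")
centered_reloaded.metadata.row_count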

7 Remote Artifact Storage

Artifacts can live in any pyarrow filesystem. Point artifact_uri at your bucket, folder, or lakehouse; the catalog (metadata and locks) continues to live locally under root_path.

Code
# A local file:// URI stands in for a real bucket URI in this demo.
artifact_uri = Path("remote-artifacts").resolve().as_uri()

s3_store = tk.FeatureStore(
    root_path=feature_store_root / "remote-catalog",  # metadata & locks
    artifact_uri=artifact_uri,            # where feature files live
    lock_timeout=60.0,                    # optional: adjust lock timing
)

s3_store.register(
    "sales_rolling",
    lambda df: tk.augment_rolling(
        df,
        date_column="order_date",
        value_column="total_price",
        window=7,
        window_func=["mean", "max"],
        engine="pandas",
    ),
    default_key_columns=("order_id",),
)

rolling = s3_store.build("sales_rolling", sales, return_engine="pandas")
rolling.metadata.storage_backend, rolling.metadata.storage_path
('pyarrow',
 'file:///Users/mdancho/Desktop/software/pytimetk/docs/guides/remote-artifacts/sales_rolling/644b3b483234/features.parquet')

If the URI uses the file:// scheme the same mechanism works for network file shares or mounted blob storage.
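
To use an actual object store, point artifact_uri at the bucket instead. A sketch with a hypothetical S3 bucket:

Code
# "my-feature-bucket" is a hypothetical bucket; pyarrow resolves s3:// URIs
# via pyarrow.fs.S3FileSystem using credentials from the environment.
bucket_store = tk.FeatureStore(
    root_path=feature_store_root / "s3-catalog",  # catalog stays local
    artifact_uri="s3://my-feature-bucket/pytimetk-features",
)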

8 Coordinated Writes with Locks

By default, FeatureStore writes lock files under <root_path>/.locks so parallel jobs do not trample each other. Locks are scoped per feature name:

Code
store.enable_locking, (store.root_path / ".locks").iterdir()
(True, <generator object Path.iterdir at 0x106482dc0>)
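
The generator above does not display its contents; materialize it to see the per-feature lock files:

Code
# Materialize the generator to list lock files by name.
sorted(path.name for path in (store.root_path / ".locks").iterdir())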

Disable locking if your workflow guarantees exclusive writers:

Code
no_lock_store = tk.FeatureStore(root_path="tmp-store", enable_locking=False)
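
If you keep locking enabled but have long-running writers, the lock_timeout argument (set in the remote-storage example above) tunes how long lock acquisition waits. A sketch:

Code
# Allow up to two minutes for a competing writer to release a lock.
patient_store = tk.FeatureStore(root_path=feature_store_root, lock_timeout=120.0)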

9 Assemble Multiple Feature Sets

Code
# Aggregate to a daily view so Fourier terms see a regular cadence.
store.register(
    "daily_signature",
    lambda df: (
        df.groupby("order_date", as_index=False)
        .agg({"total_price": "sum"})
        .augment_timeseries_signature(
            date_column="order_date",
            engine="pandas",
        )
    ),
    default_key_columns=("order_date",),
    description="Calendar signatures on daily revenue.",
)

store.register(
    "seasonal_fourier",
    lambda df: (
        df.groupby("order_date", as_index=False)
        .agg({"total_price": "sum"})
        .augment_fourier(
            date_column="order_date",
            periods=7,
            max_order=3,
            engine="pandas",
        )
    ),
    default_key_columns=("order_date",),
)

store.build("daily_signature", sales, return_engine="pandas")
store.build("seasonal_fourier", sales, return_engine="pandas")

assembled = store.assemble(
    ["daily_signature", "seasonal_fourier"],
    join_keys=("order_date",),
    return_engine="pandas",
)
assembled.metadata.row_count, assembled.metadata.column_count
(177, 38)

The assembled result is in-memory only (no artifact is written), but it carries metadata so you can track how it was produced. The column count works out as expected: daily_signature contributes 31 columns and seasonal_fourier 8; joining on the shared order_date key drops one duplicate, leaving 31 + 8 - 1 = 38.

10 MLflow Integration

Record feature versions alongside your models with MLflow so every run knows exactly which cached artifact it depended on.

Code
try:
    import mlflow

    from pytimetk.feature_store import (
        build_features_with_mlflow,
        load_features_from_mlflow,
    )
except Exception as exc:
    print("Skipping MLflow example:", exc)
else:
    mlflow_tracking = Path("mlruns").resolve().as_uri()
    mlflow.set_tracking_uri(mlflow_tracking)

    with mlflow.start_run() as train_run:
        signature_mlflow = build_features_with_mlflow(
            store,
            "sales_signature",
            sales,
            params_prefix="train",
            return_engine="pandas",
        )
        signature_mlflow.metadata.version

    reloaded = load_features_from_mlflow(
        store,
        "sales_signature",
        run_id=train_run.info.run_id,
        params_prefix="train",
        return_engine="pandas",
    )
    reloaded.metadata.version

11 Cleaning Up

Remove a feature set (optionally keeping the cached artifact):

Code
# Remove the Polars feature set via its accessor store.
accessor.store.drop("sales_centered")        # deletes metadata and artifact
# Drop a pandas feature set but keep the cached artifact for reuse.
store.drop("sales_signature", delete_artifact=False)

For remote backends, the catalog metadata remains local while dropped artifacts are removed from the remote filesystem via pyarrow.fs.
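
As a final check, the catalog should no longer list the dropped feature sets:

Code
# Remaining feature sets after cleanup.
store.list_feature_sets()["name"].tolist()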