Feature Store & Caching

Persist expensive feature engineering once, reuse it everywhere. The feature store bundled with pytimetk ≥ 2.0 lets you register reusable transforms, materialize results to disk (or any pyarrow-compatible object store), and reload them in downstream notebooks, jobs, or ML pipelines.

1 Why Use the Feature Store?

Teams building time-series models (forecasting, anomaly detection, policy simulation) often compute the same feature sets—calendar signatures, lag stacks, rolling stats—across notebooks, pipelines, and model retrains.

The feature store lets them register those transforms once, materialize them to disk or a shared URI, and reload them instantly later. That avoids re-running expensive calculations, keeps metadata/versioning consistent, and makes it easy to assemble feature matrices across multiple transforms for downstream modeling.

2 Benefits

  • Avoid repeated work – cache signatures, lag stacks, rolling stats, and any custom transform.
  • Share across teams – store artifacts on a shared file system or S3/GCS/Azure using pyarrow.fs.
  • Track metadata automatically – every build records parameters, row counts, hashes, timestamps, and version info.
  • Coordinate writers – optional file locks prevent conflicting writes when multiple jobs run the same pipeline.

3 Quickstart (Pandas)

Code
import pandas as pd
import pytimetk as tk
from pathlib import Path

sales = tk.load_dataset("bike_sales_sample", parse_dates=["order_date"])

feature_store_root = Path("feature-store-demo").resolve()
store = tk.FeatureStore(root_path=feature_store_root)

store.register(
    "sales_signature",
    lambda df: tk.augment_timeseries_signature(
        df,
        date_column="order_date",
        engine="pandas",
    ),
    default_key_columns=("order_id",),
    description="Calendar features for order history.",
)

signature = store.build("sales_signature", sales, return_engine="pandas")
signature.from_cache, signature.metadata.row_count, signature.metadata.column_count
(True, 2466, 42)

On a fresh store the first build runs the transform and reports from_cache=False; the True above means the artifact already existed from an earlier render of this page. Build again with the same data and parameters and the store serves the cached artifact:

Code
signature_cached = store.build("sales_signature", sales, return_engine="pandas")
signature_cached.from_cache
True
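
Caching keys on fingerprints of both the input data and the registered transform (the data_fingerprint and transform_fingerprint fields shown in Section 5 below), so changing either should force a rebuild. A minimal illustrative sketch:

Code
# Different input data produces a different data fingerprint, so the
# store recomputes instead of serving the cache.
signature_subset = store.build(
    "sales_signature", sales.head(1000), return_engine="pandas"
)
signature_subset.from_cache  # expected: False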

4 Reusing Features Later

Cached metadata and artifacts can be loaded in another process by pointing a new store at the same catalog folder:

Code
fresh_store = tk.FeatureStore(root_path=store.root_path)
reloaded = fresh_store.load("sales_signature", return_engine="pandas")
reloaded.metadata.version, reloaded.metadata.storage_path
('1b8b4a5044cb',
 '/Users/mdancho/Desktop/software/pytimetk/docs/guides/feature-store-demo/sales_signature/1b8b4a5044cb/features.parquet')
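
Because artifacts are stored as plain parquet files, any parquet reader can open them directly without going through the store:

Code
import pandas as pd

# Read the cached artifact straight from its storage path.
features_df = pd.read_parquet(reloaded.metadata.storage_path)
features_df.shape  # (2466, 42)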

5 Inspecting the Catalog

Code
store.list_feature_sets()
               name       version                        created_at    key_columns  row_count  column_count                             description
1   daily_signature  b5737af06b34  2025-10-13T00:50:11.564352+00:00  (order_date,)        177            31  Calendar signatures on daily revenue.
0   sales_signature  1b8b4a5044cb  2025-10-13T00:50:11.497343+00:00    (order_id,)       2466            42   Calendar features for order history.
2  seasonal_fourier  09c5058b609f  2025-10-13T00:50:11.571455+00:00  (order_date,)        177             8                                    None

3 rows × 21 columns (all three artifacts are local parquet files; cache keys,
storage paths, fingerprints, transform details, tags, and package versions
are truncated here for readability)

Code
store.describe("sales_signature")
FeatureSetMetadata(
    name='sales_signature',
    version='1b8b4a5044cb',
    cache_key='1b8b4a5044cb0c642be6f3711fecff93dad660c789a3382cbf74c089e5e3be49',
    storage_path='/Users/mdancho/Desktop/software/pytimetk/docs/guides/feature-store-demo/sales_signature/1b8b4a5044cb/features.parquet',
    storage_format='parquet',
    storage_backend='local',
    artifact_uri='/Users/mdancho/Desktop/software/pytimetk/docs/guides/feature-store-demo',
    created_at='2025-10-13T00:50:11.497343+00:00',
    data_fingerprint='58f3182c6e1e814ebb65cdddeed4f694e087fcb429fb86c8ad3cc634d3355f23',
    transform_fingerprint='072694e4aac0e5fb6f354672b1a1125da9bf127a623db084a6a8c9c7f39cb202',
    transform_module='__main__',
    transform_name='<lambda>',
    transform_kwargs={},
    pytimetk_version='1.2.5.9000',
    package_versions={'pandas': '2.2.3', 'polars': '1.32.3'},
    tags=(),
    description='Calendar features for order history.',
    key_columns=('order_id',),
    row_count=2466,
    column_count=42,
    extra_metadata={}
)
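
Because list_feature_sets() returns an ordinary pandas DataFrame, you can filter the catalog like any other frame, for example to look up a single feature set:

Code
# The catalog is a regular pandas DataFrame, so filter it like one.
catalog = store.list_feature_sets()
catalog.loc[
    catalog["name"] == "sales_signature",
    ["version", "row_count", "column_count"],
]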

6 Using the .tk Accessor (Polars)

Code
import polars as pl

pl_sales = pl.from_pandas(sales)

accessor = pl_sales.tk.feature_store(root_path=feature_store_root)

def centered(data: pl.DataFrame) -> pl.DataFrame:
    return data.with_columns(
        (pl.col("total_price") - pl.col("total_price").mean()).alias("total_price_centered"),
    )

accessor.register("sales_centered", centered, default_key_columns=("order_id",))

centered_features = accessor.build("sales_centered")
centered_features.from_cache, centered_features.metadata.storage_backend
(False, 'local')
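
The accessor wraps an ordinary FeatureStore (exposed as accessor.store, used again in Section 11), so features built through it can be reloaded later like any other feature set. A sketch, assuming return_engine="polars" mirrors the "pandas" option shown earlier:

Code
# Reopen the same catalog later and load the Polars-built feature set.
polars_store = tk.FeatureStore(root_path=feature_store_root)
centered_reloaded = polars_store.load("sales_centered", return_engine="polars")
centered_reloaded.metadata.row_count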

7 Remote Artifact Storage

Artifacts can live in any pyarrow filesystem. Point artifact_uri at your bucket, folder, or lakehouse; the catalog (metadata and locks) continues to live locally under root_path.

Code
# A local file:// URI stands in for a real bucket URI in this demo.
artifact_uri = Path("remote-artifacts").resolve().as_uri()

s3_store = tk.FeatureStore(
    root_path=feature_store_root / "remote-catalog",  # metadata & locks
    artifact_uri=artifact_uri,            # where feature files live
    lock_timeout=60.0,                    # optional: adjust lock timing
)

s3_store.register(
    "sales_rolling",
    lambda df: tk.augment_rolling(
        df,
        date_column="order_date",
        value_column="total_price",
        window=7,
        window_func=["mean", "max"],
        engine="pandas",
    ),
    default_key_columns=("order_id",),
)

rolling = s3_store.build("sales_rolling", sales, return_engine="pandas")
rolling.metadata.storage_backend, rolling.metadata.storage_path
('pyarrow',
 'file:///Users/mdancho/Desktop/software/pytimetk/docs/guides/remote-artifacts/sales_rolling/644b3b483234/features.parquet')

If the URI uses the file:// scheme the same mechanism works for network file shares or mounted blob storage.
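
To use an actual object store, point artifact_uri at the bucket instead. A sketch with a hypothetical S3 bucket:

Code
# "my-feature-bucket" is a hypothetical bucket; pyarrow resolves s3:// URIs
# via pyarrow.fs.S3FileSystem using credentials from the environment.
bucket_store = tk.FeatureStore(
    root_path=feature_store_root / "s3-catalog",  # catalog stays local
    artifact_uri="s3://my-feature-bucket/pytimetk-features",
)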

8 Coordinated Writes with Locks

By default, FeatureStore writes lock files under <root_path>/.locks so parallel jobs do not trample each other. Locks are scoped per feature name:

Code
store.enable_locking, (store.root_path / ".locks").iterdir()
(True, <generator object Path.iterdir at 0x106482dc0>)
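
The generator above does not display its contents; materialize it to see the per-feature lock files:

Code
# Materialize the generator to list lock files by name.
sorted(path.name for path in (store.root_path / ".locks").iterdir())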

Disable locking if your workflow guarantees exclusive writers:

Code
no_lock_store = tk.FeatureStore(root_path="tmp-store", enable_locking=False)
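
If you keep locking enabled but have long-running writers, the lock_timeout argument (set in the remote-storage example above) tunes how long lock acquisition waits. A sketch:

Code
# Allow up to two minutes for a competing writer to release a lock.
patient_store = tk.FeatureStore(root_path=feature_store_root, lock_timeout=120.0)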

9 Assemble Multiple Feature Sets

Code
# Aggregate to a daily view so Fourier terms see a regular cadence.
store.register(
    "daily_signature",
    lambda df: (
        df.groupby("order_date", as_index=False)
        .agg({"total_price": "sum"})
        .augment_timeseries_signature(
            date_column="order_date",
            engine="pandas",
        )
    ),
    default_key_columns=("order_date",),
    description="Calendar signatures on daily revenue.",
)

store.register(
    "seasonal_fourier",
    lambda df: (
        df.groupby("order_date", as_index=False)
        .agg({"total_price": "sum"})
        .augment_fourier(
            date_column="order_date",
            periods=7,
            max_order=3,
            engine="pandas",
        )
    ),
    default_key_columns=("order_date",),
)

store.build("daily_signature", sales, return_engine="pandas")
store.build("seasonal_fourier", sales, return_engine="pandas")

assembled = store.assemble(
    ["daily_signature", "seasonal_fourier"],
    join_keys=("order_date",),
    return_engine="pandas",
)
assembled.metadata.row_count, assembled.metadata.column_count
(177, 38)

The assembled result is in-memory only (no artifact is written), but it carries metadata so you can track how it was produced. The column count works out as expected: daily_signature contributes 31 columns and seasonal_fourier 8; joining on the shared order_date key drops one duplicate, leaving 31 + 8 - 1 = 38.

10 MLflow Integration

Record feature versions alongside your models with MLflow so every run knows exactly which cached artifact it depended on.

Code
try:
    import mlflow

    from pytimetk.feature_store import (
        build_features_with_mlflow,
        load_features_from_mlflow,
    )
except Exception as exc:
    print("Skipping MLflow example:", exc)
else:
    mlflow_tracking = Path("mlruns").resolve().as_uri()
    mlflow.set_tracking_uri(mlflow_tracking)

    with mlflow.start_run() as train_run:
        signature_mlflow = build_features_with_mlflow(
            store,
            "sales_signature",
            sales,
            params_prefix="train",
            return_engine="pandas",
        )
        signature_mlflow.metadata.version

    reloaded = load_features_from_mlflow(
        store,
        "sales_signature",
        run_id=train_run.info.run_id,
        params_prefix="train",
        return_engine="pandas",
    )
    reloaded.metadata.version

11 Cleaning Up

Remove a feature set (optionally keeping the cached artifact):

Code
# Remove the Polars feature set via its accessor store.
accessor.store.drop("sales_centered")        # deletes metadata and artifact
# Drop a pandas feature set but keep the cached artifact for reuse.
store.drop("sales_signature", delete_artifact=False)

For remote backends, the catalog metadata remains local while dropped artifacts are removed from the remote filesystem via pyarrow.fs.
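
As a final check, the catalog should no longer list the dropped feature sets:

Code
# Remaining feature sets after cleanup.
store.list_feature_sets()["name"].tolist()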