
Feature drift

Feature drift is when the statistical properties of one or more individual input features change over time in production. The feature pipeline keeps running, values keep arriving — but the distribution of those values no longer matches what the model was trained on.

A user_tenure_days feature that averaged 180 days during training might average 12 days six months later as the product acquired a surge of new users. The model never saw that pattern. It silently degrades.


Feature drift vs data drift

Data drift is the umbrella term — it means the overall input distribution P(X) has shifted. Feature drift is the same concept examined at the individual feature level.

| Drift type | Scope | What it tells you |
| --- | --- | --- |
| Data drift | Entire input space P(X) | Something is wrong with the inputs |
| Feature drift | Individual feature P(x_i) | Which specific features are responsible and by how much |
| Concept drift | Output relationship P(Y|X) | Same inputs now mean something different for the prediction |
| Label drift | Target distribution P(Y) | Class balance has changed (e.g. fraud rate spikes seasonally) |

Data drift tells you something is wrong. Feature drift analysis tells you what and where. You can't fix data drift without first identifying which features drifted.

You can have data drift without concept drift — new users with different demographics but the same behavioral patterns. You can also have concept drift without feature drift — the same inputs, but user intent has shifted (a query that meant one thing in 2022 means something different today).


What feature drift looks like

When a feature drifts, its serving-time distribution separates from its training-time distribution. The model was optimized for the training shape — so the further the serving distribution moves, the less reliable its predictions become.

Typical causes:

  • Seasonal patterns — a days_since_last_purchase feature behaves differently in December vs March
  • Product changes — a new onboarding flow changes the distribution of user_tenure_days for new cohorts
  • Upstream schema changes — a partner data feed silently changes encoding (e.g. null becomes 0), shifting the null rate and mean simultaneously
  • Population shift — the model was trained on power users; it's now serving casual users with a completely different engagement profile
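The upstream-encoding case in the list above can often be caught with a check that is cheaper than any distribution test. A minimal sketch, assuming features arrive as float arrays with NaN marking missing values; `encoding_shift` and its 0.05 tolerance are illustrative, not part of any library:

```python
import numpy as np

def encoding_shift(train, serve, tolerance=0.05):
    """Compare null rate and zero rate between a training and a serving sample.

    Assumes missing values are encoded as NaN. The tolerance is an
    illustrative default, not a recommended threshold.
    """
    train = np.asarray(train, dtype=float)
    serve = np.asarray(serve, dtype=float)
    report = {
        "null_rate_delta": float(np.isnan(serve).mean() - np.isnan(train).mean()),
        "zero_rate_delta": float((serve == 0).mean() - (train == 0).mean()),
    }
    # A falling null rate paired with a rising zero rate suggests null -> 0 re-encoding.
    report["suspect_reencoding"] = (
        report["null_rate_delta"] < -tolerance
        and report["zero_rate_delta"] > tolerance
    )
    return report

# Training feed: 20% missing. Serving feed: the same rows, but nulls arrive as 0.
train = np.array([np.nan, np.nan, 5.0, 7.0, 3.0, 9.0, 4.0, 6.0, 8.0, 2.0])
serve = np.where(np.isnan(train), 0.0, train)
result = encoding_shift(train, serve)
```

Here the null rate drops by 0.2 while the zero rate rises by 0.2, so the check flags a likely re-encoding rather than a genuine change in user behavior.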

How to monitor feature drift

The three core metrics

Population Stability Index (PSI) is the industry default for threshold-based alerting. It produces a single actionable number per feature. It was originally designed for credit risk modelling and is widely adopted in financial ML and ad systems for exactly the "is this population still the same?" question.

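import numpy as np

def compute_psi(expected, actual, buckets=10):
    """Population Stability Index between a baseline and a serving sample."""
    # Bucket edges come from baseline percentiles (equal-frequency bins).
    breakpoints = np.percentile(expected, np.linspace(0, 100, buckets + 1))
    # Open the outer edges so serving values outside the training range are still counted.
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf
    exp_pct = np.histogram(expected, bins=breakpoints)[0] / len(expected)
    act_pct = np.histogram(actual, bins=breakpoints)[0] / len(actual)
    # Floor empty buckets to avoid log(0) and division by zero.
    exp_pct = np.where(exp_pct == 0, 1e-6, exp_pct)
    act_pct = np.where(act_pct == 0, 1e-6, act_pct)
    return np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))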

| PSI | Status | Action |
| --- | --- | --- |
| < 0.1 | Stable | No action. Continue scheduled monitoring. |
| 0.1 – 0.2 | Monitor | Investigate root cause. Increase alert frequency. |
| > 0.2 | Act | Trigger retrain pipeline or fallback to rule-based system. |
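
To make the thresholds concrete, here is the PSI arithmetic for a hypothetical three-bucket feature. The bucket shares are made up for illustration; the sum is the same one compute_psi evaluates.

```python
import math

# Hypothetical bucket shares (each list sums to 1).
expected = [0.5, 0.3, 0.2]   # training-time share per bucket
actual = [0.3, 0.3, 0.4]     # serving-time share per bucket

# PSI = sum over buckets of (actual - expected) * ln(actual / expected)
terms = [(a - e) * math.log(a / e) for e, a in zip(expected, actual)]
psi = sum(terms)

print([round(t, 4) for t in terms])
print(round(psi, 4))
```

The total comes to roughly 0.24, which lands in the "Act" band even though the middle bucket did not move at all. Every term is non-negative because (actual - expected) and ln(actual / expected) always share a sign, so shifts in different buckets never cancel out.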

For categorical features

PSI works on bucketed distributions. For raw categorical features, use KL divergence or Jensen-Shannon divergence:

from scipy.stats import entropy
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded in [0, 1] when using log base 2."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)
    return 0.5 * entropy(p, m, base=2) + 0.5 * entropy(q, m, base=2)

# Build category frequency distributions
train_freq = train_df['device_type'].value_counts(normalize=True)
serve_freq = serve_df['device_type'].value_counts(normalize=True)

# Align categories. A category missing from one side gets probability 0, which is
# safe for JS divergence because the mixture m is nonzero wherever p or q is.
all_cats = train_freq.index.union(serve_freq.index)
p = train_freq.reindex(all_cats, fill_value=0.0).values
q = serve_freq.reindex(all_cats, fill_value=0.0).values

jsd = js_divergence(p, q)

The monitoring pipeline

Run drift checks on a rolling window — hourly for high-stakes features, daily for stable ones. Baseline is captured at training time and stored as a reference snapshot.
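One way to capture the reference snapshot, sketched under the assumption that training features are available as a pandas DataFrame with numeric columns. `capture_baseline`, `load_baseline`, the row cap, and the .npz storage format are illustrative choices, not requirements of the pipeline:

```python
import os
import tempfile

import numpy as np
import pandas as pd

def capture_baseline(train_df, features, path, max_rows=100_000, seed=0):
    """Store a per-feature reference sample of the training data in one .npz file."""
    sample = train_df.sample(n=min(len(train_df), max_rows), random_state=seed)
    # Drop missing values per feature so downstream tests see clean arrays.
    arrays = {name: sample[name].dropna().to_numpy() for name in features}
    np.savez_compressed(path, **arrays)

def load_baseline(path):
    """Load the snapshot back as {feature_name: values}."""
    with np.load(path) as data:
        return {name: data[name] for name in data.files}

# Example with a tiny hypothetical training frame.
train_df = pd.DataFrame({
    "user_tenure_days": [200.0, 150.0, None, 180.0],
    "days_since_last_purchase": [3.0, 10.0, 7.0, 1.0],
})
path = os.path.join(tempfile.mkdtemp(), "baseline.npz")
capture_baseline(train_df, ["user_tenure_days", "days_since_last_purchase"], path)
baseline = load_baseline(path)
```

The loaded dict maps each feature name to an array of reference values, which is the shape run_drift_check takes as its baseline argument. Storing a capped sample rather than the full training set keeps the snapshot small, and versioning it alongside the model artifact means the check always compares against the data the deployed model actually saw.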

from scipy.stats import ks_2samp

def run_drift_check(baseline: dict, window: str = "24h") -> list[dict]:
    """
    Run PSI + KS drift check for every feature.
    Returns list of alerts sorted by PSI descending.
    """
    alerts = []

    for feature_name, train_dist in baseline.items():
        serve_dist = fetch_serving_window(feature_name, window=window)

        psi = compute_psi(train_dist, serve_dist)
        ks_stat, p_val = ks_2samp(train_dist, serve_dist)

        severity = (
            "critical" if psi > 0.2 else
            "warning" if psi > 0.1 else
            "ok"
        )

        alerts.append({
            "feature": feature_name,
            "psi": round(psi, 4),
            "ks_stat": round(ks_stat, 4),
            "p_value": round(p_val, 4),
            "severity": severity,
        })

    # Worst-drifting features at the top of the on-call queue
    return sorted(alerts, key=lambda x: x["psi"], reverse=True)

The output feeds a feature health dashboard — sorted by PSI descending so the worst-drifting features are always at the top of the on-call queue.


Response playbook

PSI < 0.1
→ no action, continue monitoring

PSI 0.1 – 0.2
→ increase check frequency (hourly instead of daily)
→ investigate upstream: schema changes, pipeline bugs, population shift
→ notify model owner

PSI > 0.2
→ option 1: fallback to rule-based system (remove model from serving path)
→ option 2: canary rollback to last known-good model artifact
→ option 3: trigger automated retraining on most recent data window, then promote through standard validation gate

Don't automatically retrain on drifted data without first validating that the drift is real signal, not a pipeline bug. Retraining on corrupted features makes the problem worse.
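
The playbook, including the validation step before retraining, can be encoded as a small decision function. This is only a sketch: `choose_response`, the action strings, and the `pipeline_validated` flag are hypothetical names standing in for whatever your orchestration layer uses, and pairing a fallback with retraining is one possible reading of the options above.

```python
def choose_response(psi, pipeline_validated=False):
    """Map a PSI value to playbook actions.

    pipeline_validated should be True only after a person or an automated
    check has confirmed the drift is real signal rather than a pipeline bug.
    """
    if psi < 0.1:
        return ["continue_monitoring"]
    if psi <= 0.2:
        return [
            "increase_check_frequency",
            "investigate_upstream",
            "notify_model_owner",
        ]
    # PSI > 0.2: protect serving first, then decide whether retraining is safe.
    actions = ["fallback_or_rollback"]
    if pipeline_validated:
        actions.append("trigger_retraining")
    else:
        actions.append("validate_pipeline_before_retraining")
    return actions
```

Keeping the retraining trigger behind an explicit flag means a null-to-zero encoding bug can never start a retrain on its own.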


Quick reference

| Metric | Best for | Threshold |
| --- | --- | --- |
| PSI | All features; categorical + continuous | > 0.1 warn, > 0.2 act |
| KS test | Continuous; shape-sensitive detection | p < 0.05 act |
| Wasserstein | Multimodal or heavy-tailed continuous | Feature-scale relative |
| JS divergence | Categorical with many categories | > 0.1 investigate |
| Mean / std delta | Quick sanity check on any numeric feature | > 2 stddev investigate |
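
For the Wasserstein and mean / std rows, a minimal sketch using scipy.stats.wasserstein_distance. Dividing by the baseline standard deviation is one way to make the distance comparable across features with different units; that normalization and both function names are assumptions for illustration, not an established standard.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def scaled_wasserstein(train, serve):
    """Wasserstein distance expressed in units of the baseline standard deviation."""
    train = np.asarray(train, dtype=float)
    serve = np.asarray(serve, dtype=float)
    scale = train.std()
    if scale == 0:
        return float("nan")  # undefined for a constant baseline
    return wasserstein_distance(train, serve) / scale

def mean_shift_in_std(train, serve):
    """How many baseline standard deviations the serving mean has moved."""
    train = np.asarray(train, dtype=float)
    serve = np.asarray(serve, dtype=float)
    scale = train.std()
    if scale == 0:
        return float("nan")  # undefined for a constant baseline
    return abs(serve.mean() - train.mean()) / scale

rng = np.random.default_rng(0)
train = rng.normal(loc=180, scale=30, size=5_000)  # synthetic training sample
serve = train + 90                                  # same shape, shifted by about 3 baseline std
```

For a pure location shift like this one, the two numbers agree. They diverge when the shape changes without the mean moving (for example, a second mode appearing), which is the case where Wasserstein is the more informative of the two.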