Feature drift
Feature drift is when the statistical properties of one or more individual input features change over time in production. The feature pipeline keeps running, values keep arriving — but the distribution of those values no longer matches what the model was trained on.
A user_tenure_days feature that averaged 180 days during training might average
12 days six months later as the product acquired a surge of new users. The model
never saw that pattern. It silently degrades.
Feature drift vs data drift
Data drift is the umbrella term — it means the overall input distribution P(X) has shifted. Feature drift is the same concept examined at the individual feature level.
| Type | Scope | What it tells you |
|---|---|---|
| Data drift | Entire input space P(X) | Something is wrong with the inputs |
| Feature drift | Individual feature P(x_i) | Which specific features are responsible and by how much |
| Concept drift | Output relationship P(Y\|X) | Same inputs now mean something different for the prediction |
| Label drift | Target distribution P(Y) | Class balance has changed (e.g. fraud rate spikes seasonally) |
Data drift tells you something is wrong. Feature drift analysis tells you what and where. You can't fix data drift without first identifying which features drifted.
You can have data drift without concept drift — new users with different demographics but the same behavioral patterns. You can also have concept drift without feature drift — the same inputs, but user intent has shifted (a query that meant one thing in 2022 means something different today).
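The second case, concept drift without feature drift, is easy to reproduce synthetically. The sketch below is a toy illustration only: the threshold "model", the label rules, and the sample sizes are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same input distribution P(X) at training time and serving time
x_train = rng.normal(0.0, 1.0, 10_000)
x_serve = rng.normal(0.0, 1.0, 10_000)

# The relationship P(Y|X) moves: the label threshold shifts from 0 to 1
y_train = (x_train > 0).astype(int)
y_serve = (x_serve > 1).astype(int)

def model(x):
    """Toy model that learned the training-time threshold exactly."""
    return (x > 0).astype(int)

train_acc = (model(x_train) == y_train).mean()
serve_acc = (model(x_serve) == y_serve).mean()

# Feature-level monitoring sees almost nothing: the input mean barely moves
mean_shift = abs(x_serve.mean() - x_train.mean())
```

Here the feature statistics stay flat while accuracy drops sharply, which is why feature drift checks alone cannot rule out concept drift.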
What feature drift looks like
When a feature drifts, its serving-time distribution separates from its training-time distribution. The model was optimized for the training shape — so the further the serving distribution moves, the less reliable its predictions become.
Typical causes:
- Seasonal patterns: a days_since_last_purchase feature behaves differently in December vs March
- Product changes: a new onboarding flow changes the distribution of user_tenure_days for new cohorts
- Upstream schema changes: a partner data feed silently changes encoding (e.g. null becomes 0), shifting the null rate and mean simultaneously
- Population shift: the model was trained on power users; it's now serving casual users with a completely different engagement profile
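The upstream schema case is often the cheapest to catch, because it shows up in the null rate before it shows up in any distribution metric. A minimal sketch, assuming numeric features with missing values encoded as NaN; the helper name `null_zero_rates` and the sample values are hypothetical.

```python
import numpy as np

def null_zero_rates(values):
    """Share of missing (NaN) and exactly-zero entries in a numeric feature."""
    values = np.asarray(values, dtype=float)
    return {
        "null_rate": float(np.isnan(values).mean()),
        "zero_rate": float((values == 0).mean()),
    }

# Hypothetical samples: the feed used to send null, now it sends 0
train = [5.0, np.nan, 12.0, np.nan, 30.0, 8.0, 21.0, 3.0]
serve = [5.0, 0.0, 12.0, 0.0, 30.0, 8.0, 21.0, 3.0]

before = null_zero_rates(train)
after = null_zero_rates(serve)

# A null rate that collapses while the zero rate rises by the same amount
# points at an encoding change rather than a real population shift.
null_drop = before["null_rate"] - after["null_rate"]
zero_jump = after["zero_rate"] - before["zero_rate"]
```

When `null_drop` and `zero_jump` are roughly equal and both non-trivial, it is worth checking the upstream feed before looking at the model.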
How to monitor feature drift
The three core metrics
Population Stability Index (PSI) is the industry default for threshold-based alerting. It produces a single actionable number. It was originally designed for credit risk modelling and is widely adopted in financial ML and ad systems for exactly the "is this population still the same?" question.
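```python
import numpy as np

def compute_psi(expected, actual, buckets=10):
    """Population Stability Index of `actual` against the `expected` baseline."""
    # Quantile bin edges come from the baseline (training) sample.
    breakpoints = np.unique(np.percentile(expected, np.linspace(0, 100, buckets + 1)))
    # Open the outer bins so serving values outside the training range are still counted.
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf
    exp_pct = np.histogram(expected, bins=breakpoints)[0] / len(expected)
    act_pct = np.histogram(actual, bins=breakpoints)[0] / len(actual)
    # Floor empty bins to avoid log(0) and division by zero.
    exp_pct = np.where(exp_pct == 0, 1e-6, exp_pct)
    act_pct = np.where(act_pct == 0, 1e-6, act_pct)
    return np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))
```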
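To make the arithmetic concrete, PSI is the sum over buckets of (actual share − expected share) × ln(actual share / expected share). The sketch below works a four-bucket example by hand and maps the result onto the usual 0.1 / 0.2 bands. The helper names `psi_from_shares` and `psi_status` and the bucket shares are hypothetical, chosen only for illustration.

```python
import numpy as np

def psi_from_shares(expected_pct, actual_pct, eps=1e-6):
    """PSI from two aligned vectors of bucket shares (each summing to 1)."""
    e = np.clip(np.asarray(expected_pct, dtype=float), eps, None)
    a = np.clip(np.asarray(actual_pct, dtype=float), eps, None)
    return float(np.sum((a - e) * np.log(a / e)))

def psi_status(psi):
    """Map a PSI value to the common bands: < 0.1, 0.1 to 0.2, > 0.2."""
    if psi < 0.1:
        return "stable"
    if psi <= 0.2:
        return "monitor"
    return "act"

# Baseline has equal-mass quartile buckets; serving mass has moved to the top buckets
expected = [0.25, 0.25, 0.25, 0.25]
actual = [0.10, 0.20, 0.30, 0.40]

psi = psi_from_shares(expected, actual)
status = psi_status(psi)
```

For these shares the PSI comes out at about 0.23, so the status is "act". Most of that total comes from the first and last buckets, which is typical: PSI is dominated by the buckets whose share changed most in relative terms.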
| PSI | Status | Action |
|---|---|---|
| < 0.1 | Stable | No action. Continue scheduled monitoring. |
| 0.1 – 0.2 | Monitor | Investigate root cause. Increase alert frequency. |
| > 0.2 | Act | Trigger retrain pipeline or fallback to rule-based system. |
For categorical features
PSI works on bucketed distributions. For raw categorical features, use KL divergence or Jensen-Shannon divergence:
```python
import numpy as np
from scipy.stats import entropy

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded [0, 1] when using log base 2."""
    p, q = p / p.sum(), q / q.sum()  # renormalize after the epsilon fill below
    m = 0.5 * (p + q)
    return 0.5 * entropy(p, m, base=2) + 0.5 * entropy(q, m, base=2)

# Build category frequency distributions
train_freq = train_df['device_type'].value_counts(normalize=True)
serve_freq = serve_df['device_type'].value_counts(normalize=True)

# Align categories so both vectors share the same index order
all_cats = train_freq.index.union(serve_freq.index)
p = train_freq.reindex(all_cats, fill_value=1e-6).values
q = serve_freq.reindex(all_cats, fill_value=1e-6).values

jsd = js_divergence(p, q)
```
The monitoring pipeline
Run drift checks on a rolling window — hourly for high-stakes features, daily for stable ones. Baseline is captured at training time and stored as a reference snapshot.
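One possible shape for that reference snapshot is a small JSON document holding quantile bin edges and bucket shares per feature, stored next to the model artifact. This is a sketch under assumptions, not a prescribed format: the function name `build_baseline`, the field names, and the synthetic user_tenure_days sample are all hypothetical, and storing raw training samples instead is an equally valid choice.

```python
import json
import numpy as np

def build_baseline(train_features, buckets=10):
    """Per-feature reference snapshot: quantile bin edges plus bucket shares."""
    snapshot = {}
    for name, values in train_features.items():
        values = np.asarray(values, dtype=float)
        edges = np.unique(np.percentile(values, np.linspace(0, 100, buckets + 1)))
        counts, _ = np.histogram(values, bins=edges)
        snapshot[name] = {
            "edges": edges.tolist(),
            "expected_pct": (counts / len(values)).tolist(),
            "n": int(len(values)),
        }
    return snapshot

# Synthetic training sample with a mean of roughly 180 days
rng = np.random.default_rng(42)
train_features = {"user_tenure_days": rng.gamma(shape=2.0, scale=90.0, size=5_000)}

snapshot = build_baseline(train_features)
payload = json.dumps(snapshot)   # what would be written at training time
restored = json.loads(payload)   # what the monitoring job would read back
```

Storing bucket shares keeps the snapshot small and avoids shipping raw training data to the monitoring job, at the cost of fixing the bin layout at training time.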
```python
from scipy.stats import ks_2samp

# fetch_serving_window is not defined in this article; it is assumed to return
# the feature's serving values for the given window. compute_psi is defined above.
def run_drift_check(baseline: dict, window: str = "24h") -> list[dict]:
    """
    Run PSI + KS drift check for every feature.
    Returns list of alerts sorted by PSI descending.
    """
    alerts = []
    for feature_name, train_dist in baseline.items():
        serve_dist = fetch_serving_window(feature_name, window=window)
        psi = compute_psi(train_dist, serve_dist)
        ks_stat, p_val = ks_2samp(train_dist, serve_dist)
        severity = (
            "critical" if psi > 0.2 else
            "warning" if psi > 0.1 else
            "ok"
        )
        alerts.append({
            "feature": feature_name,
            "psi": round(psi, 4),
            "ks_stat": round(ks_stat, 4),
            "p_value": round(p_val, 4),
            "severity": severity,
        })
    # Worst-drifting features at the top of the on-call queue
    return sorted(alerts, key=lambda x: x["psi"], reverse=True)
```
The output feeds a feature health dashboard — sorted by PSI descending so the worst-drifting features are always at the top of the on-call queue.
Response playbook
```text
PSI < 0.1
  → no action, continue monitoring

PSI 0.1 – 0.2
  → increase check frequency (hourly instead of daily)
  → investigate upstream: schema changes, pipeline bugs, population shift
  → notify model owner

PSI > 0.2
  → option 1: fallback to rule-based system (remove model from serving path)
  → option 2: canary rollback to last known-good model artifact
  → option 3: trigger automated retraining on most recent data window,
              then promote through standard validation gate
```
Don't automatically retrain on drifted data without first validating that the drift is real signal, not a pipeline bug. Retraining on corrupted features makes the problem worse.
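A retraining trigger can be gated on a few cheap checks that separate likely pipeline bugs from genuine drift. The sketch below is one hypothetical gate: the function name `pipeline_bug_signals`, the three heuristics, and their thresholds are assumptions for illustration and would need tuning per feature.

```python
import numpy as np

def pipeline_bug_signals(train, serve, max_out_of_range=0.05, max_scale_ratio=10.0):
    """Return reasons to suspect a pipeline bug; an empty list means none were found."""
    train = np.asarray(train, dtype=float)
    serve = np.asarray(serve, dtype=float)
    train, serve = train[~np.isnan(train)], serve[~np.isnan(serve)]
    if serve.size == 0:
        return ["serving window has no non-null values"]
    signals = []
    if np.unique(serve).size == 1:
        signals.append("serving feature is constant")
    # Share of serving values outside the range ever seen in training
    outside = np.mean((serve < train.min()) | (serve > train.max()))
    if outside > max_out_of_range:
        signals.append(f"{outside:.0%} of serving values outside training range")
    # A median that moves by more than max_scale_ratio in either direction
    # is treated as a possible unit change (e.g. seconds to milliseconds)
    t_med, s_med = np.median(train), np.median(serve)
    if t_med != 0 and s_med != 0:
        ratio = abs(s_med / t_med)
        if ratio > max_scale_ratio or ratio < 1 / max_scale_ratio:
            signals.append(f"median changed by a factor of {ratio:.3g}")
    return signals

rng = np.random.default_rng(1)
train = rng.normal(100.0, 10.0, 1_000)
shifted = rng.normal(105.0, 10.0, 1_000)   # plausible real drift
unit_bug = train * 1000.0                  # same data, wrong unit

ok_to_retrain = not pipeline_bug_signals(train, shifted)
blocked = pipeline_bug_signals(train, unit_bug)
```

An empty list does not prove the drift is real. It only means these particular heuristics found nothing, so a human review step before promotion is still reasonable.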
Quick reference
| Metric | Best for | Threshold |
|---|---|---|
| PSI | All features; categorical + continuous | > 0.1 warn, > 0.2 act |
| KS test | Continuous; shape-sensitive detection | p < 0.05 act |
| Wasserstein | Multimodal or heavy-tailed continuous | Feature-scale relative |
| JS divergence | Categorical with many categories | > 0.1 investigate |
| Mean / std delta | Quick sanity check on any numeric feature | > 2 stddev investigate |
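The KS, Wasserstein, and mean/std rows can each be computed in a few lines with SciPy and NumPy. This is a sketch on synthetic data: the distributions and seed are made up, and dividing the Wasserstein distance by the baseline standard deviation is just one assumed way to make it relative to feature scale.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(7)
train = rng.normal(180.0, 40.0, 5_000)   # baseline sample
serve = rng.normal(60.0, 30.0, 5_000)    # serving sample after a large shift

# Kolmogorov-Smirnov: largest gap between the two empirical CDFs
ks_stat, p_value = ks_2samp(train, serve)

# Wasserstein distance is in the feature's own units, so scale it by baseline spread
w_rel = wasserstein_distance(train, serve) / train.std()

# Mean delta expressed in baseline standard deviations
mean_delta = abs(serve.mean() - train.mean()) / train.std()
```

With thousands of rows per window, the KS p-value tends to fall below 0.05 even for very small shifts, so the KS statistic itself is often the more informative number to threshold on.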