
ML Observability — Core Concepts


1. Data Drift and Concept Drift

What is the difference between data drift and concept drift?

| Type | Definition | What changes |
| --- | --- | --- |
| Data drift | Change in the distribution of input features, P(X) | The inputs themselves |
| Concept drift | Change in the statistical relationship between inputs and target, P(Y\|X) | What the inputs mean for the prediction |

A model can experience data drift without concept drift (new users with different demographics, same behavior patterns) or concept drift without data drift (same inputs, but user intent has shifted — e.g. a query that meant one thing in 2022 means something different in 2024).
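The two cases can be separated on synthetic data. The sketch below is illustrative only: the Gaussian inputs, the `label` rule, and the thresholds are all assumptions chosen to make the effect visible, not a model of any real system.

```python
import random
from statistics import mean

random.seed(0)

def label(x, threshold):
    """Toy stand-in for P(Y|X): y = 1 when x exceeds the threshold."""
    return 1 if x > threshold else 0

# Reference period: inputs ~ N(0, 1), labeling threshold 0.5
ref_x = [random.gauss(0.0, 1.0) for _ in range(5000)]
ref_y = [label(x, 0.5) for x in ref_x]

# Data drift only: inputs shift to N(1, 1), labeling rule unchanged
dd_x = [random.gauss(1.0, 1.0) for _ in range(5000)]
dd_y = [label(x, 0.5) for x in dd_x]

# Concept drift only: same input distribution, threshold moves to -0.5
cd_x = [random.gauss(0.0, 1.0) for _ in range(5000)]
cd_y = [label(x, -0.5) for x in cd_x]

def positive_rate_in_band(xs, ys, lo=0.0, hi=0.4):
    """Estimate P(Y=1 | lo <= X < hi), one local slice of the conditional."""
    return mean(y for x, y in zip(xs, ys) if lo <= x < hi)

# How far the input mean moved relative to the reference period
input_shift_dd = abs(mean(dd_x) - mean(ref_x))
input_shift_cd = abs(mean(cd_x) - mean(ref_x))

print("data drift:    input shift", round(input_shift_dd, 2),
      "| P(Y=1 | band)", positive_rate_in_band(dd_x, dd_y))
print("concept drift: input shift", round(input_shift_cd, 2),
      "| P(Y=1 | band)", positive_rate_in_band(cd_x, cd_y))
```

In the data-drift case the inputs move while the conditional slice matches the reference; in the concept-drift case the inputs look unchanged while the same slice of inputs now maps to a different label. Input-only monitoring would miss the second case.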


How do you detect drift for high-dimensional or categorical features?

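Kolmogorov-Smirnov (KS) test: measures the maximum vertical distance between the empirical cumulative distribution functions of two samples. It suits univariate, continuous features. Categorical features need a frequency-based comparison such as PSI, and high-dimensional inputs are commonly checked one feature at a time.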

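from scipy.stats import ks_2samp

# train_values, serving_values: 1-D arrays holding one numeric feature
stat, p_value = ks_2samp(train_values, serving_values)
# Reject "same distribution" at the 5% level. With very large samples even
# tiny shifts become significant, so check the size of `stat` as well.
drifted = p_value < 0.05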

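Wasserstein distance (Earth Mover's Distance): measures how much "work" it takes to transform one distribution into another. Because it accounts for how far the probability mass moves, not only the largest gap between the CDFs, it can register shifts (for example in the tails) that barely change the KS statistic. The result is in the feature's own units, so thresholds have to be set per feature.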

from scipy.stats import wasserstein_distance

distance = wasserstein_distance(train_values, serving_values)
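KS and Wasserstein both assume an ordered numeric feature. For a categorical feature, one option is to compare category frequencies with PSI, the same statistic this guide uses for rolling-window drift and severity thresholds. The following is a minimal sketch under stated assumptions: `categorical_psi`, the `eps` smoothing constant, and the sample data are hypothetical, not taken from a specific library.

```python
import math
from collections import Counter

def categorical_psi(expected, actual, eps=1e-6):
    """Population Stability Index between two samples of a categorical feature.

    eps is an assumed smoothing floor that keeps the logarithm finite when a
    category is absent from one of the two samples.
    """
    exp_counts, act_counts = Counter(expected), Counter(actual)
    psi = 0.0
    for category in set(exp_counts) | set(act_counts):
        e = max(exp_counts[category] / len(expected), eps)
        a = max(act_counts[category] / len(actual), eps)
        psi += (a - e) * math.log(a / e)
    return psi

# Hypothetical device_type samples from training and serving
train_devices = ["mobile"] * 70 + ["desktop"] * 25 + ["tablet"] * 5
serving_devices = ["mobile"] * 50 + ["desktop"] * 45 + ["tablet"] * 5

psi_value = categorical_psi(train_devices, serving_devices)
print(f"PSI for device_type: {psi_value:.3f}")
```

For high-dimensional inputs, a simple starting point is to run a per-feature statistic like this over each column and report the features with the largest values.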

How do you differentiate between an anomaly, an outage, and gradual drift?

| Type | Pattern | Detection approach |
| --- | --- | --- |
| Anomaly | Point-in-time spike or error | Static threshold alerting, z-score |
| Outage | Sudden drop to zero in traffic or data flow | Heartbeat checks, missing data alerts |
| Gradual drift | Slow, systemic shift in underlying data | Moving average baselines, PSI over rolling windows |

Observability tools distinguish these by combining alerting thresholds (for anomalies and outages) with trend detection over time (for drift). A spike that resolves in minutes is an anomaly. A metric that has been slowly declining over two weeks is drift.
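The three patterns can be told apart with very simple checks on a metric's time series. This is a rough sketch, not a production detector: `classify_series`, its window size, and its thresholds are assumptions, and it assumes a positive metric sampled at regular intervals.

```python
from statistics import mean, stdev

def classify_series(values, window=7, z_threshold=3.0, drift_threshold=0.05):
    """Label the latest behavior of a metric: outage, anomaly, drift, or normal."""
    history, latest = values[:-1], values[-1]

    # Outage: the signal has stopped entirely (e.g. zero traffic).
    if latest == 0:
        return "outage"

    # Anomaly: the latest point sits far from the historical mean (z-score).
    mu, sigma = mean(history), stdev(history)
    if sigma > 0 and abs(latest - mu) / sigma > z_threshold:
        return "anomaly"

    # Drift: the recent moving average has moved away from the earliest window.
    baseline = mean(values[:window])
    recent = mean(values[-window:])
    if baseline != 0 and abs(recent - baseline) / abs(baseline) > drift_threshold:
        return "drift"

    return "normal"

steady = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100, 100, 101, 99, 100, 101]
print(classify_series(steady))                     # stable metric
print(classify_series(steady[:-1] + [130]))        # single spike
print(classify_series(steady[:-1] + [0]))          # traffic drops to zero
print(classify_series(list(range(100, 85, -1))))   # slow decline of one unit per step
```

The order of the checks matters: a drop to zero would also produce a large z-score, so the outage check runs first.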


How do you handle drift once it's detected?

Responses scale with severity:

PSI < 0.1 → no action, continue monitoring
PSI 0.1–0.2 → increase alert frequency, investigate root cause
PSI > 0.2 → trigger automated response
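The thresholds above translate directly into a small routing function. In this sketch the action names, `drift_action`, `worst_action`, and the per-feature PSI values are hypothetical; only the 0.1 and 0.2 cut-offs come from the list above.

```python
# Actions ordered from least to most severe
SEVERITY_ORDER = ["monitor", "investigate", "automated_response"]

def drift_action(psi):
    """Map a single PSI value to the response tier listed above."""
    if psi < 0.1:
        return "monitor"
    if psi <= 0.2:
        return "investigate"
    return "automated_response"

def worst_action(psi_by_feature):
    """Pick the most severe action across all monitored features."""
    actions = [drift_action(value) for value in psi_by_feature.values()]
    return max(actions, key=SEVERITY_ORDER.index)

# Hypothetical per-feature PSI values from the latest rolling window
feature_psi = {"user_tenure_days": 0.04, "device_type": 0.18}
overall = worst_action(feature_psi)
print(overall)
```

Taking the worst action across features is one possible design choice; a team could equally weight features by importance before deciding.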

Automated responses include:

  1. Fallback to a rule-based system — remove the model from the serving path until the issue is resolved
  2. Route traffic to a previous stable version — canary rollback to the last known-good model artifact
  3. Trigger an automated retraining pipeline — retrain on the most recent data window and promote through the standard validation gate

Don't automatically retrain on drifted data without first validating that the drift is real signal, not a pipeline bug. Retraining on corrupted features makes the problem worse.
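One way to enforce that warning is a sanity gate that runs before the retraining pipeline is triggered. The sketch below is an assumption about what such a gate might check (null rates and plausible value ranges); `validate_before_retrain`, its parameters, and the sample rows are hypothetical.

```python
def validate_before_retrain(rows, required, max_null_rate=0.01, ranges=None):
    """Return a list of problems; an empty list means retraining may proceed."""
    problems = []
    ranges = ranges or {}
    for feature in required:
        values = [row.get(feature) for row in rows]

        # A jump in nulls usually points at a broken upstream join or schema change.
        null_rate = sum(v is None for v in values) / len(values)
        if null_rate > max_null_rate:
            problems.append(
                f"{feature}: null rate {null_rate:.1%} exceeds {max_null_rate:.1%}"
            )

        # Out-of-range values often indicate a unit or encoding bug, not real drift.
        lo, hi = ranges.get(feature, (None, None))
        present = [v for v in values if v is not None]
        if lo is not None and any(v < lo or v > hi for v in present):
            problems.append(f"{feature}: values outside [{lo}, {hi}]")
    return problems

clean = [{"user_tenure_days": 42, "device_type": "mobile"}] * 100
# Hypothetical bug: tenure suddenly arrives in seconds instead of days
buggy = clean[:95] + [{"user_tenure_days": 3_628_800, "device_type": "mobile"}] * 5

print(validate_before_retrain(clean, ["user_tenure_days"],
                              ranges={"user_tenure_days": (0, 36_500)}))
print(validate_before_retrain(buggy, ["user_tenure_days"],
                              ranges={"user_tenure_days": (0, 36_500)}))
```

If the list is non-empty, the safer responses are the fallback or rollback options rather than retraining.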


2. Metrics and Telemetry — The Three Pillars

What are the three pillars of ML observability?

These mirror the three pillars of DevOps observability (metrics, logs, traces), extended for ML:

1. Metrics

Aggregated, time-series data that summarizes system behavior.

# Examples
- inference_latency_p99_ms
- request_throughput_rps
- gpu_utilization_pct
- prediction_confidence_mean
- feature_null_rate_pct
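Metrics like these are aggregates over raw per-request records. A minimal sketch of that aggregation follows; the record fields (`latency_ms`, `confidence`, `user_tenure_days`) and the nearest-rank percentile definition are assumptions, and a real system would normally compute these in its metrics backend.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(records):
    """Collapse raw per-request records into a few aggregate metrics."""
    latencies = [r["latency_ms"] for r in records]
    confidences = [r["confidence"] for r in records]
    nulls = sum(r["user_tenure_days"] is None for r in records)
    return {
        "inference_latency_p99_ms": percentile(latencies, 99),
        "prediction_confidence_mean": sum(confidences) / len(confidences),
        "feature_null_rate_pct": 100 * nulls / len(records),
    }

# Hypothetical window of 100 requests with latencies 1..100 ms and two null features
records = [
    {"latency_ms": i, "confidence": 0.8, "user_tenure_days": None if i <= 2 else 42}
    for i in range(1, 101)
]
metrics = summarize(records)
print(metrics)
```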

2. Logs

Immutable event records capturing system state at specific timestamps.

{
  "timestamp": "2026-05-28T10:13:26Z",
  "model_version": "2.4.1",
  "request_id": "req_abc123",
  "input_features": { "user_tenure_days": 42, "device_type": "mobile" },
  "prediction": 0.87,
  "latency_ms": 34
}
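A record with this shape can be produced with the standard library alone. The helper below is a sketch: `build_prediction_log` and its `now` parameter (injected so the timestamp is testable) are hypothetical, and a real service would hand the resulting line to its logging pipeline.

```python
import json
from datetime import datetime, timezone

def build_prediction_log(model_version, request_id, input_features,
                         prediction, latency_ms, now=None):
    """Serialize one prediction event as a single JSON line."""
    # Normalize to UTC so the trailing "Z" is accurate.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    record = {
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "model_version": model_version,
        "request_id": request_id,
        "input_features": input_features,
        "prediction": prediction,
        "latency_ms": latency_ms,
    }
    return json.dumps(record)

line = build_prediction_log(
    model_version="2.4.1",
    request_id="req_abc123",
    input_features={"user_tenure_days": 42, "device_type": "mobile"},
    prediction=0.87,
    latency_ms=34,
)
print(line)
```

Logging the model version and input features alongside each prediction is what later makes it possible to replay a request or slice drift by version.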

3. Traces

End-to-end tracing of a single prediction request as it flows through the ML pipeline — from feature fetch → preprocessing → inference → postprocessing.

trace_id: abc123
├── feature_store_fetch 12ms
├── preprocessing 4ms
├── model_inference 28ms ← bottleneck
└── postprocessing 2ms
total: 46ms

Traces are often the most underused pillar in ML. When a p99 latency alert fires, metrics tell you that something is slow. Traces tell you which stage in the pipeline is responsible.
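The per-stage breakdown can be captured with a small timing helper. This is a toy sketch, not a replacement for a tracing system: the `Trace` class and its injectable `clock` are hypothetical, and the fake clock below simply replays the stage timings from the example trace so the output is deterministic.

```python
import time
from contextlib import contextmanager

class Trace:
    """Collects (stage name, duration in ms) spans for one request."""

    def __init__(self, trace_id, clock=time.perf_counter):
        self.trace_id = trace_id
        self.clock = clock  # returns seconds; injectable for testing
        self.spans = []

    @contextmanager
    def span(self, name):
        start = self.clock()
        try:
            yield
        finally:
            self.spans.append((name, (self.clock() - start) * 1000))

    def bottleneck(self):
        """Name of the slowest stage."""
        return max(self.spans, key=lambda s: s[1])[0]

    def total_ms(self):
        return sum(duration for _, duration in self.spans)

# Fake clock replaying the example timings: 12 ms, 4 ms, 28 ms, 2 ms
fake_clock = iter([0.000, 0.012, 0.012, 0.016, 0.016, 0.044, 0.044, 0.046]).__next__
trace = Trace("abc123", clock=fake_clock)
for stage in ["feature_store_fetch", "preprocessing", "model_inference", "postprocessing"]:
    with trace.span(stage):
        pass  # the real stage would run here

print(trace.bottleneck(), round(trace.total_ms()))
```

With the default `time.perf_counter` clock, the same class times real work placed inside each `with` block.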


What is the difference between monitoring and observability?

| | Monitoring | Observability |
| --- | --- | --- |
| Question answered | Is the system failing? | Why is the system failing? |
| Approach | Alert when a known metric crosses a threshold | Inspect internal state to diagnose unknown failures |
| Requires new code? | No: dashboards and alerts on existing signals | No: infer behavior from existing outputs |
| ML example | Alert when accuracy drops below 92% | Determine whether the drop is caused by drift, a pipeline bug, or a data quality issue |

Monitoring tells you when a system fails. Observability lets you infer why it failed from the telemetry the system already emits, without deploying new code.


How do you set up an SLO for an ML model?

An ML SLO should cover both the serving system and the model behavior:

# Example ML SLO definition
slos:
  serving:
    - name: inference_latency
      objective: 95% of predictions returned in under 50ms
      window: rolling_7d

  model_quality:
    - name: f1_score_stability
      objective: F1 score does not drop by more than 2% vs baseline
      window: rolling_7d
      baseline: weekly_eval_snapshot

    - name: prediction_confidence
      objective: Mean confidence stays within 1 stddev of training baseline
      window: rolling_24h

  data_quality:
    - name: feature_completeness
      objective: Null rate on critical features stays below 1%
      window: rolling_1h
Start with three SLOs: one for latency, one for model quality, one for data quality. Add more only after you've operationalized alerting and on-call response for the first three. Unenforced SLOs are worse than none — they erode trust in the observability system.
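A starter set of three SLOs can be checked with a few lines of code. The sketch below assumes the window's data has already been collected; `evaluate_slos` and its inputs are hypothetical, and it reads "does not drop by more than 2%" as a relative drop against the baseline, which is one possible interpretation.

```python
def evaluate_slos(latencies_ms, f1_current, f1_baseline, null_rate):
    """Return pass/fail for one latency, one quality, and one data SLO."""
    # Latency: share of predictions returned in under 50 ms
    within_budget = sum(l < 50 for l in latencies_ms) / len(latencies_ms)

    # Quality: relative F1 drop versus the baseline snapshot (assumed relative)
    f1_drop = (f1_baseline - f1_current) / f1_baseline

    return {
        "inference_latency": within_budget >= 0.95,
        "f1_score_stability": f1_drop <= 0.02,
        "feature_completeness": null_rate < 0.01,
    }

# Hypothetical window: 96 fast requests, 4 slow ones
latencies = [30] * 96 + [80] * 4
report = evaluate_slos(latencies, f1_current=0.89, f1_baseline=0.90, null_rate=0.005)
print(report)
```

Each failing entry should map to a named alert and an on-call response; otherwise the SLO is unenforced.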