Drift vs. Regression: Understanding AI Failures

People use "the model got worse" to describe two completely different events. We keep them separate, because the eval that catches one is blind to the other. Here is the drawing we put on the wall when someone conflates them.

Left — quality falls off a cliff at one release. Right — quality slides for weeks with no release at all.

Regression is an event.

A regression has a timestamp. Someone changed a prompt, a retrieval index, a model version, a downstream parser — and the system that worked on Tuesday is wrong on Wednesday. The shape on the graph is a cliff. The cause is inside your repository.

The eval for this is a gate. You run a fixed, frozen test set against every candidate release and you block the deploy if the score drops. The set never changes, because the whole point is to detect that you changed. A frozen suite is a feature here, not a flaw.

Drift is a process.

Drift has no timestamp. Nobody touched the system. The inputs changed — new slang, a new product line, a regulation that reworded the forms your model reads, a vendor who quietly retired an upstream model. The shape on the graph is a slope. The cause is outside your repository.

A frozen test set is blind to this. It still passes, every day, while real accuracy bleeds out in production. To catch drift you need a living signal — sampled production traffic, scored on a rolling window, watched for slow movement rather than a single bad commit.

Regression asks "did we break it?" Drift asks "is it still the right tool for a world that moved?" Same symptom, opposite questions.

Two failures, two evals.

Regression — caught at release, by a frozen gate, before customers see it. Cheap to fix: revert the change.
Drift — caught in production, by a rolling monitor, after the world has moved. Expensive to fix: re-ground the system in current reality.

The trap is using one eval and assuming it covers both. A team with a strong release gate and no drift monitor feels safe right up until a quiet six-week decline nobody deployed. A team watching only production trends can't tell a bad commit from a bad week.

We run both, and we keep them labelled. Boring, redundant, and the reason a regression never reaches a customer and a drift never surprises us for long.

— Silicon Prime team. May 2026.

Drift vs regression — two failures, two evals.

Regression is an event.

Drift is a process.

Two failures, two evals.

Comments