Drift vs. Regression: Understanding AI Failures

Failure Type	Detection	Evaluation Method	Fix Complexity
Regression	At release	Frozen gate	Cheap
Drift	In production	Rolling monitor	Expensive

Regression is an event with a timestamp: someone changed a prompt, retrieval index, model version, or parser, and a system that worked Tuesday is wrong Wednesday—the cause is inside your repository and looks like a cliff on the graph. Drift is a process with no timestamp: nobody touched the system, but the inputs changed and accuracy slowly bleeds out—the cause is outside your repository and looks like a slope.

A gate. You run a fixed, frozen test set against every candidate release and block the deploy if the score drops. The set never changes, because the whole point is to detect that you changed something. As the post notes, a frozen suite is a feature here, not a flaw—it gives you a stable reference to prove the system still behaves as before.

Because drift comes from the world changing, not your code. A frozen set still passes every day while real accuracy bleeds out in production—new slang, a new product line, a reworded regulation, or a quietly retired upstream model. To catch drift you need a living signal: sampled production traffic, scored on a rolling window, watched for slow movement rather than a single bad commit.

Assuming one eval covers both. A team with a strong release gate and no drift monitor feels safe right up until a quiet six-week decline nobody deployed. A team watching only production trends can't tell a bad commit from a bad week. Regression and drift are the same symptom asking opposite questions—'did we break it?' versus 'is it still the right tool for a world that moved?'—so they need two evals.

Regression is detected at release via a frozen gate and is cheap to fix because the cause is a specific change inside your repository. Drift is detected in production via a rolling monitor and is expensive to fix because the world moved and the system must be re-fitted to new inputs. The article summarizes this in a table contrasting detection point, evaluation method, and fix complexity.

They run both evals and keep them clearly labelled—a frozen release gate plus a rolling production monitor. The post calls this boring and redundant, and that's the point: it's the reason a regression never reaches a customer and a drift never surprises them for long. Each eval answers a different question, so neither substitutes for the other.

Changes outside your repository that shift the inputs: new slang in user language, a new product line the model hasn't seen, a regulation that reworded the forms your model reads, or a vendor who quietly retired an upstream model. Nobody touched your system, yet accuracy slowly declines. Because there's no deploy event to point at, only a rolling production monitor reveals the slope.

Drift vs regression — two failures, two evals.

Regression is an event. 📉

Drift is a process. 🌊

Two failures, two evals. ⚖️

Further Reading

🚀 Ready to Build with AI?

Frequently asked questions

Ready to turn AI experiments into measurable ROI?

Comments