Evals 101 — a taxonomy you can actually use.

Behavioral, regression, drift, and red-team evals — drawn out and labelled. Four boxes, four jobs, and what each one actually catches.

People say "we run evals" the way they say "we have tests." It means nothing until you say which kind. There are four that matter, and they catch different failures. Here they are, drawn out.

EVAL TAXONOMY 01 / BEHAVIORAL Does it do the job? Catches: wrong answers on the tasks you actually ship. The baseline. 02 / REGRESSION Did the last change break it? Catches: a fix or model swap that silently degrades a passing case. 03 / DRIFT Is reality moving under us? Catches: inputs and the world shifting while the model stays still. 04 / RED-TEAM What breaks it on purpose? Catches: abuse, injection, the inputs a hostile user reaches for. RUN EVERY RELEASE
Four eval types, the question each answers, and the failure each is built to catch.

The top row is about your code.

Behavioral is the baseline: does the system do the job on the tasks you actually ship? For a claims-triage model that means a fixed set of real claims with known-correct dispositions. It is the eval people build first and the one they over-trust — a high behavioral score tells you the system works on the cases you thought to write down, and nothing about the cases you didn't.

Regression asks a narrower, meaner question: did the change I am about to ship quietly break something that used to work? You swap a model version to fix a billing-code edge case, the eval passes the new case — and three unrelated cases that passed last week now fail. Without a regression suite you ship the trade and find out from a customer. The discipline that makes it work is a frozen set: the cases never change, because the entire point is to detect that the system changed, not the test.

The bottom row is about the world.

Drift is the one teams skip, because nothing in the repo moved. The model is byte-for-byte identical to last quarter and its accuracy is bleeding out anyway — a new product line nobody trained on, slang the classifier never saw, a form a regulator quietly reworded. A frozen suite is blind to this; it passes every day while production rots. Drift needs a living signal: sampled real traffic, scored on a rolling window, watched for a slope instead of a cliff.

Red-team assumes a person is actively trying to make the system misbehave — prompt injection through a pasted document, a jailbreak that talks the model out of its guardrails, an input crafted to exfiltrate another tenant's data. Behavioral asks "does it work for a cooperative user?" Red-team asks "what does it do for a hostile one?" The two find completely different failures, and a system that aces the first can fail the second on its first day in public.

Most teams run one of these — usually behavioral — and call it "evals." The gaps between the four boxes are where incidents live.

Where each one runs.

  • Behavioral & regression live in the release loop, as a gate. No green, no ship. They run on every candidate build, before a human ever sees it.
  • Drift runs on a schedule against live traffic, not on a commit — daily or hourly, never tied to a deploy, because the thing it watches has no deploy.
  • Red-team runs before launch and again every time the threat model changes: a new input channel, a new integration, a new class of user.

The one rule that makes any of it real.

A failing eval blocks something. Behavioral and regression block the release. Drift opens an incident. Red-team blocks the launch. If a number can drop without anyone being stopped or paged, it is not an eval — it is a chart, and charts have never once prevented an outage.

— Silicon Prime team. May 2026.

All posts Read next: From the Aegis war room (the one we don't use anymore)

Comments