Why most LLM apps regress silently
Traditional software has a simple contract between developer and runtime: code does what it says. Language models break that contract. The same prompt can return different output across calls, across model versions, and across the boundary of "we changed the system prompt slightly on Tuesday."
If you don't have a habit of catching this drift, you'll ship regressions for months without knowing.
Three habits we make non-negotiable
1. Evals before features
Before any prompt or model change, we add cases to a small but growing eval set. The set lives in the repo. CI runs it. PRs that lower the pass rate get a red comment. The eval set captures user-visible behavior, not abstract metrics.
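A minimal sketch of what that can look like as a pytest module, assuming the cases live in evals/cases.json and that run_model is whatever wraps your model call (both names are illustrative):

```python
# eval_smoke_test.py -- illustrative sketch; the file path and run_model hook are assumptions.
import json
from pathlib import Path

import pytest

# Each case is a user-visible check, e.g. {"input": "...", "must_contain": "..."}.
CASES = json.loads(Path("evals/cases.json").read_text())


def run_model(prompt: str) -> str:
    """Stub so the sketch is self-contained; replace with your real model call."""
    raise NotImplementedError


@pytest.mark.parametrize("case", CASES, ids=lambda c: c["input"][:40])
def test_eval_case(case):
    # Substring checks are the simplest form; swap in structured or judge-based checks as needed.
    output = run_model(case["input"])
    assert case["must_contain"].lower() in output.lower()
```

CI runs this suite on every PR and compares the pass rate against the main branch before commenting.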
2. Trace every call
Every model call is traced with its inputs, outputs, model id, prompt id, latency, and token counts. We use OpenTelemetry semantic conventions so tracing fits into our existing observability stack instead of being a separate AI dashboard.
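A sketch of what a traced call might look like with the OpenTelemetry Python SDK; the attribute names follow the (still experimental) GenAI semantic conventions, and call_model plus its return shape are placeholders for your own client:

```python
# Illustrative only: call_model and the dict it returns are assumptions, not a real client API.
from opentelemetry import trace

tracer = trace.get_tracer("my-app.llm")  # tracer name is illustrative


def traced_completion(call_model, prompt: str, prompt_id: str, model: str) -> str:
    """call_model(model=..., prompt=...) is assumed to return
    {"text", "model", "input_tokens", "output_tokens"}."""
    with tracer.start_as_current_span("llm.chat") as span:
        # Standard GenAI attribute names keep these spans queryable next to everything else.
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("app.prompt_id", prompt_id)  # app-specific attribute
        span.set_attribute("app.prompt", prompt)        # consider redacting in production

        response = call_model(model=model, prompt=prompt)  # latency = the span's own duration

        span.set_attribute("gen_ai.response.model", response["model"])
        span.set_attribute("gen_ai.usage.input_tokens", response["input_tokens"])
        span.set_attribute("gen_ai.usage.output_tokens", response["output_tokens"])
        span.set_attribute("app.completion", response["text"])  # same redaction caveat
        return response["text"]
```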
3. Have a real fallback
If the model is down, slow, or the response fails validation, the user gets a graceful fallback — never a spinner that times out. For most apps the fallback is "show the input, let them retry, and surface a clear error." For high-stakes apps it's a deterministic path that produces a worse-but-correct answer.
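A minimal sketch of that shape, with the model call, validator, and deterministic fallback all injected or stubbed, so every name here is an assumption:

```python
# fallback.py -- illustrative sketch; call_model, validate, and the fallback text are assumptions.
from dataclasses import dataclass


@dataclass
class Result:
    text: str
    degraded: bool  # True when we fell back to the deterministic path


def summarize(document: str, call_model, validate, timeout_s: float = 5.0) -> Result:
    """call_model and validate are passed in so the sketch stays self-contained."""
    try:
        draft = call_model(document, timeout=timeout_s)
        if validate(draft):
            return Result(text=draft, degraded=False)
    except Exception:
        # Model down, slow, or erroring; in production, log or trace the failure here.
        pass
    # Worse-but-correct deterministic answer: e.g. the first few sentences of the input.
    fallback_text = " ".join(document.split(". ")[:3])
    return Result(text=fallback_text, degraded=True)
```

The degraded flag is what lets the UI surface a clear error and a retry affordance instead of passing the fallback off as a full answer.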
The takeaway
Treat LLMs like external services. They're reliable enough to build on, and unreliable enough to need every monitoring habit you'd give a flaky third-party API.

