Why most LLM apps regress silently
Traditional software has a simple contract between developer and runtime: code does what it says. Language models break that contract. The same prompt can return different output across calls, across model versions, and across the boundary of "we changed the system prompt slightly on Tuesday."
If you don't have a habit of catching this drift, you'll ship regressions for months without knowing.
Three habits we make non-negotiable
1. Evals before features
Before any prompt or model change, we add cases to a small but growing eval set. The set lives in the repo. CI runs it. PRs that lower the pass rate get a red comment. The eval set captures user-visible behavior, not abstract metrics.
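A minimal sketch of what that can look like as a pytest module, assuming the cases live in evals/cases.json and that run_model is whatever wraps your model call (both names are illustrative):

```python
# eval_smoke_test.py -- illustrative sketch; the file path and run_model hook are assumptions.
import json
from pathlib import Path

import pytest

# Each case is a user-visible check, e.g. {"input": "...", "must_contain": "..."}.
CASES = json.loads(Path("evals/cases.json").read_text())


def run_model(prompt: str) -> str:
    """Stub so the sketch is self-contained; replace with your real model call."""
    raise NotImplementedError


@pytest.mark.parametrize("case", CASES, ids=lambda c: c["input"][:40])
def test_eval_case(case):
    # Substring checks are the simplest form; swap in structured or judge-based checks as needed.
    output = run_model(case["input"])
    assert case["must_contain"].lower() in output.lower()
```

CI runs this suite on every PR and compares the pass rate against the main branch before commenting.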
2. Trace every call
Every model call is traced with its inputs, outputs, model id, prompt id, latency, and token counts. We use OpenTelemetry semantic conventions so tracing fits into our existing observability stack instead of being a separate AI dashboard.
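A sketch of what a traced call might look like with the OpenTelemetry Python SDK; the attribute names follow the (still experimental) GenAI semantic conventions, and call_model plus its return shape are placeholders for your own client:

```python
# Illustrative only: call_model and the dict it returns are assumptions, not a real client API.
from opentelemetry import trace

tracer = trace.get_tracer("my-app.llm")  # tracer name is illustrative


def traced_completion(call_model, prompt: str, prompt_id: str, model: str) -> str:
    """call_model(model=..., prompt=...) is assumed to return
    {"text", "model", "input_tokens", "output_tokens"}."""
    with tracer.start_as_current_span("llm.chat") as span:
        # Standard GenAI attribute names keep these spans queryable next to everything else.
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("app.prompt_id", prompt_id)  # app-specific attribute
        span.set_attribute("app.prompt", prompt)        # consider redacting in production

        response = call_model(model=model, prompt=prompt)  # latency = the span's own duration

        span.set_attribute("gen_ai.response.model", response["model"])
        span.set_attribute("gen_ai.usage.input_tokens", response["input_tokens"])
        span.set_attribute("gen_ai.usage.output_tokens", response["output_tokens"])
        span.set_attribute("app.completion", response["text"])  # same redaction caveat
        return response["text"]
```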
3. Have a real fallback
If the model is down, slow, or the response fails validation, the user gets a graceful fallback — never a spinner that times out. For most apps the fallback is "show the input, let them retry, and surface a clear error." For high-stakes apps it's a deterministic path that produces a worse-but-correct answer.
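A minimal sketch of that shape, with the model call, validator, and deterministic fallback all injected or stubbed, so every name here is an assumption:

```python
# fallback.py -- illustrative sketch; call_model, validate, and the fallback text are assumptions.
from dataclasses import dataclass


@dataclass
class Result:
    text: str
    degraded: bool  # True when we fell back to the deterministic path


def summarize(document: str, call_model, validate, timeout_s: float = 5.0) -> Result:
    """call_model and validate are passed in so the sketch stays self-contained."""
    try:
        draft = call_model(document, timeout=timeout_s)
        if validate(draft):
            return Result(text=draft, degraded=False)
    except Exception:
        # Model down, slow, or erroring; in production, log or trace the failure here.
        pass
    # Worse-but-correct deterministic answer: e.g. the first few sentences of the input.
    fallback_text = " ".join(document.split(". ")[:3])
    return Result(text=fallback_text, degraded=True)
```

The degraded flag is what lets the UI surface a clear error and a retry affordance instead of passing the fallback off as a full answer.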
The takeaway
Treat LLMs like external services. They're reliable enough to build on, and unreliable enough to need every monitoring habit you'd give a flaky third-party API.

