Most AI teams know their demo quality. Very few know their reliability profile. That gap is expensive. This audit gives you a fast way to measure where your workflow fails under real conditions and choose one fix that reduces risk immediately.
When To Use This
Run this audit when:
- a workflow feels "mostly good" but breaks unpredictably,
- user trust is dropping despite decent output quality,
- or you are about to ship changes and want a sanity check.
Time budget: 15 minutes.
Step 1: Pick One Real Workflow (2 Minutes)
Choose one concrete path, not a generic capability. Examples:
- "Support assistant drafts a customer response from ticket + knowledge base."
- "Research assistant summarizes three documents with citations."
- "Ops assistant extracts structured fields from incident notes."
Narrow scope improves audit signal.
Step 2: Run Five Messy Inputs (5 Minutes)
Do not use polished prompts. Use realistic inputs:
- missing context,
- ambiguous request,
- contradictory information,
- outdated reference,
- malformed formatting.
Capture outputs without editing.
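The five failure modes above can be encoded as a reusable test set. This is a minimal sketch, assuming a Python harness; the case names, inputs, and `run_audit` helper are illustrative, not a prescribed format:

```python
# Illustrative messy-input set: one case per failure mode listed above.
MESSY_INPUTS = [
    {"case": "missing_context", "input": "Customer says it's broken again. Fix?"},
    {"case": "ambiguous_request", "input": "Can you sort out the billing thing from last week?"},
    {"case": "contradictory_info", "input": "Order #1 was refunded. Also, the order #1 refund was denied."},
    {"case": "outdated_reference", "input": "Per the 2021 pricing page, I should be on the $9 tier."},
    {"case": "malformed_formatting", "input": "sbject:HELP!!\tbody=\"cant   login\\n\\n(sent from phone"},
]

def run_audit(workflow_fn, cases):
    """Run each messy input through the workflow and capture raw, unedited outputs."""
    return [{"case": c["case"], "output": workflow_fn(c["input"])} for c in cases]
```

The point of keeping this as data rather than ad-hoc typing is repeatability: the same five inputs can be rerun after every change.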
Step 3: Score the Workflow (5 Minutes)
Use 1-5 scoring for each dimension.
| Dimension | 1 (Weak) | 3 (Mixed) | 5 (Strong) |
|---|---|---|---|
| Input Robustness | Breaks on minor noise | Handles moderate mess | Stable under noisy inputs |
| Output Contract | Often malformed/unusable | Usually usable with cleanup | Consistent, machine-usable |
| Evidence Quality | Claims without support | Partial evidence | Clear, verifiable support |
| Failure Visibility | Silent failure | Some signals | Clear uncertainty/failure signals |
| Recovery Path | No fallback | Manual ad-hoc recovery | Defined fallback path |
Total score range: 5-25.
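Tallying the scores can be sketched in a few lines. The dimension names mirror the table above; the example score values are assumptions for illustration:

```python
# Example scores for one audited workflow (1-5 per dimension).
scores = {
    "input_robustness": 2,
    "output_contract": 4,
    "evidence_quality": 3,
    "failure_visibility": 1,
    "recovery_path": 3,
}

total = sum(scores.values())          # falls in the 5-25 range
lowest = min(scores, key=scores.get)  # the weakest dimension drives the fix
```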
Step 4: Classify Risk Band (1 Minute)
Map your total score to a risk band:
- 5-12: High risk (not production reliable)
- 13-18: Moderate risk (needs guardrails)
- 19-25: Good baseline (continue hardening)
The total matters less than your lowest individual dimension.
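The band boundaries above translate directly to code. A minimal sketch, with band labels chosen here for illustration:

```python
def classify_risk(total: int) -> str:
    """Map a total audit score (5-25) to the risk bands above."""
    if not 5 <= total <= 25:
        raise ValueError("total must be between 5 and 25")
    if total <= 12:
        return "high"      # not production reliable
    if total <= 18:
        return "moderate"  # needs guardrails
    return "good"          # continue hardening
```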
Step 5: Apply One Guardrail (2 Minutes)
Pick one improvement you can implement this week:
- schema validation + retry on invalid structure,
- explicit citation requirement for key claims,
- confidence threshold that triggers human review,
- fallback template when retrieval quality is low,
- uncertainty banner instead of fabricated certainty.
One guardrail is better than five planned guardrails.
Common Failure Signatures
If you see these patterns, your workflow is not reliability-ready:
- Fluent wrong answers with no warning.
- Correct format but unsupported claims.
- Inconsistent behavior on near-identical prompts.
- Hidden failures that look successful at a glance.
Treat each as a systems issue, not a prompt-writing issue.
Team Version (Optional)
Run this as a 20-minute team ritual:
- One person submits inputs.
- One person scores independently.
- One person selects the guardrail owner.
Repeat weekly on critical workflows. Reliability compounds when audit cadence is consistent.
Output Template
Copy this into your notes:
Workflow:
Owner:
Date:
Scores:
- Input Robustness:
- Output Contract:
- Evidence Quality:
- Failure Visibility:
- Recovery Path:
Lowest Dimension:
Risk Band:
Guardrail This Week:
Expected Impact:
The Point
This is not a compliance exercise. It is a fast way to convert vague concern into concrete action. If your system can fail, make failure visible. If failure is visible, make recovery explicit. If recovery is explicit, users can trust the workflow even when it is not perfect.
That is reliability in practice.
