============================================================
nat.io // BLOG POST
============================================================
TITLE:  Open-Domain Evaluation Worksheet for Teams
DATE:   February 5, 2026
AUTHOR: Nat Currier
TAGS:   AI Engineering, Evaluation, Large Language Models, Systems Design
------------------------------------------------------------

Open-domain systems do not fail because they are unintelligent. They
fail because teams evaluate fluency instead of reliability. If your
assistant sounds polished but cites weak evidence, ignores ambiguity,
or fails silently, you do not have a capability problem. You have an
evaluation problem. This worksheet fixes that.

[ What This Worksheet Is For ]
------------------------------------------------------------

Use it to evaluate tasks where:
- goals are ambiguous,
- evidence is distributed across multiple sources,
- and correctness is not binary.

Examples:
- policy interpretation,
- market research synthesis,
- incident reconstruction,
- cross-document analysis.

[ Evaluation Setup (10 Minutes) ]
------------------------------------------------------------

Before scoring, define:
- Task statement (one sentence)
- Required outputs (format + decision use)
- Allowed sources (internal/external)
- Failure policy (when to escalate to a human)

If this setup is unclear, your scores will be noise.
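One way to make the setup step enforceable rather than aspirational is to treat it as a record that refuses to proceed while any field is blank. A minimal sketch; every name here is hypothetical, not part of the worksheet itself:

```python
from dataclasses import dataclass


@dataclass
class EvalSetup:
    """Hypothetical record for the pre-scoring setup step."""
    task_statement: str   # one sentence
    required_output: str  # format + decision use
    allowed_sources: str  # internal/external
    failure_policy: str   # when to escalate to a human

    def is_ready(self) -> bool:
        # Scores are noise unless every setup field is filled in.
        return all(
            field.strip()
            for field in (self.task_statement, self.required_output,
                          self.allowed_sources, self.failure_policy)
        )


setup = EvalSetup(
    task_statement="Summarize vendor risk across Q3 incident reports.",
    required_output="One-page brief feeding the renewal decision",
    allowed_sources="internal",
    failure_policy="",  # not defined yet, so not ready to score
)
print(setup.is_ready())  # -> False
```

The point of the check is social, not technical: a run whose setup record is incomplete should simply not be scored.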
[ Core Rubric (Score 1-5) ]
------------------------------------------------------------

| Dimension             | 1 (Weak)                      | 3 (Mixed)                          | 5 (Strong)                               |
| --------------------- | ----------------------------- | ---------------------------------- | ---------------------------------------- |
| Goal Clarification    | Assumes intent without checks | Partial clarification              | Explicitly clarifies ambiguities         |
| Retrieval Quality     | Misses key sources            | Finds some relevant sources        | Retrieves broad, relevant evidence       |
| Evidence Attribution  | Claims not traceable          | Partial citation coverage          | All key claims traceable                 |
| Conflict Handling     | Ignores conflicting evidence  | Notes conflicts without resolution | Surfaces and reconciles conflicts        |
| Uncertainty Signaling | Overconfident tone            | Occasional caveats                 | Clear confidence/uncertainty boundaries  |
| Actionability         | Generic summary               | Some actionable outputs            | Decision-ready recommendations           |
| Recovery Behavior     | Silent failure                | Manual intervention required       | Clear fallback/escalation path           |

Total score: 7-35.

[ Stress Inputs (Required) ]
------------------------------------------------------------

For each task, run at least three stress conditions:
- missing key document,
- contradictory source pair,
- outdated but plausible reference.

A workflow that only works in clean conditions is not open-domain ready.

[ Scoring Notes That Matter ]
------------------------------------------------------------

Do not store only numbers. Capture evidence for each score:
- "Which claim lacked support?"
- "Where was uncertainty hidden?"
- "Which source should have been retrieved but was not?"

Narrative notes turn scoring into engineering work.

[ Team Roles ]
------------------------------------------------------------

Use three-role evaluation for better objectivity:
- Operator: runs the task.
- Auditor: scores the rubric independently.
- Reviewer: decides one improvement for the next iteration.

Role separation reduces self-confirmation bias.

[ Improvement Loop ]
------------------------------------------------------------

After each run:
1. Identify the two lowest-scoring dimensions.
2. Select one targeted intervention.
3. Re-run the same task after the change.
4. Compare deltas, not absolute perfection.

Interventions may include:
- better retrieval filters,
- mandatory citation formatting,
- uncertainty thresholds,
- an explicit refusal policy for unsupported claims,
- fallback routing to human review.

[ Worksheet Template ]
------------------------------------------------------------

```text
Task:
Owner:
Date:

Setup:
- Required output:
- Allowed sources:
- Failure policy:

Rubric Scores:
- Goal Clarification:
- Retrieval Quality:
- Evidence Attribution:
- Conflict Handling:
- Uncertainty Signaling:
- Actionability:
- Recovery Behavior:

Stress Inputs Used:
- Missing source test:
- Conflict test:
- Outdated source test:

Lowest Dimensions:
Intervention This Sprint:
Re-test Date:
```

[ Maturity Bands ]
------------------------------------------------------------

Use the following bands to interpret your total score:
- 7-16:  Fragile (demo quality only)
- 17-25: Emerging (needs safeguards)
- 26-31: Strong baseline (production with monitoring)
- 32-35: Advanced (well-instrumented, resilient)

[ Final Principle ]
------------------------------------------------------------

Open-domain capability is not a model attribute. It is a systems
property shaped by retrieval quality, scoring rigor, and recovery
design. If you evaluate what matters, you can improve what matters.
If you only evaluate fluency, you will keep shipping polished failure.
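For teams that track runs in a spreadsheet export or a small script, the rubric total, the maturity bands, and the "two lowest dimensions" step of the improvement loop compose into a few lines. A minimal sketch; the dimension keys and helper names are hypothetical:

```python
# Hypothetical helper: total the seven rubric dimensions (each 1-5)
# and map the total onto the maturity bands from the worksheet.

RUBRIC_DIMENSIONS = (
    "goal_clarification", "retrieval_quality", "evidence_attribution",
    "conflict_handling", "uncertainty_signaling", "actionability",
    "recovery_behavior",
)

BANDS = (  # (inclusive upper bound on total, label)
    (16, "Fragile (demo quality only)"),
    (25, "Emerging (needs safeguards)"),
    (31, "Strong baseline (production with monitoring)"),
    (35, "Advanced (well-instrumented, resilient)"),
)


def maturity_band(scores: dict) -> str:
    missing = [d for d in RUBRIC_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    if any(not 1 <= scores[d] <= 5 for d in RUBRIC_DIMENSIONS):
        raise ValueError("each dimension must score 1-5")
    total = sum(scores[d] for d in RUBRIC_DIMENSIONS)
    label = next(name for bound, name in BANDS if total <= bound)
    return f"{total}/35: {label}"


run = dict(zip(RUBRIC_DIMENSIONS, (4, 3, 2, 3, 2, 4, 3)))
print(maturity_band(run))  # -> 21/35: Emerging (needs safeguards)

# Improvement loop, step 1: the two lowest-scoring dimensions.
lowest_two = sorted(run, key=run.get)[:2]
print(lowest_two)  # -> ['evidence_attribution', 'uncertainty_signaling']
```

The helper deliberately refuses to emit a band for an incomplete or out-of-range scorecard; a partial total would quietly land in the wrong band and hide exactly the kind of silent failure the rubric is meant to catch.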