Open-domain systems do not fail because they are unintelligent. They fail because teams evaluate fluency instead of reliability. If your assistant sounds polished but cites weak evidence, ignores ambiguity, or fails silently, you do not have a capability problem. You have an evaluation problem. This worksheet fixes that.

What This Worksheet Is For

Use it to evaluate tasks where:

  • goals are ambiguous,
  • evidence is distributed across multiple sources,
  • and correctness is not binary.

Examples:

  • policy interpretation,
  • market research synthesis,
  • incident reconstruction,
  • cross-document analysis.

Evaluation Setup (10 Minutes)

Before scoring, define:

  • Task statement (one sentence)
  • Required outputs (format + decision use)
  • Allowed sources (internal/external)
  • Failure policy (when to escalate to human)

If this setup is unclear, your scores will be noise.
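To make the setup concrete, here is a minimal sketch of those four fields as a record you can validate before any scoring run. `EvalSetup` and its field names are illustrative, not part of the worksheet itself:

```python
from dataclasses import dataclass

@dataclass
class EvalSetup:
    """Pre-scoring setup; any vague field here turns scores into noise."""
    task_statement: str          # one sentence
    required_outputs: str        # format plus how the decision will use them
    allowed_sources: list[str]   # internal/external corpora the run may touch
    failure_policy: str          # when to escalate to a human

    def is_complete(self) -> bool:
        # Refuse to score against empty or placeholder fields.
        return all([
            self.task_statement.strip(),
            self.required_outputs.strip(),
            self.allowed_sources,
            self.failure_policy.strip(),
        ])
```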

Core Rubric (Score 1-5)

| Dimension | 1 (Weak) | 3 (Mixed) | 5 (Strong) |
| --- | --- | --- | --- |
| Goal Clarification | Assumes intent without checks | Partial clarification | Explicitly clarifies ambiguities |
| Retrieval Quality | Misses key sources | Finds some relevant sources | Retrieves broad, relevant evidence |
| Evidence Attribution | Claims not traceable | Partial citation coverage | All key claims traceable |
| Conflict Handling | Ignores conflicting evidence | Notes conflicts without resolution | Surfaces and reconciles conflicts |
| Uncertainty Signaling | Overconfident tone | Occasional caveats | Clear confidence/uncertainty boundaries |
| Actionability | Generic summary | Some actionable outputs | Decision-ready recommendations |
| Recovery Behavior | Silent failure | Manual intervention required | Clear fallback/escalation path |

Total score: 7-35.
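The scoring arithmetic is simple enough to automate. A small sketch, assuming scores are kept as a plain dict; the dimension keys are snake_case versions of the rubric rows:

```python
DIMENSIONS = [
    "goal_clarification", "retrieval_quality", "evidence_attribution",
    "conflict_handling", "uncertainty_signaling", "actionability",
    "recovery_behavior",
]

def total_score(scores: dict[str, int]) -> int:
    """Validate one 1-5 score per dimension, then sum to the 7-35 total."""
    missing = set(DIMENSIONS) - scores.keys()
    if missing:
        raise ValueError(f"unscored dimensions: {sorted(missing)}")
    for dim, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{dim}: {value} is outside the 1-5 scale")
    return sum(scores[d] for d in DIMENSIONS)
```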

Stress Inputs (Required)

For each task, run at least three stress conditions:

  • missing key document,
  • contradictory source pair,
  • outdated but plausible reference.

A workflow that only works in clean conditions is not open-domain ready.
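One possible shape for the stress suite, assuming your pipeline exposes a run step and a scoring step. `run_task`, `score_run`, and the injected documents are hypothetical stand-ins for your own components, not a real API:

```python
def run_stress_suite(task, docs, run_task, score_run, conflict_doc, stale_doc):
    """Score the same task under the three degraded corpora listed above.

    run_task and score_run stand in for your pipeline entry point and
    rubric-scoring step; conflict_doc and stale_doc are the injected
    adversarial sources.
    """
    conditions = {
        "missing_source": docs[1:],                 # drop a key document
        "conflict_pair": [*docs, conflict_doc],     # add a contradictory source
        "outdated_reference": [*docs, stale_doc],   # add a stale, plausible one
    }
    return {
        name: score_run(run_task(task, corpus))     # full rubric, not pass/fail
        for name, corpus in conditions.items()
    }
```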

Scoring Notes That Matter

Do not store numbers alone. Capture evidence for each score:

  • "Which claim lacked support?"
  • "Where was uncertainty hidden?"
  • "Which source should have been retrieved but was not?"

Narrative notes turn scoring into engineering work.
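One way to keep the note attached to the number is a small record per dimension. A sketch; the example evidence string is illustrative, not real data:

```python
from dataclasses import dataclass

@dataclass
class ScoredDimension:
    name: str
    score: int      # 1-5
    evidence: str   # the note that makes the score actionable

# Illustrative record, not real data:
note = ScoredDimension(
    name="evidence_attribution",
    score=2,
    evidence="The churn claim cited no source; the retention report was "
             "in the allowed corpus but never retrieved.",
)
```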

Team Roles

Use a three-role evaluation to keep scoring objective:

  • Operator: runs the task.
  • Auditor: scores rubric independently.
  • Reviewer: decides one improvement for next iteration.

Separating roles reduces confirmation bias: the person who ran the task is not the person grading it.

Improvement Loop

After each run:

  1. Identify lowest two dimensions.
  2. Select one targeted intervention.
  3. Re-run same task after change.
  4. Compare deltas, not absolute perfection (see the sketch below).
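Steps 1 and 4 are mechanical enough to sketch. Both helpers assume the per-dimension score dict from the rubric sketch above:

```python
def lowest_dimensions(scores: dict[str, int], n: int = 2) -> list[str]:
    """Step 1: pick the n weakest rubric dimensions to target."""
    return sorted(scores, key=scores.get)[:n]

def deltas(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Step 4: per-dimension movement after one intervention. Progress is a
    positive delta on the targeted dimensions, not a perfect total."""
    return {dim: after[dim] - before[dim] for dim in before}
```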

Interventions may include:

  • better retrieval filters,
  • mandatory citation formatting,
  • uncertainty thresholds,
  • explicit refusal policy for unsupported claims,
  • fallback routing to human review.

Worksheet Template

Task:
Owner:
Date:

Setup:
- Required output:
- Allowed sources:
- Failure policy:

Rubric Scores:
- Goal Clarification:
- Retrieval Quality:
- Evidence Attribution:
- Conflict Handling:
- Uncertainty Signaling:
- Actionability:
- Recovery Behavior:

Stress Inputs Used:
- Missing source test:
- Conflict test:
- Outdated source test:

Lowest Dimensions:
Intervention This Sprint:
Re-test Date:
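If you track runs over time, the template is easy to store as a structured record. A minimal sketch, assuming one JSON file per run; the field names mirror the template and the filename is illustrative:

```python
import json

worksheet = {
    "task": "", "owner": "", "date": "",
    "setup": {
        "required_output": "",
        "allowed_sources": [],
        "failure_policy": "",
    },
    "rubric_scores": {
        "goal_clarification": None, "retrieval_quality": None,
        "evidence_attribution": None, "conflict_handling": None,
        "uncertainty_signaling": None, "actionability": None,
        "recovery_behavior": None,
    },
    "stress_inputs": {
        "missing_source_test": "", "conflict_test": "", "outdated_source_test": "",
    },
    "lowest_dimensions": [],
    "intervention_this_sprint": "",
    "retest_date": "",
}

# One file per run; version-controlling them makes deltas reviewable.
with open("worksheet-run-01.json", "w") as f:   # filename is illustrative
    json.dump(worksheet, f, indent=2)
```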

Maturity Bands

Use the following bands to interpret your score:

  • 7-16: Fragile (demo quality only)
  • 17-25: Emerging (needs safeguards)
  • 26-31: Strong baseline (production with monitoring)
  • 32-35: Advanced (well-instrumented, resilient)
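Banding is a threshold lookup over the 7-35 total; a sketch of the mapping above:

```python
def maturity_band(total: int) -> str:
    """Map the 7-35 rubric total onto the interpretation bands above."""
    if not 7 <= total <= 35:
        raise ValueError("total must be in 7-35")
    if total <= 16:
        return "Fragile (demo quality only)"
    if total <= 25:
        return "Emerging (needs safeguards)"
    if total <= 31:
        return "Strong baseline (production with monitoring)"
    return "Advanced (well-instrumented, resilient)"
```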

Final Principle

Open-domain capability is not a model attribute. It is a systems property shaped by retrieval quality, scoring rigor, and recovery design. If you evaluate what matters, you can improve what matters. If you only evaluate fluency, you will keep shipping polished failure.