Open-domain systems do not fail because they are unintelligent. They fail because teams evaluate fluency instead of reliability. If your assistant sounds polished but cites weak evidence, ignores ambiguity, or fails silently, you do not have a capability problem. You have an evaluation problem. This worksheet fixes that.

What This Worksheet Is For

Use it to evaluate tasks where:

  • goals are ambiguous,
  • evidence is distributed across multiple sources,
  • and correctness is not binary.

Examples:

  • policy interpretation,
  • market research synthesis,
  • incident reconstruction,
  • cross-document analysis.

Evaluation Setup (10 Minutes)

Before scoring, define:

  • Task statement (one sentence)
  • Required outputs (format + decision use)
  • Allowed sources (internal/external)
  • Failure policy (when to escalate to human)

If this setup is unclear, your scores will be noise.
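To make the setup concrete, here is a minimal sketch of those four fields as a record you can validate before any scoring run. `EvalSetup` and its field names are illustrative, not part of the worksheet itself:

```python
from dataclasses import dataclass

@dataclass
class EvalSetup:
    """Pre-scoring setup; any vague field here turns scores into noise."""
    task_statement: str          # one sentence
    required_outputs: str        # format plus how the decision will use them
    allowed_sources: list[str]   # internal/external corpora the run may touch
    failure_policy: str          # when to escalate to a human

    def is_complete(self) -> bool:
        # Refuse to score against empty or placeholder fields.
        return all([
            self.task_statement.strip(),
            self.required_outputs.strip(),
            self.allowed_sources,
            self.failure_policy.strip(),
        ])
```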

Core Rubric (Score 1-5)

| Dimension | 1 (Weak) | 3 (Mixed) | 5 (Strong) |
| --- | --- | --- | --- |
| Goal Clarification | Assumes intent without checks | Partial clarification | Explicitly clarifies ambiguities |
| Retrieval Quality | Misses key sources | Finds some relevant sources | Retrieves broad, relevant evidence |
| Evidence Attribution | Claims not traceable | Partial citation coverage | All key claims traceable |
| Conflict Handling | Ignores conflicting evidence | Notes conflicts without resolution | Surfaces and reconciles conflicts |
| Uncertainty Signaling | Overconfident tone | Occasional caveats | Clear confidence/uncertainty boundaries |
| Actionability | Generic summary | Some actionable outputs | Decision-ready recommendations |
| Recovery Behavior | Silent failure | Manual intervention required | Clear fallback/escalation path |

Total score: 7-35.
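The scoring arithmetic is simple enough to automate. A small sketch, assuming scores are kept as a plain dict; the dimension keys are snake_case versions of the rubric rows:

```python
DIMENSIONS = [
    "goal_clarification", "retrieval_quality", "evidence_attribution",
    "conflict_handling", "uncertainty_signaling", "actionability",
    "recovery_behavior",
]

def total_score(scores: dict[str, int]) -> int:
    """Validate one 1-5 score per dimension, then sum to the 7-35 total."""
    missing = set(DIMENSIONS) - scores.keys()
    if missing:
        raise ValueError(f"unscored dimensions: {sorted(missing)}")
    for dim, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{dim}: {value} is outside the 1-5 scale")
    return sum(scores[d] for d in DIMENSIONS)
```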

Stress Inputs (Required)

For each task, run at least three stress conditions:

  • missing key document,
  • contradictory source pair,
  • outdated but plausible reference.

A workflow that only works in clean conditions is not open-domain ready.
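One possible shape for the stress suite, assuming your pipeline exposes a run step and a scoring step. `run_task`, `score_run`, and the injected documents are hypothetical stand-ins for your own components, not a real API:

```python
def run_stress_suite(task, docs, run_task, score_run, conflict_doc, stale_doc):
    """Score the same task under the three degraded corpora listed above.

    run_task and score_run stand in for your pipeline entry point and
    rubric-scoring step; conflict_doc and stale_doc are the injected
    adversarial sources.
    """
    conditions = {
        "missing_source": docs[1:],                 # drop a key document
        "conflict_pair": [*docs, conflict_doc],     # add a contradictory source
        "outdated_reference": [*docs, stale_doc],   # add a stale, plausible one
    }
    return {
        name: score_run(run_task(task, corpus))     # full rubric, not pass/fail
        for name, corpus in conditions.items()
    }
```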

Scoring Notes That Matter

Do not store numbers alone. Capture evidence for each score:

  • "Which claim lacked support?"
  • "Where was uncertainty hidden?"
  • "Which source should have been retrieved but was not?"

Narrative notes turn scoring into engineering work.
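One way to keep the note attached to the number is a small record per dimension. A sketch; the example evidence string is illustrative, not real data:

```python
from dataclasses import dataclass

@dataclass
class ScoredDimension:
    name: str
    score: int      # 1-5
    evidence: str   # the note that makes the score actionable

# Illustrative record, not real data:
note = ScoredDimension(
    name="evidence_attribution",
    score=2,
    evidence="The churn claim cited no source; the retention report was "
             "in the allowed corpus but never retrieved.",
)
```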

Team Roles

Use a three-role evaluation to keep scoring objective:

  • Operator: runs the task.
  • Auditor: scores rubric independently.
  • Reviewer: decides one improvement for next iteration.

Separating roles reduces confirmation bias: the person who ran the task is not the person grading it.

Improvement Loop

After each run:

  1. Identify lowest two dimensions.
  2. Select one targeted intervention.
  3. Re-run same task after change.
  4. Compare deltas, not absolute perfection (see the sketch below).
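Steps 1 and 4 are mechanical enough to sketch. Both helpers assume the per-dimension score dict from the rubric sketch above:

```python
def lowest_dimensions(scores: dict[str, int], n: int = 2) -> list[str]:
    """Step 1: pick the n weakest rubric dimensions to target."""
    return sorted(scores, key=scores.get)[:n]

def deltas(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Step 4: per-dimension movement after one intervention. Progress is a
    positive delta on the targeted dimensions, not a perfect total."""
    return {dim: after[dim] - before[dim] for dim in before}
```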

Interventions may include:

  • better retrieval filters,
  • mandatory citation formatting,
  • uncertainty thresholds,
  • explicit refusal policy for unsupported claims,
  • fallback routing to human review.

Worksheet Template

Task:
Owner:
Date:

Setup:
- Required output:
- Allowed sources:
- Failure policy:

Rubric Scores:
- Goal Clarification:
- Retrieval Quality:
- Evidence Attribution:
- Conflict Handling:
- Uncertainty Signaling:
- Actionability:
- Recovery Behavior:

Stress Inputs Used:
- Missing source test:
- Conflict test:
- Outdated source test:

Lowest Dimensions:
Intervention This Sprint:
Re-test Date:
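If you track runs over time, the template is easy to store as a structured record. A minimal sketch, assuming one JSON file per run; the field names mirror the template and the filename is illustrative:

```python
import json

worksheet = {
    "task": "", "owner": "", "date": "",
    "setup": {
        "required_output": "",
        "allowed_sources": [],
        "failure_policy": "",
    },
    "rubric_scores": {
        "goal_clarification": None, "retrieval_quality": None,
        "evidence_attribution": None, "conflict_handling": None,
        "uncertainty_signaling": None, "actionability": None,
        "recovery_behavior": None,
    },
    "stress_inputs": {
        "missing_source_test": "", "conflict_test": "", "outdated_source_test": "",
    },
    "lowest_dimensions": [],
    "intervention_this_sprint": "",
    "retest_date": "",
}

# One file per run; version-controlling them makes deltas reviewable.
with open("worksheet-run-01.json", "w") as f:   # filename is illustrative
    json.dump(worksheet, f, indent=2)
```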

Maturity Bands

Use the following bands to interpret your score:

  • 7-16: Fragile (demo quality only)
  • 17-25: Emerging (needs safeguards)
  • 26-31: Strong baseline (production with monitoring)
  • 32-35: Advanced (well-instrumented, resilient)
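Banding is a threshold lookup over the 7-35 total; a sketch of the mapping above:

```python
def maturity_band(total: int) -> str:
    """Map the 7-35 rubric total onto the interpretation bands above."""
    if not 7 <= total <= 35:
        raise ValueError("total must be in 7-35")
    if total <= 16:
        return "Fragile (demo quality only)"
    if total <= 25:
        return "Emerging (needs safeguards)"
    if total <= 31:
        return "Strong baseline (production with monitoring)"
    return "Advanced (well-instrumented, resilient)"
```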

Final Principle

Open-domain capability is not a model attribute. It is a systems property shaped by retrieval quality, scoring rigor, and recovery design. If you evaluate what matters, you can improve what matters. If you only evaluate fluency, you will keep shipping polished failure.